Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
From: Peter Lieven
Subject: Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
Date: Tue, 26 Mar 2013 09:14:51 +0100
On 25.03.2013 at 15:34, Paolo Bonzini <address@hidden> wrote:
>
> Hmm, right. What about just processing the first few longs twice, i.e.
> the above followed by "for (i = 0; i < len / sizeof(VECTYPE); i
> += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR)"?
I tested this version as v3:
size_t buffer_find_nonzero_offset_v3(const void *buf, size_t len)
{
    VECTYPE *p = (VECTYPE *)buf;
    unsigned long *tmp = (unsigned long *)buf;
    VECTYPE zero = ZERO_SPLAT;
    size_t i;

    assert(can_use_buffer_find_nonzero_offset(buf, len));

    if (!len) {
        return 0;
    }

    if (tmp[0]) {
        return 0;
    }
    if (tmp[1]) {
        return 1 * sizeof(unsigned long);
    }
    if (tmp[2]) {
        return 2 * sizeof(unsigned long);
    }
    if (tmp[3]) {
        return 3 * sizeof(unsigned long);
    }

    for (i = 0; i < len / sizeof(VECTYPE);
         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE tmp0 = p[i + 0] | p[i + 1];
        VECTYPE tmp1 = p[i + 2] | p[i + 3];
        VECTYPE tmp2 = p[i + 4] | p[i + 5];
        VECTYPE tmp3 = p[i + 6] | p[i + 7];
        VECTYPE tmp01 = tmp0 | tmp1;
        VECTYPE tmp23 = tmp2 | tmp3;
        if (!ALL_EQ(tmp01 | tmp23, zero)) {
            break;
        }
    }

    return i * sizeof(VECTYPE);
}
For reference this is v2:
size_t buffer_find_nonzero_offset_v2(const void *buf, size_t len)
{
    VECTYPE *p = (VECTYPE *)buf;
    VECTYPE zero = ZERO_SPLAT;
    size_t i;

    assert(can_use_buffer_find_nonzero_offset(buf, len));

    if (!len) {
        return 0;
    }

    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }

    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
         i < len / sizeof(VECTYPE);
         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE tmp0 = p[i + 0] | p[i + 1];
        VECTYPE tmp1 = p[i + 2] | p[i + 3];
        VECTYPE tmp2 = p[i + 4] | p[i + 5];
        VECTYPE tmp3 = p[i + 6] | p[i + 7];
        VECTYPE tmp01 = tmp0 | tmp1;
        VECTYPE tmp23 = tmp2 | tmp3;
        if (!ALL_EQ(tmp01 | tmp23, zero)) {
            break;
        }
    }

    return i * sizeof(VECTYPE);
}
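To try v2 outside the QEMU tree, the macros it depends on have to be supplied by hand. The following is only a sketch for the "unsigned long arithmetic" case: the definitions of VECTYPE, ZERO_SPLAT, ALL_EQ, the unroll factor of 8 and the body of can_use_buffer_find_nonzero_offset are assumptions made for illustration, not copied from the patch set, and offset_for_nonzero_byte is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed stand-ins for the macros used above (no SSE2 here). */
#define VECTYPE unsigned long
#define ZERO_SPLAT 0UL
#define ALL_EQ(v1, v2) ((v1) == (v2))
#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8

/* Assumed precondition: buf is VECTYPE-aligned and len is a multiple
 * of one unrolled block (unroll factor * sizeof(VECTYPE)). */
static bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
{
    return len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
                  * sizeof(VECTYPE)) == 0
           && (uintptr_t)buf % sizeof(VECTYPE) == 0;
}

static size_t buffer_find_nonzero_offset_v2(const void *buf, size_t len)
{
    const VECTYPE *p = buf;
    const VECTYPE zero = ZERO_SPLAT;
    size_t i;

    assert(can_use_buffer_find_nonzero_offset(buf, len));
    if (!len) {
        return 0;
    }

    /* First block: check each chunk on its own for an early exit. */
    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }

    /* Remaining blocks: OR eight chunks together, one compare per block. */
    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
         i < len / sizeof(VECTYPE);
         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE acc = p[i + 0] | p[i + 1] | p[i + 2] | p[i + 3]
                    | p[i + 4] | p[i + 5] | p[i + 6] | p[i + 7];
        if (!ALL_EQ(acc, zero)) {
            break;
        }
    }
    return i * sizeof(VECTYPE);
}

/* Hypothetical helper: zero a 4k page, set the byte at pos (if pos is
 * inside the page) and return the offset reported by v2. */
static size_t offset_for_nonzero_byte(size_t pos)
{
    static unsigned long page[4096 / sizeof(unsigned long)];

    memset(page, 0, sizeof(page));
    if (pos < sizeof(page)) {
        ((unsigned char *)page)[pos] = 1;
    }
    return buffer_find_nonzero_offset_v2(page, sizeof(page));
}
```

Inside the first block the returned offset is exact to the chunk. Later on it is only the start of the 8-chunk block that contains non-zero data, and a return value equal to len means the whole buffer is zero.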
I ran 3 x 2 tests, each on 1 GB of memory with 256 iterations of checking every
4k page for zero.
1) all pages zero
a) SSE2
is_zero_page: res=67108864 (ticks 3289 user 1 system)
is_zero_page_v2: res=67108864 (ticks 3326 user 0 system)
is_zero_page_v3: res=67108864 (ticks 3305 user 3 system)
is_dup_page: res=67108864 (ticks 3648 user 1 system)
b) unsigned long arithmetic
is_zero_page: res=67108864 (ticks 3474 user 3 system)
is_zero_page_2: res=67108864 (ticks 3516 user 1 system)
is_zero_page_3: res=67108864 (ticks 3525 user 3 system)
is_dup_page: res=67108864 (ticks 3826 user 4 system)
2) all pages non-zero, but first 64 bits of each page zero
a) SSE2
is_zero_page: res=0 (ticks 251 user 0 system)
is_zero_page_v2: res=0 (ticks 87 user 0 system)
is_zero_page_v3: res=0 (ticks 91 user 0 system)
is_dup_page: res=0 (ticks 82 user 0 system)
b) unsigned long arithmetic
is_zero_page: res=0 (ticks 209 user 0 system)
is_zero_page_v2: res=0 (ticks 89 user 0 system)
is_zero_page_v3: res=0 (ticks 88 user 0 system)
is_dup_page: res=0 (ticks 88 user 0 system)
3) all pages non-zero, but first 256 bits of each page zero
a) SSE2
is_zero_pages: res=0 (ticks 260 user 0 system)
is_zero_pages_2: res=0 (ticks 199 user 0 system)
is_zero_pages_3: res=0 (ticks 342 user 0 system)
is_dup_pages: res=0 (ticks 223 user 0 system)
b) unsigned long arithmetic
is_zero_pages: res=0 (ticks 230 user 0 system)
is_zero_pages_2: res=0 (ticks 194 user 0 system)
is_zero_pages_3: res=0 (ticks 280 user 0 system)
is_dup_pages: res=0 (ticks 191 user 0 system)
---
is_zero_page is the version from patch set v4.
is_zero_page_2 checks the first 8 VECTYPE-sized chunks (8 * sizeof(VECTYPE)
bytes) one by one and then continues 8 chunks at a time without double-checks.
is_zero_page_3 is the v3 version shown above.
is_dup_page is the old implementation.
All compiled with gcc -O3
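For anyone who wants to reproduce the shape of these runs, here is a rough sketch of such a harness. It is not the program that produced the numbers above: run_pattern, TEST_PAGE_SIZE and the plain word-by-word is_zero_page are hypothetical stand-ins, and clock() is used for timing as an assumption, so the tick values are not comparable.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define TEST_PAGE_SIZE 4096

/* Stand-in for the function under test: plain word-by-word check. */
static int is_zero_page(const void *page)
{
    const unsigned long *p = page;
    size_t i;

    for (i = 0; i < TEST_PAGE_SIZE / sizeof(*p); i++) {
        if (p[i]) {
            return 0;
        }
    }
    return 1;
}

/* Build `pages` pages whose first zero_prefix bytes are zero and whose
 * remaining bytes are 0xff, then check every page `iterations` times.
 * zero_prefix == TEST_PAGE_SIZE gives the "all pages zero" case.
 * Returns the number of checks that reported a zero page ("res"). */
static unsigned long run_pattern(size_t pages, int iterations,
                                 size_t zero_prefix)
{
    unsigned char *buf = calloc(pages, TEST_PAGE_SIZE);
    unsigned long res = 0;
    clock_t start;
    size_t pg;
    int it;

    if (!buf) {
        abort();
    }
    if (zero_prefix < TEST_PAGE_SIZE) {
        for (pg = 0; pg < pages; pg++) {
            memset(buf + pg * TEST_PAGE_SIZE + zero_prefix, 0xff,
                   TEST_PAGE_SIZE - zero_prefix);
        }
    }

    start = clock();
    for (it = 0; it < iterations; it++) {
        for (pg = 0; pg < pages; pg++) {
            res += is_zero_page(buf + pg * TEST_PAGE_SIZE);
        }
    }
    printf("res=%lu (cpu ticks %ld)\n", res, (long)(clock() - start));

    free(buf);
    return res;
}
```

With 262144 pages (1 GB) and 256 iterations, the all-zero case counts 262144 * 256 = 67108864 zero pages, which matches the res value reported in test 1. The patterns of tests 2 and 3 correspond to zero_prefix values of 8 and 32 bytes.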
If no one objects, I would use is_zero_page_2 and continue with v5 of the patch
set, as I am out of office for the next 8 days starting tomorrow. I prefer it
over v3 because it performs better when the first non-zero data lies within the
first 8 * sizeof(VECTYPE) bytes but not within the first 256 bits.
Paolo, with the version that has lower setup costs in mind, should I use the
vectorized or the unrolled version of patch 4 (find_next_bit optimization)?
Peter