[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-arm] [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero
From: |
Vijay Kilari |
Subject: |
Re: [Qemu-arm] [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking |
Date: |
Tue, 5 Jul 2016 17:54:18 +0530 |
On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson <address@hidden> wrote:
> On 06/30/2016 06:45 AM, Peter Maydell wrote:
>>
>> On 29 June 2016 at 09:47, <address@hidden> wrote:
>>>
>>> From: Vijay <address@hidden>
>>>
>>> Use Neon instructions to perform zero checking of
>>> buffer. This is helps in reducing total migration time.
>>
>>
>>> diff --git a/util/cutils.c b/util/cutils.c
>>> index 5830a68..4779403 100644
>>> --- a/util/cutils.c
>>> +++ b/util/cutils.c
>>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>> #define SPLAT(p) _mm_set1_epi8(*(p))
>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>> 0xFFFF)
>>> #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>>> +#elif __aarch64__
>>> +#include "arm_neon.h"
>>> +#define VECTYPE uint64x2_t
>>> +#define ALL_EQ(v1, v2) \
>>> + ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>>> + (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>>
>>
>> Should be '#elif defined(__aarch64__)'. I have made this
>> tweak and put this patch in target-arm.next.
>
>
> Consider
>
> #define VECTYPE uint32x4_t
> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)
>
>
> which compiles down to
>
> 1c: 6e211c00 eor v0.16b, v0.16b, v1.16b
> 20: 6eb0a800 umaxv s0, v0.4s
> 24: 1e260000 fmov w0, s0
> 28: 6b1f001f cmp w0, wzr
> 2c: 1a9f17e0 cset w0, eq
> 30: d65f03c0 ret
For me this code compiles as below and migration time is ~100ms more.
See below 3 trails of migration time
7039cc: 6eb0a800 umaxv s0, v0.4s
7039d0: 0e043c02 mov w2, v0.s[0]
7039d4: 350000c2 cbnz w2, 7039ec <f0+0xf4>
7039d8: 91002084 add x4, x4, #0x8
7039dc: 91020063 add x3, x3, #0x80
7039e0: eb01009f cmp x4, x1
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3070 milliseconds
downtime: 55 milliseconds
setup: 4 milliseconds
transferred ram: 300637 kbytes
throughput: 802.49 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062834 pages
skipped: 0 pages
normal: 70489 pages
normal bytes: 281956 kbytes
dirty sync count: 3
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 47 milliseconds
setup: 5 milliseconds
transferred ram: 290277 kbytes
throughput: 775.61 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2064185 pages
skipped: 0 pages
normal: 67901 pages
normal bytes: 271604 kbytes
dirty sync count: 3
(qemu)
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 34 milliseconds
setup: 5 milliseconds
transferred ram: 294614 kbytes
throughput: 787.19 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2063365 pages
skipped: 0 pages
normal: 68985 pages
normal bytes: 275940 kbytes
dirty sync count: 3
>
> vs
>
> 34: 4e083c20 mov x0, v1.d[0]
> 38: 4e083c01 mov x1, v0.d[0]
> 3c: eb00003f cmp x1, x0
> 40: 52800000 mov w0, #0
> 44: 54000040 b.eq 4c <f0+0x18>
> 48: d65f03c0 ret
> 4c: 4e183c20 mov x0, v1.d[1]
> 50: 4e183c01 mov x1, v0.d[1]
> 54: eb00003f cmp x1, x0
> 58: 1a9f17e0 cset w0, eq
> 5c: d65f03c0 ret
>
My patch compiles to below code and takes ~100ms less time
#define VECTYPE uint64x2_t
#define ALL_EQ(v1, v2) \
((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
(vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
7039d0: 4e083c02 mov x2, v0.d[0]
7039d4: b5000102 cbnz x2, 7039f4 <f0+0xfc>
7039d8: 4e183c02 mov x2, v0.d[1]
7039dc: b50000c2 cbnz x2, 7039f4 <f0+0xfc>
7039e0: 91002084 add x4, x4, #0x8
7039e4: 91020063 add x3, x3, #0x80
7039e8: eb04003f cmp x1, x4
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2973 milliseconds
downtime: 67 milliseconds
setup: 5 milliseconds
transferred ram: 293659 kbytes
throughput: 809.45 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062791 pages
skipped: 0 pages
normal: 68748 pages
normal bytes: 274992 kbytes
dirty sync count: 3
(qemu)
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2972 milliseconds
downtime: 47 milliseconds
setup: 5 milliseconds
transferred ram: 295972 kbytes
throughput: 816.10 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062861 pages
skipped: 0 pages
normal: 69325 pages
normal bytes: 277300 kbytes
dirty sync count: 3
(qemu)
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2982 milliseconds
downtime: 40 milliseconds
setup: 5 milliseconds
transferred ram: 293386 kbytes
throughput: 806.26 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2063199 pages
skipped: 0 pages
normal: 68679 pages
normal bytes: 274716 kbytes
dirty sync count: 4
(qemu)
Regards
Vijay