[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-arm] [Qemu-devel] [RFC PATCH v1 2/2] utils: Add prefetch for T
From: |
Vijay Kilari |
Subject: |
Re: [Qemu-arm] [Qemu-devel] [RFC PATCH v1 2/2] utils: Add prefetch for Thunderx platform |
Date: |
Fri, 12 Aug 2016 17:02:00 +0530 |
On Sat, Aug 6, 2016 at 3:47 PM, Richard Henderson <address@hidden> wrote:
> On 08/02/2016 03:50 PM, address@hidden wrote:
>>
>> +#define VEC_PREFETCH(base, index) \
>> + asm volatile ("prfm pldl1strm, [%x[a]]\n" : :
>> [a]"r"(&base[(index)]))
>
>
> Is this not __builtin_prefetch(base + index) ?
>
> I.e. you can defined this generically for all targets.
__builtin_prefetch() is available only in gcc 5.3 for arm64.
>
>> +#if defined (__aarch64__)
>> + do_prefetch = is_thunder_pass2_cpu();
>> + if (do_prefetch) {
>> + VEC_PREFETCH(p, 8);
>> + VEC_PREFETCH(p, 16);
>> + VEC_PREFETCH(p, 24);
>> + }
>> +#endif
>> +
>> for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>> i < len / sizeof(VECTYPE);
>> i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>> +
>> +#if defined (__aarch64__)
>> + if (do_prefetch) {
>> + VEC_PREFETCH(p, i+32);
>> + VEC_PREFETCH(p, i+40);
>> + }
>> +#endif
>> +
>
>
> Surely we shouldn't be adding a conditional around a prefetch inside of the
> inner loop. Does it really hurt to perform this prefetch for all aarch64
> cpus?
prefetch is only required for thunderx pass2 platform. I will remove this
condition check.
>
> I'll note that you're also prefetching too much, off the end of the block,
> and that you're probably not prefetching far enough. You'd need to break
> off the last iteration(s) of the loop.
>
> I'll note that you're also prefetching too close. The loop operates on
> 8*vecsize units. In the case of aarch64, 128 byte units. Both i+32 and
128 unit is specific to thunder. I will move this to thunder
specific function
> i+40 are within the current loop. I believe you want to prefetch at
>
I am dropping i+40
> i + BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * N
>
> where N is the number of iterations in advance to be fetched. Probably N is
> 1 or 2, unless the memory controller is really slow.
>
>
> r~