qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 2/2] util: add util function buffer_zero_avx512()


From: Robert Hoo
Subject: Re: [PATCH 2/2] util: add util function buffer_zero_avx512()
Date: Mon, 24 Feb 2020 15:07:46 +0800

Thanks Richard:-)
Sorry for late reply.
On Thu, 2020-02-13 at 10:20 -0800, Richard Henderson wrote:
> On 2/12/20 11:52 PM, Robert Hoo wrote:
> > And initialize buffer_is_zero() with it, when Intel AVX512F is
> > available on host.
> > 
> > This function utilizes Intel AVX512 fundamental instructions which
> > perform over previous AVX2 instructions.
> 
> Is it not still true that any AVX512 insn will cause the entire cpu
> package,
> not just the current core, to drop frequency by 20%?
> 
> As far as I know one should only use the 512-bit instructions when
> you can
> overcome that frequency drop, which seems unlikely in this
> case.  That said...
> I don't think so. AVX512 has been applied in various places.
> > +    if (unlikely(len < 64)) { /*buff less than 512 bits,
> > unlikely*/
> > +        return buffer_zero_int(buf, len);
> > +    }
> 
> First, len < 64 has been eliminated already in select_accel_fn.
> Second, len < 256 is not handled properly by the code below...
> 
Right. I'm going to fix this in v2.
> 
> > +    /* Begin with an unaligned head of 64 bytes.  */
> > +    t = _mm512_loadu_si512(buf);
> > +    p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64);
> > +    e = (__m512i *)(((uintptr_t)buf + len) & -64);
> > +
> > +    /* Loop over 64-byte aligned blocks of 256.  */
> > +    while (p < e) {
> > +        __builtin_prefetch(p);
> > +        if (unlikely(_mm512_test_epi64_mask(t, t))) {
> > +            return false;
> > +        }
> > +        t = p[-4] | p[-3] | p[-2] | p[-1];
> > +        p += 4;
> > +    }
> > +
> > +    t |= _mm512_loadu_si512(buf + len - 4 * 64);
> > +    t |= _mm512_loadu_si512(buf + len - 3 * 64);
> > +    t |= _mm512_loadu_si512(buf + len - 2 * 64);
> > +    t |= _mm512_loadu_si512(buf + len - 1 * 64);
> 
> ... because this final sequence loads 256 bytes.
> 
> Rather than make a second test vs 256 in buffer_zero_avx512, I wonder
> if it
> would be better to have select_accel_fn do the job.  Have a global
> variable
> buffer_accel_size alongside buffer_accel so there's only one branch
> (mis)predict to worry about.
> 
Thanks Richard, very enlightening!
Inspired by your suggestion, I'm thinking go further: use immediate
rather than a global variable, so that saves 1 memory(/cache) access. 

#ifdef CONFIG_AVX512F_OPT   
#define OPTIMIZE_LEN    256
#else
#define OPTIMIZE_LEN    64
#endif
> FWIW, something that the compiler should do, but doesn't currently,
> is use
> vpternlogq to perform a 3-input OR.  Something like
> 
>     /* 0xfe -> orABC */
>     t = _mm512_ternarylogic_epi64(t, p[-4], p[-3], 0xfe);
>     t = _mm512_ternarylogic_epi64(t, p[-2], p[-1], 0xfe);
> 
Very enlightening. Yes, seems compiler doesn't do this.
I tried explicitly use this, however, looks it will have more
instructions generated, and unit test shows it performs less than then
conventional code.
Let me keep the conventional code for this moment, will ask around and
dig further outside this patch.

> 
> r~




reply via email to

[Prev in Thread] Current Thread [Next in Thread]