Re: [PATCH 2/2] util: add util function buffer_zero_avx512()

qemu-devel

From:	Richard Henderson
Subject:	Re: [PATCH 2/2] util: add util function buffer_zero_avx512()
Date:	Thu, 13 Feb 2020 10:20:36 -0800
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.1

On 2/12/20 11:52 PM, Robert Hoo wrote:
> And initialize buffer_is_zero() with it, when Intel AVX512F is
> available on host.
> 
> This function utilizes Intel AVX512 fundamental instructions which
> perform over previous AVX2 instructions.

Is it not still true that any AVX512 insn will cause the entire cpu package,
not just the current core, to drop frequency by 20%?

As far as I know, one should only use the 512-bit instructions when the speedup
can overcome that frequency drop, which seems unlikely in this case.  That said...

> +    if (unlikely(len < 64)) { /*buff less than 512 bits, unlikely*/
> +        return buffer_zero_int(buf, len);
> +    }

First, len < 64 has been eliminated already in select_accel_fn.
Second, len < 256 is not handled properly by the code below...

> +    /* Begin with an unaligned head of 64 bytes.  */
> +    t = _mm512_loadu_si512(buf);
> +    p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64);
> +    e = (__m512i *)(((uintptr_t)buf + len) & -64);
> +
> +    /* Loop over 64-byte aligned blocks of 256.  */
> +    while (p < e) {
> +        __builtin_prefetch(p);
> +        if (unlikely(_mm512_test_epi64_mask(t, t))) {
> +            return false;
> +        }
> +        t = p[-4] | p[-3] | p[-2] | p[-1];
> +        p += 4;
> +    }
> +
> +    t |= _mm512_loadu_si512(buf + len - 4 * 64);
> +    t |= _mm512_loadu_si512(buf + len - 3 * 64);
> +    t |= _mm512_loadu_si512(buf + len - 2 * 64);
> +    t |= _mm512_loadu_si512(buf + len - 1 * 64);

... because this final sequence loads 256 bytes.
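
To make the problem concrete, here is a minimal sketch of the tail-load
arithmetic.  The names tail_load_offset, tail_loads_in_bounds, VEC_BYTES and
TAIL_VECS are made up for illustration and are not in the patch; the only thing
taken from the patch is the expression buf + len - k * 64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VEC_BYTES 64   /* one 512-bit vector */
#define TAIL_VECS 4    /* the quoted patch issues four tail loads */

/*
 * Offset, relative to buf, of the k-th tail load (k = 4..1), mirroring
 * "buf + len - k * 64" from the quoted patch.  The result is signed so
 * that a load starting before buf shows up as a negative offset.
 */
static ptrdiff_t tail_load_offset(size_t len, int k)
{
    assert(k >= 1 && k <= TAIL_VECS);
    return (ptrdiff_t)len - (ptrdiff_t)k * VEC_BYTES;
}

/* True if every 64-byte tail load stays inside [buf, buf + len). */
static bool tail_loads_in_bounds(size_t len)
{
    for (int k = TAIL_VECS; k >= 1; k--) {
        ptrdiff_t off = tail_load_offset(len, k);
        if (off < 0 || (size_t)off + VEC_BYTES > len) {
            return false;
        }
    }
    return true;
}
```

With len = 64 the first tail load would start 192 bytes before buf; the four
loads only stay in bounds once len >= 256.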

Rather than make a second test vs 256 in buffer_zero_avx512, I wonder if it
would be better to have select_accel_fn do the job.  Have a global variable
buffer_accel_size alongside buffer_accel so there's only one branch
(mis)predict to worry about.
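
A rough sketch of that idea follows.  Only the names select_accel_fn,
buffer_accel, buffer_accel_size and buffer_zero_int come from this thread;
buffer_zero_wide, wide_calls and init_accel_sketch (with its have_avx512f
argument) are hypothetical stand-ins, and buffer_zero_int is reimplemented as a
trivial byte loop so the example is self-contained.  None of this is the actual
QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the scalar fallback; the real one is more elaborate. */
static bool buffer_zero_int(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Hypothetical counter, only here so the dispatch can be observed. */
static unsigned wide_calls;

/*
 * Hypothetical stand-in for a 512-bit routine.  It relies on the
 * dispatcher to guarantee len >= 256, so it needs no length test
 * of its own.
 */
static bool buffer_zero_wide(const void *buf, size_t len)
{
    assert(len >= 256);
    wide_calls++;
    return buffer_zero_int(buf, len);
}

/* The selected routine and the minimum length it can handle. */
static bool (*buffer_accel)(const void *, size_t) = buffer_zero_int;
static size_t buffer_accel_size = 64;

/* Assumed init hook: pick the routine and its threshold together. */
static void init_accel_sketch(bool have_avx512f)
{
    if (have_avx512f) {
        buffer_accel = buffer_zero_wide;
        buffer_accel_size = 256;
    } else {
        buffer_accel = buffer_zero_int;
        buffer_accel_size = 64;
    }
}

/* Single length test: short buffers never reach the wide routine. */
static bool select_accel_fn(const void *buf, size_t len)
{
    if (len >= buffer_accel_size) {
        return buffer_accel(buf, len);
    }
    return buffer_zero_int(buf, len);
}
```

The point is that the threshold travels with the function pointer, so the
accelerated routine can drop its own "len < N" check.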

FWIW, something that the compiler should do, but doesn't currently, is use
vpternlogq to perform a 3-input OR.  Something like

    /* 0xfe -> orABC */
    t = _mm512_ternarylogic_epi64(t, p[-4], p[-3], 0xfe);
    t = _mm512_ternarylogic_epi64(t, p[-2], p[-1], 0xfe);
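
For reference, the 0xfe immediate is just the truth table of a | b | c, with
the three input bits forming the index (a << 2) | (b << 1) | c into imm8.  A
scalar sketch of that, which needs no AVX512 hardware; ternlog_imm8, or3,
ternlog64 and check_or3_imm are hypothetical helper names, and ternlog64 only
models a single 64-bit lane:

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned (*bool3_fn)(unsigned a, unsigned b, unsigned c);

/* Build the imm8 truth table for a 3-input boolean function. */
static uint8_t ternlog_imm8(bool3_fn f)
{
    uint8_t imm = 0;

    for (unsigned idx = 0; idx < 8; idx++) {
        unsigned a = (idx >> 2) & 1;
        unsigned b = (idx >> 1) & 1;
        unsigned c = idx & 1;

        if (f(a, b, c) & 1) {
            imm |= (uint8_t)(1u << idx);
        }
    }
    return imm;
}

static unsigned or3(unsigned a, unsigned b, unsigned c)
{
    return a | b | c;
}

/* Scalar model of one 64-bit lane: look up each bit position in imm. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm)
{
    uint64_t r = 0;

    for (int bit = 0; bit < 64; bit++) {
        unsigned idx = (unsigned)((((a >> bit) & 1) << 2) |
                                  (((b >> bit) & 1) << 1) |
                                  ((c >> bit) & 1));
        r |= (uint64_t)((imm >> idx) & 1) << bit;
    }
    return r;
}

/* Only the all-zero input row yields 0, hence every bit but bit 0. */
static void check_or3_imm(void)
{
    assert(ternlog_imm8(or3) == 0xfe);
}
```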

r~
