Re: [Qemu-devel] [PATCH v5 33/35] target/arm: Implement SVE dot product
From: Richard Henderson
Subject: Re: [Qemu-devel] [PATCH v5 33/35] target/arm: Implement SVE dot product (indexed)
Date: Tue, 26 Jun 2018 09:17:52 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0

On 06/26/2018 08:30 AM, Peter Maydell wrote:
> On 21 June 2018 at 02:53, Richard Henderson
> <address@hidden> wrote:
>> Signed-off-by: Richard Henderson <address@hidden>
>> ---
>> target/arm/helper.h | 5 ++
>> target/arm/translate-sve.c | 18 +++++++
>> target/arm/vec_helper.c | 96 ++++++++++++++++++++++++++++++++++++++
>> target/arm/sve.decode | 8 +++-
>> 4 files changed, 126 insertions(+), 1 deletion(-)
>>
>
>> +void HELPER(gvec_sdot_idx_b)(void *vd, void *vn, void *vm, uint32_t desc)
>> +{
>> +    intptr_t i, j, opr_sz = simd_oprsz(desc), opr_sz_4 = opr_sz / 4;
>> +    intptr_t index = simd_data(desc);
>> +    uint32_t *d = vd;
>> +    int8_t *n = vn, *m = vm;
>> +
>> +    for (i = 0; i < opr_sz_4; i = j) {
>> +        int8_t m0 = m[(i + index) * 4 + 0];
>> +        int8_t m1 = m[(i + index) * 4 + 1];
>> +        int8_t m2 = m[(i + index) * 4 + 2];
>> +        int8_t m3 = m[(i + index) * 4 + 3];
>> +
>> +        j = i;
>> +        do {
>> +            d[j] += n[j * 4 + 0] * m0
>> +                  + n[j * 4 + 1] * m1
>> +                  + n[j * 4 + 2] * m2
>> +                  + n[j * 4 + 3] * m3;
>> +        } while (++j < MIN(i + 4, opr_sz_4));
>> +    }
>> +    clear_tail(d, opr_sz, simd_maxsz(desc));
>> +}
>
> Maybe I'm just half asleep this afternoon, but this is pretty
> confusing -- nested loops where the outer loop's increment
> uses the inner loop's index, and the inner loop's conditions
> depend on the outer loop index...

Yeah, well.

There is an edge case of aa64 advsimd, reusing this same helper,

    sdot v0.2s, v1.8b, v0.4b[0]

where m values must be read (and held) before writing d results,
and there are not 16/4=4 elements to process but only 2.

I suppose I could special-case oprsz == 8 in order to simplify
iteration of what is otherwise a multiple of 16.

I thought iterating J from I to I+4 was easier to read than
writing out I+J everywhere. Perhaps not.

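
For comparison, here is a rough standalone sketch of the "I+J everywhere"
form, with the short oprsz == 8 case handled by clamping the inner count.
It is not the code from the patch: the name sdot_idx_b_sketch, the plain
pointer arguments and the size_t parameters are made up for illustration,
and the clear_tail step is left out. It assumes opr_sz is either 8 or a
multiple of 16, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Hypothetical model of the indexed signed dot product.
 * opr_sz is the operand size in bytes; index selects one 32-bit group
 * of m within each 128-bit segment. d is allowed to alias m.
 */
static void sdot_idx_b_sketch(uint32_t *d, const int8_t *n, const int8_t *m,
                              size_t opr_sz, size_t index)
{
    size_t nelem = opr_sz / 4;      /* number of 32-bit result elements */
    size_t i, j;

    assert(opr_sz == 8 || opr_sz % 16 == 0);

    for (i = 0; i < nelem; i += 4) {
        /* Read and hold the m values before any d element is written. */
        int8_t m0 = m[(i + index) * 4 + 0];
        int8_t m1 = m[(i + index) * 4 + 1];
        int8_t m2 = m[(i + index) * 4 + 2];
        int8_t m3 = m[(i + index) * 4 + 3];
        /* 4 elements per segment, except 2 when opr_sz == 8. */
        size_t count = MIN((size_t)4, nelem - i);

        for (j = 0; j < count; j++) {
            d[i + j] += n[(i + j) * 4 + 0] * m0
                      + n[(i + j) * 4 + 1] * m1
                      + n[(i + j) * 4 + 2] * m2
                      + n[(i + j) * 4 + 3] * m3;
        }
    }
}
```

Calling this with d and m pointing at the same buffer and opr_sz == 8
models the sdot v0.2s, v1.8b, v0.4b[0] case: the second result still uses
the m bytes as they were before the first result overwrote them.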
>> -DOT_zzz 01000100 1 sz:1 0 rm:5 00000 u:1 rn:5 rd:5
>> +DOT_zzz 01000100 1 sz:1 0 rm:5 00000 u:1 rn:5 rd:5
>> ra=%reg_movprfx
>
> Should this have been in the previous patch ?
Yes, thanks.
r~