qemu-arm
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-arm] [PATCH 18/42] target/arm: Convert VFP VMLA to decodetree


From: Richard Henderson
Subject: Re: [Qemu-arm] [PATCH 18/42] target/arm: Convert VFP VMLA to decodetree
Date: Sat, 8 Jun 2019 09:14:59 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.0

On 6/6/19 12:45 PM, Peter Maydell wrote:
> Convert the VFP VMLA instruction to decodetree.
> 
> This is the first of the VFP 3-operand data processing instructions,
> so we include in this patch the code which loops over the elements
> for an old-style VFP vector operation. The existing code to do this
> looping uses the deprecated cpu_F0s/F0d/F1s/F1d TCG globals; since
> we are going to be converting instructions one at a time anyway
> we can take the opportunity to make the new loop use TCG temporaries,
> which means we can do that conversion one operation at a time
> rather than needing to do it all in one go.
> 
> We include an UNDEF check which was missing in the old code:
> short-vector operations (with stride or length non-zero) were
> deprecated in v7A and must UNDEF in v8A, so if the MVFR0 FPShVec
> field does not indicate that support for short vectors is present
> we UNDEF the operations that would use them. (This is a change
> of behaviour for Cortex-A7, Cortex-A15 and the v8 CPUs, which
> previously were all incorrectly allowing short-vector operations.)
> 
> Note that the conversion fixes a bug in the old code for the
> case of VFP short-vector "mixed scalar/vector operations". These
> happen where the destination register is in a vector bank but
> but the second operand is in a scalar bank. For example
>   vmla.f64 d10, d1, d16   with length 2 stride 2
> is equivalent to the pair of scalar operations
>   vmla.f64 d10, d1, d16
>   vmla.f64 d8, d3, d16
> where the destination and first input register cycle through
> their vector but the second input is scalar (d16). In the
> old decoder the gen_vfp_F1_mul() operation uses cpu_F1{s,d}
> as a temporary output for the multiply, which trashes the
> second input operand. For the fully-scalar case (where we
> never do a second iteration) and the fully-vector case
> (where the loop loads the new second input operand) this
> doesn't matter, but for the mixed scalar/vector case we
> will end up using the wrong value for later loop iterations.
> In the new code we use TCG temporaries and so avoid the bug.
> This bug is present for all the multiply-accumulate insns
> that operate on short vectors: VMLA, VMLS, VNMLA, VNMLS.
> 
> Note 2: the expression used to calculate the next register
> number in the vector bank is not in fact correct; we leave
> this behaviour unchanged from the old decoder and will
> fix this bug later in the series.
> 
> Signed-off-by: Peter Maydell <address@hidden>
> ---
>  target/arm/cpu.h               |   5 +
>  target/arm/translate-vfp.inc.c | 205 +++++++++++++++++++++++++++++++++
>  target/arm/translate.c         |  14 ++-
>  target/arm/vfp.decode          |   6 +
>  4 files changed, 224 insertions(+), 6 deletions(-)

Reviewed-by: Richard Henderson <address@hidden>


r~




reply via email to

[Prev in Thread] Current Thread [Next in Thread]