Re: [PATCH v3 26/51] target/arm: Implement SME LD1, ST1


From: Richard Henderson
Subject: Re: [PATCH v3 26/51] target/arm: Implement SME LD1, ST1
Date: Thu, 23 Jun 2022 13:36:42 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1

On 6/23/22 04:41, Peter Maydell wrote:
>> +/*
>> + * FIXME: The ARMVectorReg elements are stored in host-endian 64-bit units.
>> + * We do not have a defined ordering of the 64-bit units for host-endian
>> + * 128-bit quantities.  For now, just leave the host words in little-endian
>> + * order and hope for the best.
>> + */

> I don't understand this comment. The architecture ought to specify
> what order the two halves of a 128-bit quantity ought to go in the
> ZArray storage. If we can't guarantee that a host int128_t can be
> handled in a way that does the right thing, then we just can't
> use int128_t.

Indeed. The spec treats the storage as an array of bits, and the way the indexing is done is equivalent to placing the two 64-bit quantities in little-endian order in our array. I'll improve the comment.


>> +static inline QEMU_ALWAYS_INLINE
>> +void sme_ld1(CPUARMState *env, void *za, uint64_t *vg,
>> +             const target_ulong addr, uint32_t desc, const uintptr_t ra,
>> +             const int esz, uint32_t mtedesc, bool vertical,
>> +             sve_ldst1_host_fn *host_fn,
>> +             sve_ldst1_tlb_fn *tlb_fn,
>> +             ClearFn *clr_fn,
>> +             CopyFn *cpy_fn)

> We now have several rather long functions that are
> pretty complicated and pretty similar handling the various
> SVE and SME loads and stores. Is there really no hope for
> sharing code ?

I'm not sure.  Maybe.  The two issues are:

(1) sve ld4/st4 -- arm didn't make z29-z31 illegal, but instead defined wraparound to z0. So I pass those routines a Zreg number, not a pointer, which means they can't be reused for ZArray without changing that.

(2) sme ld1/st1 vertical slice, which significantly alters the spacing between 
the elements.

One possibility for these cases is to perform the load into env->some_scratch_space, then copy into the final position afterward, and the reverse for stores.

Is that preferable?  Do you see another alternative?


>> +    t_za = get_tile_rowcol(s, a->esz, a->rs, a->za_imm, a->v);
>> +    t_pg = pred_full_reg_ptr(s, a->pg);
>> +    addr = tcg_temp_new_i64();
>> +
>> +    tcg_gen_shli_i64(addr, cpu_reg(s, a->rn), a->esz);
>> +    tcg_gen_add_i64(addr, addr, cpu_reg_sp(s, a->rm));

> Aren't rm and rn the wrong way around here?  The pseudocode
> says that rn is the base (can be SP, not scaled) and rm is
> the offset (can be XZR, scaled by mbytes).

Yep, thanks.


r~


