qemu-devel

From: Richard Henderson
Subject: Re: [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions
Date: Tue, 27 Aug 2019 12:14:41 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.8.0

On 8/27/19 2:37 AM, Stefan Brankovic wrote:
> +    for (i = 0; i < 2; i++) {
> +        if (i == 0) {
> +            /* Get high doubleword of vB in avr. */
> +            get_avr64(avr, VB, true);
> +        } else {
> +            /* Get low doubleword of vB in avr. */
> +            get_avr64(avr, VB, false);
> +        }
> +        /*
> +         * Perform the count for every byte element using tcg_gen_clzi_i64.
> +         * Since it counts leading zeros over a 64-bit length, we have to
> +         * move the ith byte element into the highest 8 bits of tmp, OR it
> +         * with mask (so we get all ones in the lowest 56 bits), then perform
> +         * tcg_gen_clzi_i64 and move its result into the appropriate byte
> +         * element of result.
> +         */
> +        tcg_gen_shli_i64(tmp, avr, 56);
> +        tcg_gen_or_i64(tmp, tmp, mask);
> +        tcg_gen_clzi_i64(result, tmp, 64);
> +        for (j = 1; j < 7; j++) {
> +            tcg_gen_shli_i64(tmp, avr, (7 - j) * 8);
> +            tcg_gen_or_i64(tmp, tmp, mask);
> +            tcg_gen_clzi_i64(tmp, tmp, 64);
> +            tcg_gen_deposit_i64(result, result, tmp, j * 8, 8);
> +        }
> +        tcg_gen_or_i64(tmp, avr, mask);
> +        tcg_gen_clzi_i64(tmp, tmp, 64);
> +        tcg_gen_deposit_i64(result, result, tmp, 56, 8);
> +        if (i == 0) {
> +            /* Place result in high doubleword element of vD. */
> +            tcg_gen_mov_i64(result1, result);
> +        } else {
> +            /* Place result in low doubleword element of vD. */
> +            tcg_gen_mov_i64(result2, result);
> +        }
> +    }

By my count, 60 non-move operations.  This is too many to inline.

Moreover, unlike vpkpx, which I can see being used for graphics
format conversion in old operating systems (who else uses 16-bit
graphics formats now?), I would be very surprised to see vclzb
or vclzh being used frequently.

How did you determine that these instructions needed optimization?

I can see wanting to apply

--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1817,8 +1817,8 @@ VUPK(lsw, s64, s32, UPKLO)
         }                                                               \
     }

-#define clzb(v) ((v) ? clz32((uint32_t)(v) << 24) : 8)
-#define clzh(v) ((v) ? clz32((uint32_t)(v) << 16) : 16)
+#define clzb(v) clz32(((uint32_t)(v) << 24) | 0x00ffffffu)
+#define clzh(v) clz32(((uint32_t)(v) << 16) | 0x0000ffffu)

 VGENERIC_DO(clzb, u8)
 VGENERIC_DO(clzh, u16)

as the cmov instruction required by the current implementation is going to be
quite a bit slower than the OR instruction.

And similarly for ctzb() and ctzh().


r~


