On Wed, Oct 13, 2021 at 3:13 PM Philipp Tomsich
<philipp.tomsich@vrull.eu> wrote:
I had a much simpler version initially (using 3 x mask/shift/or, for
12 instructions after setup of constants), but took up the suggestion
to optimize based on haszero(v)...
Indeed, this does not appear to do what we expect when a byte has only
0x01 set in it.
The less optimized form, using a single constant, that still does
what we want is:
/* set high-bit for non-zero bytes */
constant = dup_const_tl(MO_8, 0x7f);
tmp = v & constant; // AND
tmp += constant; // ADD
tmp |= v; // OR
/* move the high bit of each byte down to its low bit */
tmp &= ~constant; // ANDC
tmp >>= 7; // SHR
/* multiply by 0xff to populate the entire byte where the low bit is set */
tmp *= 0xff; // MUL
I'll submit a patch with this one later today, once I have had a chance
to run it through a full test.
Thanks for the insight.
I have tried it, implemented as:
```
static void gen_orc_b(TCGv ret, TCGv source1)
{
    TCGv tmp = tcg_temp_new();
    TCGv constant = tcg_constant_tl(dup_const_tl(MO_8, 0x7f));

    /* set high-bit for non-zero bytes */
    tcg_gen_and_tl(tmp, source1, constant);
    tcg_gen_add_tl(tmp, tmp, constant);
    tcg_gen_or_tl(tmp, tmp, source1);
    /* extract high-bit to low-bit, for each byte */
    tcg_gen_andc_tl(tmp, tmp, constant);
    tcg_gen_shri_tl(tmp, tmp, 7);
    /* Replicate the lsb of each byte across the byte. */
    tcg_gen_muli_tl(ret, tmp, 0xff);

    tcg_temp_free(tmp);
}
```
It does pass my own test sequences.