Re: [Qemu-devel] [RFC] Streamlining endian handling in TCG

qemu-devel

From:	Richard Henderson
Subject:	Re: [Qemu-devel] [RFC] Streamlining endian handling in TCG
Date:	Tue, 03 Sep 2013 08:11:15 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130805 Thunderbird/17.0.8

On 09/02/2013 04:42 PM, Aurelien Jarno wrote:
> On Wed, Aug 28, 2013 at 08:26:43AM -0700, Richard Henderson wrote:
>> (1) I want explicit _i32 and _i64 sizes for the loads and stores.  This will
>> clean up a number of places in several translators where we have to load to
>> _tl and then truncate or extend to an explicit size.
> 
> I guess you mean there that it would still be possible to do a
> qemu_ld32u for a _i64 size? 

Of course.

> Also it should be the moment to clean the big mess with qemu_ld32 for
> 32-bit guests vs qemu_ld32/qemu_ld32u/qemu_ld32s for 64-bit guests.

Yes, I would think that would happen more or less automatically by having
the separate _i32 and _i64 opcodes.

>> (2) I want explicit endianness for the loads and stores.  E.g. when a sparc
>> guest does a byte-swapped store, there's little point in doing two offsetting
>> bswaps to make that happen.
> 
> That's indeed something which would be nice to fix. This is also the
> case of powerpc which has a byte-swapped ld/st instruction.

And s390, and x86...  ;-)

>> (3) For hosts that do not support byte-swapped loads and stores themselves,
>> the need to allocate extra registers during the memory operation in order to
>> hold the swapped results is an unnecessary burden.  Better to expose the bswap
>> operation at the tcg opcode level and let normal register allocation happen.
> 
> I don't fully agree with that point. For load ops, the byte swap is
> basically done in place, not using any additional register. For store
> ops, the bswap has to be done before, but if the value to be stored is
> not used later in the TB, no additional register is used.

And if split apart, the bswap would still generally be done in place, just
because that's what register allocation will tend to do.

Consider the register constraints on the store op.  E.g. arm:

    /* qemu_st address & data_reg */
    case 's':
        ct->ct |= TCG_CT_REG;
        tcg_regset_set32(ct->u.regs, 0, (1 << TCG_TARGET_NB_REGS) - 1);
        /* r0-r2 will be overwritten when reading the tlb entry (softmmu only)
           and r0-r1 doing the byte swapping, so don't use these. */
        tcg_regset_reset_reg(ct->u.regs, TCG_REG_R0);
        tcg_regset_reset_reg(ct->u.regs, TCG_REG_R1);

If we split out the bswap, then user mode doesn't have to reserve r0/r1,
the register allocator will just DTRT.  Similarly for i386.

> I think this can be done quite quickly, as the conversion is basically a
> matter of find and replace while you have identified if _tl is _i32 or
> _i64. That would leave a few non-optimized cases like loads with _i32
> later converted to _i64, but that's better than having two interfaces.

Fair enough.

> We should probably define short constants to define that, and why not
> the 32 possible constants in a few letters.

Sure.

> Also the question is do we want to use little-endian/big-endian or
> native-endian/cross-endian?

Both have their appeal.  Although perhaps with appropriate defines, we can have
the best of both.

#define LDST_BSWAP  8

#ifdef HOST_BIG_ENDIAN
# define LDST_LE  LDST_BSWAP
# define LDST_BE  0
#else
# define LDST_LE  0
# define LDST_BE  LDST_BSWAP
#endif
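
As a minimal sketch of how those defines could be consumed (this is an
illustration, not code from the tree): the compiler's __BYTE_ORDER__ macros
stand in for HOST_BIG_ENDIAN, and ld32() and bswap32() are hypothetical
names invented for the example.  The point is that a caller can ask for LE
or BE by name, while the implementation only ever tests the single
native/cross bit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LDST_BSWAP  8

/* Assumption: GCC/Clang predefined macros stand in for HOST_BIG_ENDIAN.
   If they are missing we fall through to the little-endian branch. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
# define LDST_LE  LDST_BSWAP
# define LDST_BE  0
#else
# define LDST_LE  0
# define LDST_BE  LDST_BSWAP
#endif

/* Portable 32-bit byte swap (hypothetical helper for this sketch). */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Hypothetical load: read 4 bytes in host order, then swap only when
   the requested endianness differs from the host's. */
static uint32_t ld32(const void *p, int flags)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return (flags & LDST_BSWAP) ? bswap32(v) : v;
}
```

With this shape, ld32(p, LDST_LE) and ld32(p, LDST_BE) name the guest-visible
byte order, and exactly one of the two compiles down to a plain load on any
given host.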

>>      if (tlb hit) {
>>          t = bswap(data);
>>          store t;
>>      } else {
>>          helper_store_be(data);
>>      }
>>
>> If we hoist the bswap it'll need to be
>>
>>      t = bswap(data);
>>      if (tlb hit) {
>>          store t;
>>      } else {
>>          helper_store_le(t);
>>      }
> 
> I am not sure it is worth adding additional complexity (and difference
> between targets) there, it's basically writing the code corresponding
> to qemu_ld* code in each TCG target.

I was just illustrating the reason for more helpers.
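
To make that concrete, here is a small stand-alone model of the two shapes
for a big-endian guest store.  It is only a sketch under stated assumptions:
the host is taken to be little-endian, guest memory is a plain byte buffer,
the tlb lookup is reduced to a bool, and the helper signatures are invented
for the example.  What it shows is that once the bswap is hoisted, the slow
path receives already-swapped data and therefore needs the helper of the
opposite endianness to end up with the same bytes in memory.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Portable 32-bit byte swap (hypothetical helper for this sketch). */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Slow-path stand-ins: each writes its argument in a fixed byte order. */
static void helper_store_be(uint8_t *mem, uint32_t v)
{
    mem[0] = (uint8_t)(v >> 24);
    mem[1] = (uint8_t)(v >> 16);
    mem[2] = (uint8_t)(v >> 8);
    mem[3] = (uint8_t)v;
}

static void helper_store_le(uint8_t *mem, uint32_t v)
{
    mem[0] = (uint8_t)v;
    mem[1] = (uint8_t)(v >> 8);
    mem[2] = (uint8_t)(v >> 16);
    mem[3] = (uint8_t)(v >> 24);
}

/* bswap inside the fast path: the helper still sees the original data. */
static void store_be_inline(uint8_t *mem, uint32_t data, bool tlb_hit)
{
    if (tlb_hit) {
        uint32_t t = bswap32(data);
        memcpy(mem, &t, sizeof(t));     /* host-native store */
    } else {
        helper_store_be(mem, data);
    }
}

/* bswap hoisted: both paths see swapped data, so the slow path must
   use the little-endian helper (assuming a little-endian host). */
static void store_be_hoisted(uint8_t *mem, uint32_t data, bool tlb_hit)
{
    uint32_t t = bswap32(data);

    if (tlb_hit) {
        memcpy(mem, &t, sizeof(t));     /* host-native store */
    } else {
        helper_store_le(mem, t);
    }
}
```

On a little-endian host all four combinations (inline or hoisted, hit or
miss) leave the value in guest memory in big-endian byte order.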


r~
