qemu-devel

Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions


From: Pranith Kumar
Subject: Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
Date: Tue, 19 Jul 2016 14:55:15 -0400

Paolo Bonzini writes:

> On 14/07/2016 22:29, Pranith Kumar wrote:
>> +            } else if (curr_mb_type == TCG_BAR_STRL &&
>> +                       prev_mb_type == TCG_BAR_LDAQ) {
>> +                /* Consecutive load-acquire and store-release barriers
>> +                 * can be merged into one stronger SC barrier
>> +                 * ldaq; strl => ld; mb; st
>> +                 */
>> +                args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
>> +                tcg_op_remove(s, prev_op);
>
> Is this really an optimization?  For example the processor could reorder
> "st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2".  It cannot do this
> if you change ldaq1/strl2 to ld1/mb/st2.
>
> On x86 for example a memory fence costs ~50 clock cycles, while normal
> loads and stores are of course faster.
>
> Of course this is useful if your target doesn't have ldaq/strl
> instructions.  In this case, however, you probably want to lower ldaq to
> "ld;mb" and strl to "mb;st"; the other optimizations then will remove
> the unnecessary barrier.
>

I agree that this is a conservative optimization. The problem is that
currently, even for architectures which have ldaq/strl instructions, the TCG
backend does not generate them; it just generates plain loads and stores. I
guess we didn't need them, since TCG was single-threaded before MTTCG.
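
To make the quoted hunk concrete, here is a minimal standalone sketch of the
merge it performs. This is not the TCG code: BarrierOp and merge_barriers()
are hypothetical names, and the numeric values of the TCG_BAR_* constants
are assumed for illustration (only the 0x0F mask comes from the hunk).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed encoding, for illustration only: barrier type in the high
 * nibble, ordering bits in the low nibble (hence the 0x0F mask in the
 * quoted hunk). The numeric values are not taken from the patch. */
enum {
    TCG_BAR_LDAQ = 0x10,
    TCG_BAR_STRL = 0x20,
    TCG_BAR_SC   = 0x30,
};

/* Hypothetical stand-in for a barrier op: just its argument word. */
typedef struct {
    unsigned arg;
} BarrierOp;

/* Merge each adjacent "ldaq; strl" barrier pair into a single SC
 * barrier, in place. Returns the number of ops that remain. */
static size_t merge_barriers(BarrierOp *ops, size_t n)
{
    size_t out = 0;

    assert(ops != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (out > 0 &&
            (ops[out - 1].arg & 0xF0) == TCG_BAR_LDAQ &&
            (ops[i].arg & 0xF0) == TCG_BAR_STRL) {
            /* As in the hunk: keep the current op's ordering bits,
             * upgrade its type to SC, and drop the previous op. */
            ops[out - 1].arg = (ops[i].arg & 0x0F) | TCG_BAR_SC;
        } else {
            ops[out++] = ops[i];
        }
    }
    return out;
}
```

Note that the merged barrier is of type SC, so a following strl barrier is
not folded into it again.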

I am trying to add support for generating these instructions on AArch64. Once
this is done, we can disable the above optimization.
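
As a rough analogy only (C11 atomics, not TCG ops), the two shapes under
discussion look like the sketch below. The function names are made up for
illustration. The "native" forms are what a compiler can typically map to
load-acquire/store-release instructions where the target has them; the
"lowered" forms follow the "ld;mb" / "mb;st" pattern from the quoted text,
except that the C11 acquire/release fences used here are weaker than a full
memory barrier.

```c
#include <stdatomic.h>

/* Load-acquire expressed as a single acquire load. */
static int ldaq_native(_Atomic int *p)
{
    return atomic_load_explicit(p, memory_order_acquire);
}

/* Load-acquire lowered to a plain (relaxed) load followed by a fence. */
static int ldaq_lowered(_Atomic int *p)
{
    int v = atomic_load_explicit(p, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    return v;
}

/* Store-release expressed as a single release store. */
static void strl_native(_Atomic int *p, int v)
{
    atomic_store_explicit(p, v, memory_order_release);
}

/* Store-release lowered to a fence followed by a plain (relaxed) store. */
static void strl_lowered(_Atomic int *p, int v)
{
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(p, v, memory_order_relaxed);
}
```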

-- 
Pranith


