Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
From: Pranith Kumar
Subject: Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
Date: Tue, 19 Jul 2016 14:55:15 -0400

Paolo Bonzini writes:
> On 14/07/2016 22:29, Pranith Kumar wrote:
>> + } else if (curr_mb_type == TCG_BAR_STRL &&
>> + prev_mb_type == TCG_BAR_LDAQ) {
>> + /* Consecutive load-acquire and store-release barriers
>> + * can be merged into one stronger SC barrier
>> + * ldaq; strl => ld; mb; st
>> + */
>> + args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
>> + tcg_op_remove(s, prev_op);
>
> Is this really an optimization? For example the processor could reorder
> "st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2". It cannot do this
> if you change ldaq1/strl2 to ld1/mb/st2.
>
> On x86 for example a memory fence costs ~50 clock cycles, while normal
> loads and stores are of course faster.
>
> Of course this is useful if your target doesn't have ldaq/strl
> instructions. In this case, however, you probably want to lower ldaq to
> "ld;mb" and strl to "mb;st"; the other optimizations then will remove
> the unnecessary barrier.
>
I agree that this is a conservative optimization. The problem is that
currently, even for architectures which have ldaq/strl instructions, the TCG
backend does not generate them; TCG just generates plain loads and stores. I
guess we didn't need to, since TCG was single-threaded before MTTCG.

I am trying to add support for generating these instructions on AArch64. Once
this is done, we can disable the above optimization.
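
For reference, the merge under discussion amounts to roughly the following.
This is a standalone sketch, not the patch itself: the op list, struct op,
merge_barriers() and the BAR_* values are made up for illustration, and the
real TCG_BAR_* encodings may differ. It only assumes, as in the quoted hunk,
that the low nibble of the barrier argument is kept from the current barrier
and the barrier type lives in the bits above it.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative encodings only; not the real TCG_BAR_* values. */
enum {
    BAR_LDAQ = 0x10,   /* load-acquire barrier */
    BAR_STRL = 0x20,   /* store-release barrier */
    BAR_SC   = 0x30,   /* sequentially consistent barrier */
};

enum op_kind { OP_LD, OP_ST, OP_MB };

/* Hypothetical op representation, much simpler than a real TCGOp. */
struct op {
    enum op_kind kind;
    int arg;           /* for OP_MB: low nibble = ordering bits,
                          high nibble = barrier type */
};

/*
 * Walk the op list once and fold every adjacent
 * "mb(ldaq); mb(strl)" pair into a single "mb(sc)".
 * The list is compacted in place; the new length is returned.
 */
size_t merge_barriers(struct op *ops, size_t n)
{
    size_t out = 0;

    assert(ops != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (out > 0 &&
            ops[i].kind == OP_MB &&
            ops[out - 1].kind == OP_MB &&
            (ops[i].arg & 0xF0) == BAR_STRL &&
            (ops[out - 1].arg & 0xF0) == BAR_LDAQ) {
            /* Keep the current barrier's low bits, upgrade the type to SC,
             * and drop the previous barrier by overwriting its slot. */
            ops[out - 1].arg = (ops[i].arg & 0x0F) | BAR_SC;
            continue;
        }
        ops[out++] = ops[i];
    }
    return out;
}
```

With this sketch, "ld; mb(ldaq); mb(strl); st" becomes "ld; mb(sc); st", which
is the case Paolo's reordering example is about: the single SC barrier is
stronger than the acquire/release pair it replaces.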
--
Pranith