From: Peter Maydell
Subject: Re: [Qemu-discuss] Getting qemu-system-i386 to use more than one core on Cortex A7 host
Date: Tue, 5 Jan 2016 23:53:01 +0000

On 5 January 2016 at 23:10, Jakob Bohm <address@hidden> wrote:
> On 05/01/2016 18:35, Peter Maydell wrote:
>> (It would also be possible
>> to use the v8 ARM load-acquire and store-release instructions
>> rather than full on barriers, but on v7 I think barriers are
>> the only answer.)
> The Load acquire/store if no conflict instruction pair was introduced
> halfway through the Armv6 architecture, though it may be missing on
> some non-A Armv7 cores, since it is not required for that processor
> class.

I think you are thinking of the load-exclusive/store-exclusive
instructions, which did indeed appear in ARMv6 and provide
"only store if no conflict" semantics for implementing atomic
operations. Load-acquire/store-release are different and are
new in ARMv8 -- they are a bit like a normal load/store with a
built-in one-sided barrier: if you do a load-acquire then some normal
loads/stores, other CPUs must see your load-acquire before the
other operations (but loads/stores that happened before the
load-acquire might still be ordered after it). Similarly if
you do some loads and stores followed by a store-release then
other processors must see your store-release last.
(I've simplified somewhat here; see the architecture manual for
the exact semantics.)
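
A minimal C11 sketch of the distinction, under stated assumptions: the
names publish, try_consume and increment are hypothetical and come from
neither QEMU nor the ARM manuals. The release/acquire pair shows the
one-sided ordering described above. The compare-and-swap loop shows the
"only store if no conflict" idea; compilers typically implement it with
a load-exclusive/store-exclusive loop on ARMv6/v7. On ARMv8 hosts
compilers typically turn the acquire load and release store into
load-acquire/store-release instructions, and on v7 into barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state, for illustration only. */
static int payload;         /* plain, non-atomic data */
static atomic_bool ready;   /* flag that guards the payload */
static atomic_int counter;  /* updated only if nobody else changed it */

/* Store-release: the write to 'payload' must be visible to other
 * threads no later than the write to 'ready'. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Load-acquire: the read of 'payload' cannot be observed as happening
 * before the read of 'ready'. Returns false if nothing was published. */
bool try_consume(int *out)
{
    assert(out != 0);
    if (!atomic_load_explicit(&ready, memory_order_acquire)) {
        return false;
    }
    *out = payload;
    return true;
}

/* Atomic increment: the store succeeds only if 'counter' still holds
 * the value we loaded; otherwise 'old' is refreshed and we retry. */
int increment(void)
{
    int old = atomic_load_explicit(&counter, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(&counter, &old, old + 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* conflict (or spurious failure): try again with the new 'old' */
    }
    return old + 1;
}
```

In real use publish() would run on one thread and try_consume() on
another; increment() may be called from any number of threads. Note
that increment() by itself imposes no ordering on other memory
accesses, which is exactly why it is not a substitute for
acquire/release.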

> Additionally, I think some ARM MMUs have page or region level
> memory ordering flags, including some flag combinations that break
> normal Arm synchronization instructions.

This is true but not really important for considering QEMU
running on an ARM host -- all the RAM we get from the host
OS will be Normal memory, not Device or Strongly-ordered.

> But anyway, it might be worth allowing the P5 reordering rules on x86
> if that improves the situation.  It might also be worth doing some "is
> the host CPU too aggressively reordering" conditionals both compile
> time and runtime, switching between different TCG multi-core strategies
> depending on the exact host CPU.

I'm not sure how you would test at runtime whether the CPU might
decide to reorder accesses -- I think you have to assume the
worst case imposed by the architecture.
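
A rough sketch of what "assume the worst case imposed by the
architecture" could look like as a compile-time decision. This is an
illustration only, not QEMU's actual configuration logic: the enum and
function names are hypothetical, and the only real identifiers are the
compiler-predefined macros __x86_64__ and __i386__.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, coarse classification of host memory models. */
enum host_order { HOST_ORDER_STRONG, HOST_ORDER_WEAK };

/* Decide from the architecture we are compiled for, not by probing
 * the CPU: a run that shows no reordering proves nothing. */
enum host_order host_memory_order(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return HOST_ORDER_STRONG;  /* x86 host: strong ordering guaranteed */
#else
    return HOST_ORDER_WEAK;    /* ARM, MIPS, PPC, unknown: assume worst case */
#endif
}

/* An x86 guest relies on strong ordering, so a weakly ordered host
 * would have to add barriers around guest memory accesses. */
bool need_barriers_for_x86_guest(void)
{
    return host_memory_order() == HOST_ORDER_WEAK;
}

/* Human-readable name of the strategy this build would pick. */
const char *strategy_name(void)
{
    enum host_order o = host_memory_order();
    assert(o == HOST_ORDER_STRONG || o == HOST_ORDER_WEAK);
    return o == HOST_ORDER_STRONG ? "no extra barriers"
                                  : "barriers around guest accesses";
}
```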

> Another tactic could be to not let more than one virtual core have
> actual access to the same page if at least one of them has write
> access.  So the minority of code that actually does do multi-core data
> updates to the same virtualized memory page and might thus be affected
> by ordering rules would cause the emulator to constantly switch the
> shared page back and forth, while most other code will just run along
> nicely using shared read or exclusive write page accesses.

This is an interesting idea; I guess it would need to be
implemented and benchmarked to see whether the overhead on typical
workloads is low enough for it to be worthwhile.
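
For anyone wanting to experiment, the bookkeeping for such a scheme
might look roughly like the sketch below. It is a single-threaded model
only: all names are hypothetical, it assumes at most 32 virtual CPUs,
and a real implementation would also have to change page protections
and take a lock around each transition. The 'transfers' counter stands
in for the overhead that would need benchmarking.

```c
#include <assert.h>
#include <stdint.h>

#define NO_WRITER (-1)

/* Hypothetical per-page ownership record. Invariant: if 'writer' is
 * set, no other vCPU has any access to the page. */
struct page_owner {
    uint32_t readers;    /* bitmask of vCPUs with shared read access */
    int writer;          /* vCPU with exclusive write access, or NO_WRITER */
    unsigned transfers;  /* number of ownership changes so far */
};

void page_init(struct page_owner *p)
{
    p->readers = 0;
    p->writer = NO_WRITER;
    p->transfers = 0;
}

/* vCPU 'cpu' wants to read the page. */
void page_read(struct page_owner *p, int cpu)
{
    assert(cpu >= 0 && cpu < 32);
    if (p->writer == cpu || (p->readers & (UINT32_C(1) << cpu))) {
        return;  /* already has access: no switch needed */
    }
    if (p->writer != NO_WRITER) {
        /* downgrade the current writer to a plain reader */
        p->readers |= UINT32_C(1) << p->writer;
        p->writer = NO_WRITER;
    }
    p->readers |= UINT32_C(1) << cpu;
    p->transfers++;
}

/* vCPU 'cpu' wants to write: it must become the sole owner. */
void page_write(struct page_owner *p, int cpu)
{
    assert(cpu >= 0 && cpu < 32);
    if (p->writer == cpu) {
        return;  /* already exclusive: no switch needed */
    }
    p->readers = 0;  /* revoke every other vCPU's access */
    p->writer = cpu;
    p->transfers++;
}
```

Two vCPUs alternately writing the same page bump 'transfers' on every
access, which is the "constantly switch the shared page back and forth"
case; repeated reads or repeated writes by one vCPU cost nothing extra.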

> But in the end if x86 really makes these guarantees even in multi-
> socket setups (more than one physical x86 CPU in a suitable
> motherboard), despite the normal effects of caching, while ARM doesn't,
> that kind of sucks.  Though we shouldn't forget that those are not
> the only 2 architectures involved.

It's just an unfortunate architectural philosophy mismatch,
and ARM and x86 are my usual examples of the two possibilities.
I think most non-x86 architectures go for a weaker memory
model than x86 did, so MIPS and PPC are on the ARM end of
the spectrum.

-- PMM