From: Torvald Riegel
Subject: Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
Date: Wed, 19 Jun 2013 22:25:03 +0200

On Wed, 2013-06-19 at 17:14 +0200, Paolo Bonzini wrote:
> Il 19/06/2013 15:15, Torvald Riegel ha scritto:
> >> One reason is that implementing SC for POWER is quite expensive,
> > 
> > Sure, but you don't have to use SC fences or atomics if you don't want
> > them.  Note that C11/C++11 as well as the __atomic* builtins allow you
> > to specify a memory order.  It's perfectly fine to use acquire fences or
> > release fences.  There shouldn't be just one kind of barrier/fence.
> Agreed.  For example Linux uses four: consume (read_barrier_depends),
> acquire (rmb), release (wmb), SC (mb).  In addition in Linux loads and
> stores are always relaxed, some RMW ops are SC but others are relaxed.
> I want to do something similar in QEMU, with as few changes as possible.
>  In the end I settled for the following:
> (1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
> but RMW shouldn't be that bad.  I don't care if reference counting is a
> little slower than it could be, for example);

I doubt relaxed RMW ops are sufficient even for reference counting.
Typically, the reference counter is used in a way conceptually similar to
a lock, so you need the acquire/release (modulo funny optimizations).  The only
use case that comes to my mind right now for relaxed RMW is really just
statistics counters or such, or cases where you can "re-use" another
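
A minimal sketch of the reference-counting case, assuming C11
<stdatomic.h>.  The type and function names (struct obj, obj_get,
obj_put) are hypothetical and not taken from any QEMU patch.  The
decrement uses acq_rel so that the thread dropping the last reference
observes every earlier write to the object before it frees it; the
relaxed increment is an assumption that holds only when the caller
already owns a reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object type, for illustration only. */
struct obj {
    atomic_uint refcount;
    int payload;
};

static void obj_init(struct obj *o, int payload)
{
    o->payload = payload;
    atomic_init(&o->refcount, 1);   /* the creator holds the first reference */
}

/* Take a reference.  The caller must already hold one, so no ordering
 * is needed here (assumption: relaxed is enough for the increment). */
static void obj_get(struct obj *o)
{
    unsigned old = atomic_fetch_add_explicit(&o->refcount, 1,
                                             memory_order_relaxed);
    assert(old > 0);
    (void)old;
}

/* Drop a reference.  Release publishes this thread's writes to the
 * object; acquire makes the last dropper see everyone else's writes.
 * Returns true if the caller dropped the last reference and may free. */
static bool obj_put(struct obj *o)
{
    return atomic_fetch_sub_explicit(&o->refcount, 1,
                                     memory_order_acq_rel) == 1;
}
```

With one extra obj_get, the first obj_put returns false and the second
returns true.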

> (2) I'd like to have some kind of non-reordering load/store too, either
> SC (which I've improperly referred to as C11/C++11 in my previous email)
> or Java volatile.

Often you probably don't need more than acq/rel, as Paul pointed out.
SC becomes important once you do something like Dekker-style sync, so
cases where you sync via several separate variables to avoid the cache
misses in some common case.  Once you go through one variable in the
end, acq/rel should be fine.
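
A sketch of a Dekker-style pattern where SC matters, assuming C11
<stdatomic.h>.  The scenario (a producer that skips waking a consumer
in the common case) and all names are hypothetical.  Each side stores
to its own variable and then loads the other's; with seq_cst it cannot
happen that the producer sees "not idle" while the consumer sees "no
work".  With only release stores and acquire loads that outcome is
allowed, which would be a lost wakeup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool work_pending;    /* written by the producer */
static atomic_bool consumer_idle;   /* written by the consumer */

/* Producer: post work, then report whether the consumer must be woken. */
static bool post_work(void)
{
    atomic_store_explicit(&work_pending, true, memory_order_seq_cst);
    return atomic_load_explicit(&consumer_idle, memory_order_seq_cst);
}

/* Consumer: declare itself idle, then re-check for work.
 * Returns true if it is safe to go to sleep. */
static bool prepare_to_sleep(void)
{
    atomic_store_explicit(&consumer_idle, true, memory_order_seq_cst);
    if (atomic_load_explicit(&work_pending, memory_order_seq_cst)) {
        atomic_store_explicit(&consumer_idle, false, memory_order_seq_cst);
        return false;
    }
    return true;
}

/* Consumer: take the work it has seen pending. */
static void take_work(void)
{
    bool had = atomic_exchange_explicit(&work_pending, false,
                                        memory_order_seq_cst);
    assert(had);
    (void)had;
}
```

The common case costs the producer one store and one load of a flag
that rarely changes; the wakeup path is taken only when post_work
returns true.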

>    [An aside: Java guarantees that volatile stores are not reordered
>    with volatile loads.  This is not guaranteed by just using release
>    stores and acquire loads, and is why IIUC acq_rel < Java < seq_cst].

Or maybe Java volatile is acq for loads and seq_cst for stores...

> As long as you only have a producer and a consumer, C11 is fine, because
> all you need is load-acquire/store-release.  In fact, if it weren't for
> the experience factor, C11 is easier than manually placing acquire and
> release barriers.  But as soon as two or more threads are reading _and_
> writing the shared memory, it gets complicated and I want to provide
> something simple that people can use.  This is the reason for (2) above.

I can't quite follow you here.  There is a total order for all
modifications to a single variable, and if you use acq/rel combined with
loads and stores on this variable, then you basically can make use of
the total order.  (All loads that read-from a certain store get a
synchronized-with (and thus happens-before edge) with the store, and the
stores are in a total order.)  This is independent of the number of
readers and writers.  The difference starts once you want to sync with
more than one variable, and need to establish an order between those
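
A sketch of synchronizing through a single variable with acq/rel,
assuming C11 <stdatomic.h>.  The names are hypothetical, and for
brevity it assumes one publisher that publishes once.  Any number of
readers may call try_consume: each load that reads the release store
gets a happens-before edge from it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;         /* plain, non-atomic data */
static atomic_bool ready;   /* the single synchronization variable */

/* Writer: fill in the data, then publish it with a release store. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: an acquire load that reads the release store synchronizes
 * with it, so the plain read of payload afterwards is race-free. */
static bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```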

> There will still be a few cases that need to be optimized, and this is
> where the difficult requirements come in:
> (R1) the primitives *should* not be alien to people who know Linux.
> (R2) those optimizations *must* be easy to do and review; at least as
> easy as these things go.
> The two are obviously related.  Ease of review is why it is important to
> make things familiar to people who know Linux.
> In C11, relaxing SC loads and stores is complicated, and more
> specifically hard to explain!

I can't see why that would be harder than reasoning about equally weaker
Java semantics.  But you obviously know your community, and I don't :)

> I cannot do that myself, and much less
> explain that to the community.  I cannot make them do that.
> Unfortunately, relaxing SC loads and stores is important on POWER which
> has efficient acq/rel but inefficient SC (hwsync in the loads).  So, C11
> fails both requirements. :(
> By contrast, Java volatile semantics are easily converted to a sequence
> of relaxed loads, relaxed stores, and acq/rel/sc fences.

The same holds for C11/C++11.  If you look at either the standard or the
Batty model, you'll see that for every pair like store(rel)--load(acq),
there is also store(rel)--fence(acq)+load(relaxed),
store(relaxed)+fence(rel)--fence(acq)+load(relaxed), etc. defined,
giving the same semantics.  Likewise for SC.
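
A sketch of the store(relaxed)+fence(rel)--fence(acq)+load(relaxed)
form, assuming C11 <stdatomic.h>; the names are hypothetical.  It is
meant to give the same happens-before edge as a release store paired
with an acquire load, with the ordering expressed by fences and the
accesses themselves relaxed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int message;         /* plain, non-atomic data */
static atomic_int posted;   /* 0 = nothing posted yet */

/* store(relaxed)+fence(rel): the fence goes before the relaxed store. */
static void post_fenced(int value)
{
    message = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&posted, 1, memory_order_relaxed);
}

/* fence(acq)+load(relaxed): the fence goes after the relaxed load. */
static bool poll_fenced(int *out)
{
    assert(out != NULL);
    if (atomic_load_explicit(&posted, memory_order_relaxed) == 0)
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = message;
    return true;
}
```

Either side can be swapped for its non-fence counterpart (a release
store in post_fenced, or an acquire load in poll_fenced) without
changing what the other side may rely on.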

> It's almost an
> algorithm; I tried to do that myself and succeeded, I could document it
> nicely.  Even better, there are authoritative sources that confirm my
> writing and should be accessible to people who did synchronization
> "stuff" in Linux (no formal models :)).  In this respect, Java satisfies
> both requirements.
> And the loss is limited, since things such as Dekker's algorithm are
> rare in practice.  (In particular, RCU can be implemented just fine with
> Java volatile semantics, but load-acquire/store-release is not enough).

You can also build Dekker with SC stores and acq loads, if I'm not
mistaken.  Typically one would probably use SC fences and relaxed
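
A sketch of a two-thread Dekker-style try-lock built from SC fences and
relaxed accesses, assuming C11 <stdatomic.h>.  The names are
hypothetical, and this is only the announce/check handshake, not the
full Dekker algorithm with its turn variable.  The SC fence between the
store and the load is what prevents both sides from reading 0; the
acquire and release fences are an added assumption so that data
protected by the lock is handed over correctly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int want[2];   /* want[i] != 0: side i wants the lock */

/* Side `me` (0 or 1) announces itself, then checks the other side. */
static bool try_enter_fenced(int me)
{
    assert(me == 0 || me == 1);
    atomic_store_explicit(&want[me], 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&want[1 - me], memory_order_relaxed) == 0) {
        /* Pairs with the release fence in leave_fenced(). */
        atomic_thread_fence(memory_order_acquire);
        return true;
    }
    /* The other side is in or on its way in: back off. */
    atomic_store_explicit(&want[me], 0, memory_order_relaxed);
    return false;
}

static void leave_fenced(int me)
{
    assert(me == 0 || me == 1);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&want[me], 0, memory_order_relaxed);
}
```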

> [Nothing really important after this point, I think].
> > Note that there is a reason why C11/C++11 don't just have barriers
> > combined with ordinary memory accesses: The compiler needs to be aware
> > which accesses are sequential code (so it can assume that they are
> > data-race-free) and which are potentially concurrent with other accesses
> > to the same data.  [...]
> > you can try to make this very likely be correct by careful
> > placement of asm compiler barriers, but this is likely to be more
> > difficult than just using atomics, which will do the right thing.
> Note that asm is just for older compilers (and even then I try to use
> GCC intrinsics as much as possible).
> On newer compilers I do use atomics (SC RMW ops, acq/rel/SC/consume
> thread fences) to properly annotate references.  rth also suggested that
> I use load/store(relaxed) instead of C volatile.

I agree with rth's suggestion.
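
A sketch of what relaxed loads and stores in place of C volatile could
look like, assuming the GCC/Clang __atomic builtins mentioned earlier
in the thread.  The macro and function names are hypothetical.  The
relaxed accesses tell the compiler that the variable may be accessed
concurrently, without asking for any ordering.

```c
#include <assert.h>

/* Hypothetical helpers; assumes GCC >= 4.7 or Clang. */
#define relaxed_read(ptr)      __atomic_load_n((ptr), __ATOMIC_RELAXED)
#define relaxed_set(ptr, val)  __atomic_store_n((ptr), (val), __ATOMIC_RELAXED)

static int stop_requested;

static void request_stop(void)
{
    relaxed_set(&stop_requested, 1);
}

/* Poll the flag up to max_iters times; returns how many polls saw it
 * clear.  The atomic load is performed on every iteration, so the
 * compiler cannot hoist it out of the loop as it could a plain read. */
static unsigned long spin_until_stopped(unsigned long max_iters)
{
    unsigned long n = 0;

    assert(max_iters > 0);
    while (n < max_iters && !relaxed_read(&stop_requested))
        n++;
    return n;
}
```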

> > Maybe the issue that you see with C11/C++11 is that it offers more than
> > you actually need.  If you can summarize what kind of synchronization /
> > concurrent code you are primarily looking at, I can try to help outline
> > a subset of it (i.e., something like code style but just for
> > synchronization).
> Is the above explanation clearer?

Yes, thanks.

> >> I obviously trust Cambridge for
> >> C11/C++11, but their material is very concise or just refers to the
> >> formal model.
> > 
> > Yes, their publications are really about the model.  It's not a
> > tutorial, but useful for reference.  BTW, have you read their C++ paper
> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
> > or the POPL paper?  The former has more detail (no page limit).
> I know it, but I cannot say I tried hard to understand it.
> > If you haven't yet, I suggest giving their cppmem tool a try too:
> > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> I saw and tried the similar tool for POWER.  The problem with these
> tools is that they require you to abstract your program into an input
> to the tool.  It works when _writing_ the code, but not when reviewing it.

I agree that it isn't a model checker for existing programs (but that
would likely be quite slow :)).  But it can help people to learn the

> > I guess so.  But you also have to consider the legacy that you create.
> > I do think the C11/C++11 model will be used widely, and more and more
> > people will get used to it.
> I don't think many people will learn how to use the various non-seqcst
> modes...  At least so far I punted. :)

But you already use similarly weaker orderings that the other
abstractions provide (e.g., Java), so you're half-way there :)


