

From: Anthony Liguori
Subject: Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
Date: Wed, 16 May 2012 14:39:59 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20120329 Thunderbird/11.0.1

On 05/15/2012 11:35 PM, Benjamin Herrenschmidt wrote:


>>>> +    /* HACK: full memory barrier here */
>>>> +    __sync_synchronize();
>>>
>>> I thought you were going to limit this to the TCE iommu?

>> So, it wasn't my intention to send this one with the rest, but I
>> forgot to explain that to Ben when he resent.  As the comment
>> suggests, this is a hack we have been using internally to see if
>> certain ordering problems were what we thought they were.  If that
>> turned out to be the case (and it now looks like it is), we need to
>> work out where to correctly place this barrier.  As Ben says, this
>> should probably really be in the PCI accessors, and we should use the
>> finer grained primitives from qemu-barrier.h rather than the brute
>> force __sync_synchronize().
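
For reference, a rough sketch of what those finer-grained primitives buy
compared to the brute-force call.  The macro names follow the
qemu-barrier.h convention, but the x86 definitions below are illustrative
stand-ins, not QEMU's:

/* Compiler barrier only: stops the compiler reordering, no CPU fence. */
#define sketch_barrier()  __asm__ __volatile__("" ::: "memory")

/* On x86 (TSO) these two need nothing beyond the compiler barrier:    */
#define sketch_smp_wmb()  sketch_barrier()   /* store vs. later store  */
#define sketch_smp_rmb()  sketch_barrier()   /* load  vs. later load   */

/* Ordering a store against a *later load* needs a real fence:         */
#define sketch_smp_mb()   __asm__ __volatile__("mfence" ::: "memory")

/* The hack in the patch, __sync_synchronize(), expands to a full
 * barrier (an mfence or locked instruction on x86) on every call,
 * whether or not that much ordering is actually needed there.         */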

> Well, I knew you didn't intend to send them but I still think that's the
> right patch for now :-)

> So we -could- put it in the PCI accessors ... but that would mean fixing
> all drivers to actually use them. For example, ide/ahci or usb/ohci
> don't, and they aren't the only ones.

> In the end, I don't think there's anything we care about which would
> not benefit from ensuring that the DMAs it performs appear to the
> guest kernel in the order they were issued. Most busses provide that
> guarantee to some extent, and while some busses do have the ability to
> explicitly request relaxed ordering, I don't think that's the case for
> anything we care about emulating at this stage (and we can always add
> separate accessors, or flags alongside the direction, for that).

I must confess, I have no idea what PCI et al guarantee with respect to ordering. What's nasty about this patch is that you're not just ordering wrt device writes/reads, but also with the other VCPUs. I don't suspect this would be prohibitively expensive but it still worries me.

> So by putting the barrier right in the dma_* accessor we kill all the
> birds with one stone without having to audit all drivers for use of the
> right accessors and all bus types.
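
As a minimal sketch of what "the barrier right in the dma_* accessor"
looks like in practice -- the names and signature here are hypothetical,
not QEMU's actual DMA API, and backend_rw() stands in for the underlying
physical-memory access:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Stand-in for the real physical-memory read/write path. */
void backend_rw(dma_addr_t addr, void *buf, size_t len, bool is_write);

static inline void dma_memory_rw_sketch(dma_addr_t addr, void *buf,
                                        size_t len, bool is_write)
{
    /* Full barrier before every guest-memory access made on behalf of
     * a device: everything the device model (or its I/O thread) did
     * earlier is ordered ahead of this transfer as seen by the vCPUs.
     * Because every driver and bus type funnels through this helper,
     * no per-driver audit of the PCI accessors is needed. */
    __sync_synchronize();

    backend_rw(addr, buf, len, is_write);
}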

> Also while the goal of using more targeted barriers might be worthwhile
> in the long run, it's not totally trivial because we do want to order
> store vs. subsequent loads in all cases and load vs. loads, and we don't
> want to have to keep track of what the previous access was, so at this
> stage it's simply easier to just use a full barrier.
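
A small illustration of that case, with a hypothetical device model that
stores a completion record into guest memory and then loads the next
descriptor: smp_wmb() (store/store) or smp_rmb() (load/load) would not
order the store against the subsequent load, only a full barrier does,
which is why a cheaper primitive would mean tracking the previous access:

#include <stdint.h>

/* Hypothetical pointers into guest RAM, for illustration only. */
extern volatile uint32_t *guest_completion;
extern volatile uint64_t *guest_next_desc;

uint64_t complete_and_fetch_next(uint32_t status)
{
    *guest_completion = status;   /* store: publish the completion      */
    __sync_synchronize();         /* store vs. subsequent load needs a
                                   * full barrier, not just wmb/rmb     */
    return *guest_next_desc;      /* load: fetch the next work item     */
}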

> So my suggestion is to see if that patch introduces a measurable
> performance regression anywhere we care about (i.e. on x86) and if
> not, just go for it; it will solve a very real problem, and we can
> ponder ways to do it better as a second step if it's worthwhile.

> Anthony, how do you usually benchmark these things? Any chance you can
> run a few tests to see if there's any visible loss?

My concern would really be limited to virtio ring processing. It all depends on where you place the barriers in the end.
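
To make the virtio concern concrete: the device side checks the avail
index and then walks the ring, i.e. many small guest-memory loads per
request, which is where any extra implicit barriers would show up.  A
simplified sketch of that device-side pattern (structures and names are
simplified from the virtio ring layout, not QEMU's code):

#include <stdint.h>

#define RING_SIZE_SKETCH 256

struct avail_sketch {
    volatile uint16_t idx;                      /* bumped by the guest driver  */
    volatile uint16_t ring[RING_SIZE_SKETCH];   /* descriptor heads from guest */
};

/* Returns the next descriptor head, or -1 if nothing new was published. */
int pop_avail_sketch(struct avail_sketch *avail, uint16_t *last_seen)
{
    if (avail->idx == *last_seen) {
        return -1;
    }
    __sync_synchronize();   /* read idx before reading the ring entry;
                             * an rmb is what the protocol itself needs */
    return avail->ring[(*last_seen)++ % RING_SIZE_SKETCH];
}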

I really don't want to just conservatively stick barriers everywhere either. I'd like to have a specific ordering guarantee and then implement that and deal with the performance consequences.

I also wonder if the "fix" that you see from this is papering over a bigger problem. Can you explain the OHCI problem that led you to do this in the first place?

Regards,

Anthony Liguori


> Cheers,
> Ben.






