[Qemu-devel] Re: Caching modes


From: Anthony Liguori
Subject: [Qemu-devel] Re: Caching modes
Date: Tue, 21 Sep 2010 10:13:01 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.12) Gecko/20100826 Lightning/1.0b1 Thunderbird/3.0.7

On 09/21/2010 09:26 AM, Christoph Hellwig wrote:
On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
O_DIRECT alone to a pre-allocated file on a normal file system should
result in the data being visible without any additional metadata
transactions.
Anthony, for the third time: no.  O_DIRECT is a non-portable extension
in Linux (taken from IRIX) and is defined as:


        O_DIRECT (Since Linux 2.4.10)
               Try  to minimize cache effects of the I/O to and from this file.
               In general this will degrade performance, but it  is  useful  in
               special  situations,  such  as  when  applications  do their own
               caching.  File I/O is done directly to/from user space  buffers.
               The O_DIRECT flag on its own makes an effort to transfer data
               synchronously, but does not give the guarantees of the O_SYNC
               flag that data and necessary metadata are transferred.  To
               guarantee synchronous I/O, O_SYNC must be used in addition to
               O_DIRECT.
               See NOTES below for further discussion.

               A  semantically  similar  (but  deprecated)  interface for block
               devices is described in raw(8).

O_DIRECT does not have any meaning for data integrity; it just tells the
filesystem it *should* not use the pagecache.  Even then, various
filesystems have fallbacks to buffered I/O for corner cases.  It does
*not* mean the actual disk cache gets flushed, and it does *not*
guarantee anything about metadata, which is very important.
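
Concretely, that distinction looks roughly like this on Linux (a minimal
sketch, not QEMU code; "disk.img" is assumed to be an existing,
pre-allocated file and 4096 is a placeholder for the device's logical
block size):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;

    /* O_DIRECT requires the buffer, offset and length to be aligned,
     * typically to the logical block size of the underlying device. */
    if (posix_memalign(&buf, 4096, 4096))
        return 1;
    memset(buf, 0xab, 4096);

    /* O_DIRECT alone only bypasses the page cache.  Adding O_DSYNC makes
     * each write also push the data (and any metadata needed to reach it)
     * to stable storage before returning. */
    int fd = open("disk.img", O_WRONLY | O_DIRECT | O_DSYNC);
    if (fd < 0)
        return 1;

    if (pwrite(fd, buf, 4096, 0) != 4096)
        return 1;

    /* Without O_DSYNC, an explicit fdatasync(fd) here would be needed to
     * get the same guarantee. */
    close(fd);
    free(buf);
    return 0;
}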

Yes, I understand all of this, but I was trying to avoid accepting it. After the call today, though, I'm convinced that this is fundamentally a filesystem problem.

I think what we need to do is:

1) make the virtual WC guest controllable. If a guest enables the WC, &= ~O_DSYNC; if it disables it, |= O_DSYNC (see the sketch after this list). Obviously, we can let a user specify the virtual WC mode, but it has to be changeable during live migration.

2) only let the user choose between using and not using the host page cache. IOW, direct=on|off. cache=XXX is deprecated.

3) make O_DIRECT | O_DSYNC not suck so badly on ext4.
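
The idea in (1) is that the open flags simply follow the guest-visible
write cache enable (WCE) bit.  A rough sketch with invented names (real
QEMU block-layer code is organized differently):

#include <fcntl.h>
#include <stdbool.h>

struct blockdev {
    int open_flags;     /* flags the image file is (re)opened with */
    bool guest_wce;     /* write cache enable bit as seen by the guest */
};

/* Called when the guest toggles its write cache, e.g. via ATA SET FEATURES
 * or a SCSI MODE SELECT on the caching mode page. */
static void blockdev_set_guest_wce(struct blockdev *bd, bool wce)
{
    bd->guest_wce = wce;
    if (wce) {
        /* The guest owns the cache and will send flushes itself. */
        bd->open_flags &= ~O_DSYNC;
    } else {
        /* No guest-visible cache: every write must be stable on return. */
        bd->open_flags |= O_DSYNC;
    }
    /* The image then has to be reopened with the new flags; on Linux,
     * O_DSYNC cannot be toggled on an open descriptor with fcntl(). */
}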

Barriers are a Linux-specific implementation detail that is in the
process of going away, probably in Linux 2.6.37.  But if you want
O_DSYNC semantics with a volatile disk write cache there is no way
around using a cache flush or the FUA bit on all I/O caused by it.
If you have a volatile disk write cache, then we don't need O_DSYNC
semantics.
If you present a volatile write cache to the guest you do indeed not
need O_DSYNC and can rely on the guest sending fdatasync calls when it
wants to flush the cache.  But for the statement above you can replace
O_DSYNC with fdatasync and it will still be correct.  O_DSYNC in current
Linux kernels is nothing but an implicit range fdatasync after each
write.
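
In other words (a sketch, not code from the thread; write_durably is an
invented helper), these two paths end up with the same on-disk guarantee
for the written range:

#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

static int write_durably(int fd, const void *buf, size_t len, off_t off,
                         bool opened_with_dsync)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    if (!opened_with_dsync) {
        /* Plain write: ask for the flush ourselves. */
        return fdatasync(fd);
    }
    /* O_DSYNC: the kernel already did the equivalent of a range fdatasync
     * before pwrite() returned. */
    return 0;
}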

Yes. I was stuck on O_DSYNC being independent of the virtual WC but it's clear to me now that it cannot be.

ext3 and ext4 have really bad fsync implementations.  Just use a better
filesystem or bug one of its developers if you want that fixed.  But
except for disabling the disk cache there is no way to get data integrity
without cache flushes (the FUA bit is nothing but an implicit flush).

But why are we issuing more flushes than the guest is issuing if we
don't have to worry about filesystem metadata (i.e. preallocated storage
or physical devices)?
Who is "we" and what is workload/filesystem/kernel combination?
Specific details and numbers please.

My concern is ext4. With a preallocated file and cache=none as implemented today, performance is good even when barrier=1. If we enable O_DSYNC, performance will plummet. Ultimately, this is an ext4 problem, not a QEMU problem.

Perhaps we can issue a warning if the WC is disabled and we do an fstatfs and see that the image is on ext4 with barriers enabled.
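
A rough sketch of such a warning, assuming it is keyed off fstatfs(2)
(the helper name is invented; note that the ext2/3/4 superblock magic is
shared and the barrier mount option is not visible this way, so this can
only warn on the filesystem family):

#include <stdio.h>
#include <stdbool.h>
#include <sys/vfs.h>
#include <linux/magic.h>

/* Warn if the guest has disabled its write cache while the image sits on
 * an ext2/3/4 filesystem, where O_DSYNC writes can be very expensive with
 * barrier=1. */
static void maybe_warn_slow_dsync(int image_fd, bool guest_wce)
{
    struct statfs sfs;

    if (guest_wce || fstatfs(image_fd, &sfs) != 0)
        return;

    /* EXT4_SUPER_MAGIC (0xEF53) is shared with ext2 and ext3. */
    if (sfs.f_type == EXT4_SUPER_MAGIC) {
        fprintf(stderr, "warning: guest write cache disabled and image is "
                "on ext2/3/4; O_DSYNC writes may be very slow\n");
    }
}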

I think it's more common for a user to want to disable a virtual WC because they have less faith in the hypervisor than they have in the underlying storage.

The scenarios I am concerned about:

1) User has enterprise storage, but has an image on ext4 with barrier=1. User explicitly disables WC in the guest because they have enterprise storage but not a UPS for the hypervisor.

2) User does not have enterprise storage, but has an image on ext4 with barrier=1. User explicitly disables WC in guest because they don't know what they're doing.

In the case of (1), the answer may be "ext4 sucks, remount with barrier=0" but I think we need to at least warn the user of this.

For (2), again it's probably the user doing the wrong thing because if they don't have enterprise storage, then they shouldn't care about a virtual WC. Practically though, I've seen a lot of this with users.

Regards,

Anthony Liguori


