From: | Anthony Liguori |
Subject: | [Qemu-devel] Re: Caching modes |
Date: | Tue, 21 Sep 2010 10:13:01 -0500 |
User-agent: | Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.12) Gecko/20100826 Lightning/1.0b1 Thunderbird/3.0.7 |
On 09/21/2010 09:26 AM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
>> O_DIRECT alone to a pre-allocated file on a normal file system should result in the data being visible without any additional metadata transactions.
>
> Anthony, for the third time: no. O_DIRECT is a non-portable extension in Linux (taken from IRIX) and is defined as:
>
>     O_DIRECT (Since Linux 2.4.10)
>         Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC that data and necessary metadata are transferred. To guarantee synchronous I/O the O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion. A semantically similar (but deprecated) interface for block devices is described in raw(8).
>
> O_DIRECT does not have any meaning for data integrity, it just tells the filesystem it *should* not use the pagecache. Even if it should not, various filesystems have fallbacks to buffered I/O for corner cases. It does *not* mean the actual disk cache gets flushed, and it does *not* guarantee anything about metadata, which is very important.
Yes, I understand all of this but I was trying to avoid accepting it. But after the call today, I'm convinced that this is fundamentally a filesystem problem.
I think what we need to do is:
1) make virtual WC guest controllable. If a guest enables WC, flags &= ~O_DSYNC. If it disables WC, flags |= O_DSYNC. Obviously, we can let a user specify the virtual WC mode but it has to be changeable during live migration.
2) only let the user choose between using and not using the host page cache. IOW, direct=on|off. cache=XXX is deprecated.
3) make O_DIRECT | O_DSYNC not suck so badly on ext4.
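The flag handling in points 1) and 2) can be sketched roughly as follows. This is only an illustration in Python, not QEMU code: `open_flags`, `wc_enabled` and `direct` are hypothetical names, and `os.O_DIRECT` assumes a Linux host.

```python
import os

def open_flags(base_flags, wc_enabled, direct):
    """Derive image open flags from two knobs (hypothetical helper).

    wc_enabled: the guest has the virtual write cache (WC) turned on.
    direct:     bypass the host page cache (the proposed direct=on|off).
    """
    flags = base_flags
    # direct=on|off only controls use of the host page cache.
    if direct:
        flags |= os.O_DIRECT
    else:
        flags &= ~os.O_DIRECT
    # With WC enabled the guest sends explicit flushes, so O_DSYNC is dropped.
    # With WC disabled every write must be stable on completion.
    if wc_enabled:
        flags &= ~os.O_DSYNC
    else:
        flags |= os.O_DSYNC
    return flags
```

Because the function takes the current flags as input, the same call can be reused when the guest toggles WC at runtime or when the setting has to be re-applied on the destination of a live migration.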
>>> Barriers are a Linux-specific implementation detail that is in the process of going away, probably in Linux 2.6.37. But if you want O_DSYNC semantics with a volatile disk write cache there is no way around using a cache flush or the FUA bit on all I/O caused by it.
>>
>> If you have a volatile disk write cache, then we don't need O_DSYNC semantics.
>
> If you present a volatile write cache to the guest you do indeed not need O_DSYNC and can rely on the guest sending fdatasync calls when it wants to flush the cache. But for the statement above you can replace O_DSYNC with fdatasync and it will still be correct. O_DSYNC in current Linux kernels is nothing but an implicit range fdatasync after each write.
Yes. I was stuck on O_DSYNC being independent of the virtual WC but it's clear to me now that it cannot be.
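To make the O_DSYNC/fdatasync equivalence concrete, here is a minimal sketch in Python, assuming a Linux host. The function names are made up for illustration. Note that `os.fdatasync` flushes the whole file rather than only the range just written, so the second variant approximates the "implicit range fdatasync" described above.

```python
import os

def write_all(fd, data):
    """Write all of data to fd, retrying after short writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]

def write_dsync(path, data):
    """Open with O_DSYNC: each write returns only once the data is stable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        write_all(fd, data)
    finally:
        os.close(fd)

def write_then_fdatasync(path, data):
    """Plain write followed by an explicit fdatasync on the descriptor."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        write_all(fd, data)
        os.fdatasync(fd)
    finally:
        os.close(fd)
```

Either way a cache flush (or FUA) still has to reach the disk, which is why the cost depends on how well the filesystem implements it.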
>>> ext3 and ext4 have really bad fsync implementations. Just use a better filesystem or bug one of its developers if you want that fixed. But except for disabling the disk cache there is no way to get data integrity without cache flushes (the FUA bit is nothing but an implicit flush).
>>
>> But why are we issuing more flushes than the guest is issuing if we don't have to worry about filesystem metadata (i.e. preallocated storage or physical devices)?
>
> Who is "we" and what is the workload/filesystem/kernel combination? Specific details and numbers please.
My concern is ext4. With a preallocated file and cache=none as implemented today, performance is good even when barrier=1. If we enable O_DSYNC, performance will plummet. Ultimately, this is an ext4 problem, not a QEMU problem.
Perhaps we can issue a warning if the WC is disabled and we do an fsstat and see that it's ext4 with barriers enabled.
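A rough sketch of such a check, in Python for brevity. Instead of a statfs-style call it parses text in /proc/mounts format, since the barrier setting shows up in the mount options. `find_mount` and `should_warn` are hypothetical names, and treating `barrier=0` or `nobarrier` as "barriers off" is an assumption about how the option is displayed. Mount points containing escaped characters (such as `\040` for a space) are not handled.

```python
import os

def find_mount(path, mounts_text):
    """Return (fstype, options) of the longest mount point containing path."""
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, fstype, options = fields[1], fields[2], fields[3].split(",")
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fstype, options)
    if best is None:
        return None
    return best[1], best[2]

def should_warn(image_path, wc_enabled, mounts_text):
    """True if WC is disabled and the image sits on ext4 with barriers on."""
    if wc_enabled:
        return False
    found = find_mount(os.path.abspath(image_path), mounts_text)
    if found is None:
        return False
    fstype, options = found
    # Assumption: barriers are on unless one of these options is present.
    barriers_off = "nobarrier" in options or "barrier=0" in options
    return fstype == "ext4" and not barriers_off
```

On a live system the text would come from reading `/proc/mounts`; passing it in as a string keeps the check easy to exercise with sample data.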
I think it's more common for a user to want to disable a virtual WC because they have less faith in the hypervisor than they have in the underlying storage.
The scenarios I am concerned about:
1) User has enterprise storage, but has an image on ext4 with barrier=1. User explicitly disables WC in guest because they have enterprise storage but not a UPS for the hypervisor.
2) User does not have enterprise storage, but has an image on ext4 with barrier=1. User explicitly disables WC in guest because they don't know what they're doing.
In the case of (1), the answer may be "ext4 sucks, remount with barrier=0" but I think we need to at least warn the user of this.
For (2), again it's probably the user doing the wrong thing because if they don't have enterprise storage, then they shouldn't care about a virtual WC. Practically though, I've seen a lot of this with users.
Regards,

Anthony Liguori