Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?


From: Jamie Lokier
Subject: Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
Date: Tue, 22 Jul 2008 00:47:30 +0100
User-agent: Mutt/1.5.13 (2006-08-11)

Anthony Liguori wrote:
> >My main concern is corruption of the QCOW2 sector allocation map, and
> >subsequently QEMU/KVM breaking or going wildly haywire with that file.
> >
> >With a normal filesystem, sure, there are lots of ways to get
> >corruption when certain events happen.  But you don't lose the _whole_
> >filesystem.
> 
> Sure you can.  If you don't have a battery backed disk cache and are 
> using write-back (which is usually the default), you can definitely get 
> corruption of the journal.  Likewise, under the right scenarios, you 
> will get journal corruption with the default mount options of ext3 
> because it doesn't use barriers.

Well, no, when you get filesystem corruption, you don't lose the whole
filesystem.  If you have 10,000,000 files in 100GB, you might lose a
fraction of it.

Also, you're unlikely to lose anything which you're not writing to at
all at the time.  E.g. the OS installation.

My worry is if I have the same amount of data in a QCOW2, I might lose
all of it including the OS, which seems much harsher.  And there isn't
a way to recover it.

But I don't know if QCOW2 is that sensitive, that's why I'm asking.

> This is very hard to see happen in practice though because these windows 
> are very small--just like with QEMU.

The software-caused windows are non-existent on a modern filesystem
with good practice: barriers, decent disks.

> >My concern is that if the QCOW2 sector allocation map is corrupted by
> >these events, you may lose the _whole_ virtual machine, which can be a
> >pretty big loss.
> >
> >Is the format robust enough to prevent that from being a problem?
> 
> It could be extended to contain a journal.  But that doesn't guarantee 
> that you won't lose data because of your file system failing, that's the 
> point I'm making.

Erm.  I think you're answering a different question to the one I'm asking :-)

If my host filesystem is using ext3 with barriers enabled, and the
block device underlying it supports barriers, then I *never* expect to
see host filesystem corruption even on power failure, unless there is
a blatant hardware fault.

The "operation windows" for corruption are non-existent; they are
eliminated in principle.  There is *no* sequence of software events
with a time window in which a sudden power failure or crash results
in corruption.

I regard that as very robust.

However, I do expect to see corruption in QCOW2 images from killing
the process, or shutting down the host without remembering to shut
down all guests first, or losing power on the host.

It only happens on sector allocation.  But doesn't that happen quite
often, i.e. whenever the image grows?  I find my images grow very
often in normal usage, until they approach the size of the flat format.

And, apart from worrying about software corruption windows in QCOW2,
I'm worried that I'll lose the whole installed operating system,
applications, etc., if QEMU can't then read the QCOW2 image properly.
Rather than just a few files.

> >>you have a file system that supports barriers and barriers 
> >>are enabled by default (they aren't enabled by default with ext2/3)
> >
> >There was recent talk of enabling them by default for ext3.
> 
> It's not going to happen.

You may be right.  This is one more reason why I'm asking myself if
ext3 on my VM hosts is such a smart idea...

ext4, however, has barriers enabled by default since 2.6.26 :-)
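
In the meantime, ext3 barriers can be turned on explicitly per mount;
a minimal example (the device and mount point are just placeholders):

    mount -t ext3 -o barrier=1 /dev/sda1 /mnt/vm-images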

> >>you are running QEMU with cache=off to disable host write caching.  
> >
> >Doesn't that use O_DIRECT?  O_DIRECT writes don't use barriers, and
> >fsync() does not deterministically issue a disk barrier if there's no
> >metadata change, so O_DIRECT writes are _less_ safe with disks which
> >have write-cache enabled than using normal writes.
>
> It depends on the filesystem.  ext3 never issues any barriers by default 
> :-)

But even with barrier=1, it doesn't issue them with O_DIRECT writes.

> I would think a good filesystem would issue a barrier after an O_DIRECT 
> write.

For O_SYNC, maybe.  But O_DIRECT: that would be more barriers than
most applications want.  Unnecessary barriers are not cheap,
especially on IDE (see below).

It would be better if fdatasync() issued the barrier if there have
been any O_DIRECT writes since the last barrier, even if there's no
cached data to write.  That gives the app a chance to decide where and
when to have barriers.
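
To make this concrete, here is a rough sketch of the pattern at the
application level (the file name, sizes and alignment are made up, and
error handling is trimmed); whether the fdatasync() actually reaches
the platter is exactly the filesystem-dependent part discussed above:

    #define _GNU_SOURCE          /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical image file; O_DIRECT bypasses the host page cache. */
        int fd = open("disk.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        /* O_DIRECT needs an aligned buffer, offset and length. */
        void *buf;
        posix_memalign(&buf, 4096, 4096);
        memset(buf, 0, 4096);

        pwrite(fd, buf, 4096, 0);  /* lands in the disk's write cache */
        fdatasync(fd);             /* ask the kernel to flush that cache;
                                      whether a real disk flush goes out
                                      depends on filesystem and barriers */

        free(buf);
        close(fd);
        return 0;
    }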

Otherwise you can't use O_DIRECT to simulate "filesystem in a file"
with performance characteristics similar to those of a real filesystem.

This is getting off-topic for qemu-devel though.

> >What about using a partition, such as an LVM volume (so it can be
> >snapshotted without having to take down the VM)?  I'm under the
> >impression there is no way to issue disk barrier flushes to a
> >partition, so that's screwed too.  (Besides, LVM doesn't propagate
> >barrier requests from filesystems either...)
> 
> Unfortunately there is no userspace API to inject barriers in a disk.  
> fdatasync() maybe but that's not the same behavior as a barrier?

It's not the same behaviour in theory, but in Linux they are muddled
together to be the same thing.  I tried clarifying the difference on
linux-fsdevel, and just got filesystem developers telling me that
barriers on Linux block devices always imply a flush, not just
ordering, so there's no reason to have different requests.

At the application level, the best you can do with normal files is
wait until your AIO writes return, then issue and wait for fdatasync,
then start more writes.  It seems the same AIO methods could work
equally on a block device, with fdatasync sending a barrier+flush
request to the disk if there have been any preceding writes.  It would
be rather convenient too.
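
A sketch of that pattern with POSIX AIO, just to illustrate (the file
name and sizes are invented, error handling is trimmed, link with -lrt):

    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("disk.img", O_WRONLY | O_CREAT, 0644);
        static char buf[4096];

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;

        aio_write(&cb);                    /* start the async write       */

        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);        /* 1. wait for it to return    */
        if (aio_error(&cb) != 0 || aio_return(&cb) != (ssize_t)sizeof(buf))
            perror("aio_write");

        fdatasync(fd);                     /* 2. issue and wait for flush */

        /* 3. ... only now start the next batch of writes */
        close(fd);
        return 0;
    }

On a block device the same calls would work unchanged; the open
question is only what fdatasync() translates to at the driver level.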

> I don't think IDE supports barriers at all FWIW.  It only has a
> write-back and write-through mode so if you care about data, you
> would have to enable write-through in your guest.

Not quite true.  IDE supports barriers on the host by the host kernel
waiting for writes to complete, issuing an IDE FLUSH CACHE
command, then allowing later writes to start.  It also uses the FUA
("Force Unit Access") bit to do uncached single-sector writes.  So
it does support barriers in a roundabout way, on nearly all IDE disks.

It makes a difference.  I have some devices with ext3 IDE disks that
get corrupt from time to time in normal usage (which includes pulling
the plug regularly :-), unless I enable barriers or turn off
write-cache.  But turning off write-cache slows them down a lot, and
barriers slow them down only a little, so IDE barriers are good.

> >Does this apply to KVM as well?  I thought KVM had separate threads
> >for I/O, so problems in another subsystem might crash an I/O thread in
> >mid action.  Is that work in progress?
> 
> Not really.  There is a big lock that prevents two threads from ever
> running at the same time within QEMU.

Oh.  What a curious form of threading :-)

-- Jamie



