On Fri, Sep 10, 2010 at 10:22 PM, Jamie Lokier<address@hidden> wrote:
Stefan Hajnoczi wrote:
Since there is no ordering imposed between the data write and metadata
update, the following scenarios may occur on crash:
1. Neither data write nor metadata update reach the disk. This is
fine, qed metadata has not been corrupted.
2. Data reaches disk but metadata update does not. We have leaked a
cluster but not corrupted metadata. Leaked clusters can be detected
with qemu-img check.
3. Metadata update reaches disk but data does not. The interesting
case! The L2 table now points to a cluster which is beyond the last
cluster in the image file. Remember that the file size is rounded down
to a multiple of the cluster size, so a partial data write at the end
of the file is discarded and this case applies.
Better add:
4. File size is extended fully, but the data didn't all reach the disk.
This case is okay.
If a data cluster does not reach the disk but the file size is
increased there are two outcomes:
1. A leaked cluster if the L2 table update did not reach the disk.
2. A cluster with junk data, which is fine since the guest has no
promise that the data safely landed on disk before a flush has completed.
A flush is performed after allocating new L2 tables and before linking
them into the L1 table. Therefore clusters can be leaked but an
invalid L2 table can never be linked into the L1 table.
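The ordering in that last paragraph can be sketched with a toy model
(all names below are hypothetical and purely for illustration, not the
actual qed code): writes sit in a volatile cache until a flush, and the
new L2 table is flushed before the L1 table is allowed to reference it.

```python
class Disk:
    """Toy disk: writes are cached until flush(); the cache is lost on crash."""
    def __init__(self):
        self.stable = {}    # survives a crash
        self.pending = {}   # lost on a crash

    def write(self, key, value):
        self.pending[key] = value

    def flush(self):
        self.stable.update(self.pending)
        self.pending.clear()

def allocating_write(disk, cluster, l2, data):
    # The data write and the new L2 table may land in any order...
    disk.write(("cluster", cluster), data)
    disk.write(("l2_table", l2), {"entry": cluster})
    # ...but a flush is issued before the L2 table is linked into L1.
    disk.flush()
    disk.write(("l1_entry", 0), l2)

disk = Disk()
allocating_write(disk, cluster=0x30000, l2=0x20000, data=b"guest data")

# Simulate a crash before the L1 update reaches the disk: the cluster
# and L2 table are merely leaked; L1 never points at an unwritten L2 table.
survivors = disk.stable
assert ("l2_table", 0x20000) in survivors
assert ("l1_entry", 0) not in survivors
```

The point of the barrier is visible in the final two assertions: the
worst crash outcome is a leak, never a dangling L1 reference.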
5. Metadata is partially updated.
6. (Nasty) A partial metadata write has clobbered neighbouring
metadata which wasn't meant to be changed. (This may happen up
to a sector size on normal hard disks - data is hard to come by.
It happens to a much larger file range on flash and RAIDs
sometimes - I call it the "radius of destruction".)
Case 6 can also happen when doing the L1 update mentioned earlier, in
which case you might lose a much larger part of the guest image.
These two cases are problematic.
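To get a feel for how bad case 6 can be, here is a back-of-the-envelope
calculation (the sizes are assumptions for illustration: qed table
entries are 64-bit offsets, a classic hard-disk sector is 512 bytes, and
64 KiB is one plausible flash erase-block size - real devices vary):

```python
ENTRY_SIZE = 8          # bytes per L1/L2 table entry (64-bit offset)
SECTOR = 512            # classic hard-disk sector size
ERASE_BLOCK = 64 * 1024 # example flash erase-block size (device-dependent)

# Updating a single entry still rewrites a whole sector, so a torn
# write can clobber every neighbouring entry sharing that sector:
entries_per_sector = SECTOR // ENTRY_SIZE
entries_per_erase_block = ERASE_BLOCK // ENTRY_SIZE

print(entries_per_sector)       # 64 entries at risk per torn sector
print(entries_per_erase_block)  # 8192 entries at risk per erase block
```

So even the hard-disk case puts dozens of unrelated table entries inside
the radius of destruction, and flash can multiply that by two orders of
magnitude.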