qemu-devel

Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format


From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Wed, 08 Sep 2010 08:26:00 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.11) Gecko/20100713 Lightning/1.0b1 Thunderbird/3.0.6

On 09/08/2010 08:20 AM, Kevin Wolf wrote:
On 08.09.2010 14:48, Anthony Liguori wrote:
On 09/08/2010 03:23 AM, Avi Kivity wrote:
  On 09/08/2010 01:27 AM, Anthony Liguori wrote:
FWIW, L2s are 256K at the moment and with a two-level table, it can
support 5PB of data.

I clearly suck at basic math today.  The image supports 64TB today.
Dropping to 128K tables would reduce it to 16TB and 64k tables would
be 4TB.
Maybe we should do three levels then.  Some users are bound to
complain about 64TB.
That's just the default size.  The table size and cluster sizes are
configurable.  Without changing the cluster size, the image can support
up to 1PB.
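
For reference, the arithmetic behind those numbers (a rough sketch,
assuming 8-byte table entries and a 64 KiB cluster size, neither of
which is spelled out above):

    KiB = 1024

    def max_image_size(table_size, cluster_size=64 * KiB, entry_size=8):
        # L1 entries * entries per L2 table * bytes per cluster
        entries = table_size // entry_size
        return entries * entries * cluster_size

    for table in (64 * KiB, 128 * KiB, 256 * KiB, 1024 * KiB):
        print(table // KiB, "KiB tables ->",
              max_image_size(table) // KiB**4, "TiB")
    # 64 KiB tables   ->    4 TiB
    # 128 KiB tables  ->   16 TiB
    # 256 KiB tables  ->   64 TiB
    # 1024 KiB tables -> 1024 TiB (= 1 PiB), still with 64 KiB clusters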

BTW, I don't think your checksumming idea is sound.  If you store a
64-bit checksum alongside each pointer, it becomes necessary to update
the parent pointer every time the table changes.  This introduces an
ordering requirement which means you need to sync() the file every
time you update an L2 entry.
Even worse, if the crash happens between an L2 update and an L1
checksum update, the entire cluster goes away.  You really want
allocate-on-write for this.
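
To make that ordering concrete, here is a purely illustrative sketch
of the write path such a design forces (the image accessors are made
up for illustration, not QED or qcow2 code):

    import struct, zlib   # crc32 stands in for a real 64-bit checksum

    def update_l2_entry(img, l1_index, l2_index, cluster_offset):
        # 'img' and its methods are hypothetical placeholders
        l2 = img.read_l2_table(l1_index)
        l2[l2_index] = cluster_offset
        img.write_l2_table(l1_index, l2)
        img.fsync()   # L2 must be on disk before L1 carries its new checksum
        csum = zlib.crc32(struct.pack("<%dQ" % len(l2), *l2))
        img.update_l1_entry(l1_index, img.l2_offset(l1_index), csum)
        img.fsync()   # a crash between the two writes leaves a checksum
                      # mismatch unless the L2 table went to a fresh
                      # location, i.e. allocate-on-write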

Today, we only need to sync() when we first allocate an L2 entry
(because their locations never change).  From a performance
perspective, it's the difference between an fsync() every 64k vs.
every 2GB.
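
In concrete terms (same assumptions as above: 64 KiB clusters, 256 KiB
L2 tables with 8-byte entries):

    KiB = 1024
    entries_per_l2 = (256 * KiB) // 8          # 32768 entries
    data_per_l2 = entries_per_l2 * 64 * KiB    # guest data mapped by one L2 table
    print(data_per_l2 // KiB**3)               # -> 2 (GiB)

    # Sync only when an L2 table is allocated: one fsync per ~2 GiB written.
    # Sync on every checksummed L2 update: one fsync per 64 KiB cluster.
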
Yup.  From a correctness perspective, it's the difference between a
corrupted filesystem on almost every crash and a corrupted filesystem
in some very rare cases.
I'm not sure I understand your corruption comment.  Are you claiming
that without checksumming, you'll often get corruption, or are you
claiming that without checksums, if you don't sync metadata updates
you'll get corruption?

qed is very careful about ensuring that we don't need to do syncs and we
don't get corruption because of data loss.  I don't necessarily buy your
checksumming argument.

Plus, doesn't btrfs do block-level checksumming?  IOW, if you run a
workload where you care about this level of data integrity
validation, btrfs + qed would be fine.
Or just btrfs by itself (use btrfs for snapshots and base images, use
qemu-img convert for shipping).

Since the majority of file systems don't do metadata checksumming,
it's not obvious to me that we should.
The logic is that as data sizes increase, the probability of error
increases.

I think one of the critical flaws in qcow2 was trying to invent a
better filesystem within qemu instead of just sticking to a very
simple and obviously correct format and letting the FS folks do the
really fancy stuff.
Well, if we introduce a minimal format, we need to make sure it isn't
too minimal.

I'm still not sold on the idea.  What we're doing now is pushing the
qcow2 complexity to users.  We don't have to worry about refcounts
now, but users have to worry about whether the machine they're
copying the image to supports qed or not.

The performance problems with qcow2 are solvable.  If we preallocate
clusters, the performance characteristics become essentially the same
as qed.
By creating two code paths within qcow2.  It's not just the reference
counts, it's the lack of guaranteed alignment, compression, and some of
the other poor decisions in the format.
I'm not aware of any unaligned data in qcow2. Compression can leave some
sectors sparse, but that's something the FS has to deal with, not qcow2.

If my memory serves, you changed qcow2 some time ago to make sure that metadata is aligned, but historically we didn't always do that, and the qcow2 format doesn't enforce that metadata is aligned.

This means that if you did try to make a version of qcow2 that was fully async or really just fast, you'd have to make sure you dealt with unaligned accesses and bounce buffers accordingly.
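
As an illustration of what that bookkeeping looks like (a minimal
sketch assuming 512-byte alignment; the names are invented, not qcow2
code):

    SECTOR = 512  # assumed alignment granularity for O_DIRECT-style I/O

    def aligned_window(offset, length, align=SECTOR):
        # Expand an unaligned metadata access into an aligned read into a
        # bounce buffer; returns (read_offset, read_length, start_in_buffer).
        start = offset - offset % align
        end = offset + length
        end += -end % align
        return start, end - start, offset - start

    # A 16-byte entry at offset 1000 turns into a 512-byte read at offset 512,
    # with the wanted bytes at buffer[488:504].
    print(aligned_window(1000, 16))   # -> (512, 512, 488)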

If you have two code paths in qcow2, you have non-deterministic
performance because users that do reasonable things with their images
will end up getting catastrophically bad performance.
Compression and encryption lead to bad performance, yes. These are very
clear criteria and something very easy to understand for users. I've
never heard any user complain about this "non-deterministic" behaviour.

That's because qcow2 has always been limited in its performance, so it's quite deterministic :-)

Don't get me wrong, you and others have done amazing things making qcow2 better than it was, and it's pretty reasonable when dealing with IDE and a single backing spindle, but when dealing with virtio and a large storage array, it simply doesn't even come close to raw. FWIW, we'll have numbers later this week with a detailed comparison.

Regards,

Anthony Liguori

A new format doesn't introduce much additional complexity.  We provide
an image conversion tool, and we can almost certainly provide an in-place
conversion tool that makes the process very fast.
I'm not convinced that in-place conversion is worth the trouble.

Kevin



