Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format


From: Avi Kivity
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Sun, 12 Sep 2010 17:56:02 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Lightning/1.0b3pre Thunderbird/3.1.3

 On 09/12/2010 05:13 PM, Anthony Liguori wrote:
On 09/12/2010 08:24 AM, Avi Kivity wrote:
Not atexit, just when we close the image.

Just a detail, but we need an atexit() handler to make sure block devices get closed because we have too many exit()s in the code today.


Right.

So when you click the 'X' on the qemu window, we get to wait a few seconds for it to actually disappear because it's flushing metadata to disk...

If it was doing heavy write I/O, you'll need to wait a bit (a few seconds correspond to a few hundred clusters' worth of metadata). If it managed to flush while you were moving your mouse, there's no delay.


When considering development time, also consider the time it will take users to actually use qed (6 months for qemu release users, ~9 months on average for semiannual community distro releases, 12-18 months for enterprise distros). Consider also that we still have to support qcow2, since people do use the extra features, and since I don't see us forcing them to migrate.

I'm of the opinion that qcow2 is unfit for production use for the type of production environments I care about. The amount of change needed to make qcow2 fit for production use puts it on at least the same timeline as you cite above.

If it's exactly the same time, we gain by having one less format.


Yes, there are people today for whom qcow2 is appropriate, and by the same token, it will continue to be appropriate for them in the future.

In my view, we don't have an image format fit for production use. You're arguing we should make qcow2 fit for production use whereas I am arguing we should start from scratch. My reasoning for starting from scratch is that it simplifies the problem. Your reasoning for improving qcow2 is simplifying the transition for non-production users of qcow2.

We have an existence proof that we can achieve good data integrity and good performance by simplifying the problem. The remaining burden is establishing that it's possible to improve qcow2 with a reasonable amount of effort.

Agreed.


I realize it's somewhat subjective though.

While qed looks like a good start, it has at least three flaws already (relying on physical image size, relying on fsck, and limited logical image size). Just fixing those will introduce complication. What about new features or newly discovered flaws?

Let's quantify fsck. My suspicion is that if you've got the storage for 1TB disk images, it's fast enough that fsck won't be so bad.

It doesn't follow. The storage is likely to be shared among many guests. The image size (or how full it is) doesn't really matter; startup time is the aggregate number of L2 tables across all images starting now, divided by the number of spindles, divided by the number of IOPS each spindle provides.

Since an L2 spans a lot of logical address space, it is likely that many L2s will be allocated (in fact, it makes sense to preallocate them).
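
A back-of-the-envelope version of that estimate, with numbers made up purely for illustration:

    # Hypothetical: 100 guests starting at once, 50 preallocated L2 tables
    # per image, on a 12-spindle array at 150 IOPS per spindle.
    l2_tables_per_image = 50
    images_starting = 100
    spindles = 12
    iops_per_spindle = 150

    # One read per L2 table, spread across the whole array:
    startup_seconds = (l2_tables_per_image * images_starting
                       / (spindles * iops_per_spindle))
    print(startup_seconds)   # ~2.8 seconds of aggregate fsck I/O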


Keep in mind, we don't have to completely pause the guest while fsck'ing. We simply have to prevent cluster allocations. We can allow reads and we can allow writes to allocated clusters.

True.


Consequently, if you had a 1TB disk image, it's extremely likely that the vast majority of I/O is just to allocated clusters which means that fsck() is entirely a background task. The worst case scenario is actually a half-allocated disk.

No, the worst case is 0.003% allocated disk, with the allocated clusters distributed uniformly. That means all your L2s are allocated, but almost none of your clusters are.
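
The arithmetic behind that figure, assuming 64KB clusters and the 2GB-per-L2 span cited later in this thread:

    # One 64KB cluster allocated in each 2GB L2 range keeps every L2
    # table allocated while almost none of the data clusters are.
    cluster_size = 64 * 1024
    l2_span = 2 * 1024**3
    print(cluster_size / l2_span * 100)   # ~0.003% of the disk allocated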


But since you have to boot before you can run any serious test, if it takes 5 seconds to do an fsck(), it's highly likely that it's not even noticeable.

What if it takes 300 seconds?


Maybe I'm broken with respect to how I think, but I find state machines very easy to reason about.

Your father's state machine. Not as clumsy or random as a thread; an elegant weapon for a more civilized age.

I find your lack of faith in QED disturbing.

When 900 years old you reach, state machines you will not find so easy to understand.

To me, the biggest burden in qcow2 is thinking through how you deal with shared resources. Because you can block for a long period of time during write operations, it's not enough to just carry a mutex during all metadata operations. You have to stage operations and commit them at very specific points in time.

The standard way of dealing with this is to have a hash table for metadata that contains a local mutex:

    import threading
    from collections import defaultdict

    class L2:
        def __init__(self):
            self.mutex = threading.Lock()
            self.valid = False
            self.dirty = False
            self.pos = None
        def read(self): ...    # load the table from disk at self.pos
        def write(self): ...   # write the table back to disk

    l2cache = defaultdict(L2)

    def get_l2(pos):
        l2 = l2cache[pos]      # defaultdict creates an empty L2 on first use
        l2.mutex.acquire()     # held until the matching put_l2()
        if not l2.valid:
            l2.pos = pos
            l2.read()
            l2.valid = True
        return l2

    def put_l2(l2):
        if l2.dirty:
            l2.write()         # flush before releasing the table
            l2.dirty = False
        l2.mutex.release()
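
A hypothetical caller allocating a new cluster would bracket its table update with these helpers ('table' and 'index' here are illustrative stand-ins, not real qcow2 layout):

    l2 = get_l2(pos)
    l2.table[index] = new_cluster_offset   # update the in-memory entry
    l2.dirty = True
    put_l2(l2)                             # flushed and unlocked here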

You're missing how you create entries.  That means you've got to do:

def put_l2(l2):
    if l2.committed:
        if l2.dirty:
            l2.write()
            l2.dirty = False
        l2.mutex.release()
    else:
        l2.mutex.acquire()
        l2cache[l2.pos] = l2
        l2.mutex.release()

The in-memory L2 is created by defaultdict(). I did omit linking L2 into L1, but that's a function call. With a state machine, it's a new string of states and calls.


And this really illustrates my point. It's a harder problem than it seems. You're also keeping L2 reads from occurring while flushing a dirty L2 entry, which is less parallel than what qed achieves today.

There are standard threading primitives like shared/exclusive locks or barriers that can be used to increase concurrency. It's nowhere near as brittle as modifying a state machine.
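
Python's threading module doesn't ship a shared/exclusive lock, so here is a minimal sketch of the primitive meant here (illustrative only; writer starvation is ignored):

    import threading

    class SharedExclusiveLock:
        def __init__(self):
            self.cond = threading.Condition()
            self.readers = 0
            self.writer = False

        def acquire_shared(self):        # many concurrent readers
            with self.cond:
                while self.writer:
                    self.cond.wait()
                self.readers += 1

        def release_shared(self):
            with self.cond:
                self.readers -= 1
                if self.readers == 0:
                    self.cond.notify_all()

        def acquire_exclusive(self):     # one writer, no readers
            with self.cond:
                while self.writer or self.readers:
                    self.cond.wait()
                self.writer = True

        def release_exclusive(self):
            with self.cond:
                self.writer = False
                self.cond.notify_all()

Lookups would take the lock shared and updates exclusive; since a flush only reads the in-memory table to write it to disk, it too can run shared, which addresses the loss of parallelism noted above.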


This is part of why I prefer state machines. Acquiring a mutex is too easy, and that makes it easy not to think through everything that could be running concurrently. When you are more explicit about when you are allowing concurrency, I think it's easier to be more aggressive.

It's a personal preference really. You can find just as many folks on the intertubes that claim Threads are Evil as claim State Machines are Evil.

The dark side of the force is tempting.

The only reason we're discussing this is that you've claimed QEMU's state machine model is the biggest inhibitor, and I think that's oversimplifying things. It's like saying QEMU's biggest problem is that too many of its developers use vi versus emacs. You may personally believe that vi is entirely superior to emacs, but by the same token, you should be able to recognize that some people are able to be productive with emacs.

If someone wants to rewrite qcow2 to be threaded, I'm all for it. I don't think it's really any simpler than making it a state machine. I find it hard to believe you think there's an order of magnitude difference in development work too.

Kevin is best positioned to comment on this.

It's far easier to just avoid internal snapshots altogether and this is exactly the thought process that led to QED. Once you drop support for internal snapshots, you can dramatically simplify.

The amount of metadata is O(nb_L2 * nb_snapshots). For qed, nb_snapshots = 1 but nb_L2 can still be quite large. If fsck is too long for one, it is too long for the other.

nb_L2 is very small. It's exactly n / 2GB + 1 where n is the image size. Since image size is typically < 100GB, practically speaking it's around 50 at most.

OTOH, nb_snapshots in qcow2 can be very large. In fact, it's not unrealistic for nb_snapshots to be >> 50. What that means is that instead of metadata being O(n) as it is today, it's at least O(n^2).

Why is it n^2? It's still n*m. If your image is 4TB instead of 100GB, the time increases by a factor of 40 for both.
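
To make the n*m scaling concrete (sizes from the discussion, code purely for illustration):

    def nb_l2(image_bytes):
        # one L2 table per 2GB of logical space, as above
        return image_bytes // (2 * 1024**3) + 1

    GB = 1024**3
    TB = 1024 * GB
    print(nb_l2(100 * GB))        # 51 tables for a 100GB image
    print(nb_l2(4 * TB))          # 2049 tables at 4TB: ~40x, linear in n
    print(nb_l2(100 * GB) * 50)   # 2550 with 50 snapshots: n*m, not n^2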

Not doing qed-on-lvm is definitely a limitation. The one use case I've heard is qcow2 on top of clustered LVM as clustered LVM is simpler than a clustered filesystem. I don't know the space well enough so I need to think more about it.

I don't either. If this use case survives, and if qed isn't changed to accommodate it, it means that that's another place where qed can't supplant qcow2.

I'm okay with that. An image file should require a file system. If I were going to design an image file to be used on top of raw storage, I would take an entirely different approach.

That spreads our efforts further.

Refcount table. See above discussion for my thoughts on refcount table.

Ok. It boils down to "is fsck on startup acceptable". Without a freelist, you need fsck for both unclean shutdown and for UNMAP.

To rebuild the free list on unclean shutdown.

If you have an on-disk compact freelist, you don't need that fsck. If your freelist is the L2 table, then you need that fsck to find out if you have any holes in your image.
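
A sketch of what that fsck scan amounts to, with hypothetical table structures (entries are cluster indices, 0 meaning unallocated; header and L1 clusters are omitted for brevity):

    def rebuild_free_list(l1, read_l2, nb_physical_clusters):
        # Walk every L2 table reachable from L1 and record which physical
        # clusters are referenced; everything else is free.
        used = set()
        for l2_cluster in l1:
            if l2_cluster == 0:
                continue
            used.add(l2_cluster)
            for data_cluster in read_l2(l2_cluster):
                if data_cluster != 0:
                    used.add(data_cluster)
        return [c for c in range(nb_physical_clusters) if c not in used]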

On the other hand, allocating a cluster in qcow2 as it is now requires scanning the refcount table. Not very pretty. Kevin, how does that perform?

(an aside: with cache!=none we're bouncing in the kernel as well; we really need to make it work for cache=none, perhaps use O_DIRECT for data and writeback for metadata and shared backing images).

QED achieves zero-copy with cache=none today. In fact, our performance testing that we'll publish RSN is exclusively with cache=none.

In this case, preallocation should really be cheap, since there isn't a ton of dirty data that needs to be flushed. You issue an extra flush once in a while so your truncate (or physical image size in the header) gets to disk, but that doesn't block new writes.

It makes qed/lvm work, and it replaces the need to fsck for the next allocation with the need for a background scrubber to reclaim storage (you need that anyway for UNMAP). It makes the whole thing a lot more attractive IMO.


Yes, you'll want to have that regardless. But adding new things to qcow2 has all the problems of introducing a new image format.

Just some of them. On mount, rewrite the image format as qcow3. On clean shutdown, write it back to qcow2. So now there's no risk of data corruption (but there is reduced usability).

It means on unclean shutdown, you can't move images to older versions. That means a management tool can't rely on the mobility of images which means it's a new format for all practical purposes.

QED started its life as qcow3. You start with qcow3, remove the features that are poorly thought out and make correctness hard, add some future proofing, and you're left with QED.

We're fully backwards compatible with qcow2 (by virtue that qcow2 is still in tree) but new images require new versions of QEMU. That said, we have a conversion tool to convert new images to the old format if mobility is truly required.

So it's the same story that you're telling above from an end-user perspective.

It's not exactly the same story (you can enable it selectively, or you can run fsck before moving) but I agree it isn't a good thing.


They are once you copy the image. And power loss is the same thing as unexpected exit, because you're not simply talking about delaying a sync; you're talking about staging future I/O operations purely within QEMU.

qed is susceptible to the same problem. If you have a 100MB write and qemu exits before it updates L2s, then those 100MB are leaked. You could alleviate the problem by writing L2 at intermediate points, but even then, a power loss can leak those 100MB.

qed trades off the freelist for the file size (anything beyond the file size is free), it doesn't eliminate it completely. So you still have some of its problems, but you don't get its benefits.

I think you've just established that qcow2 and qed both require an fsck. I don't disagree :-)

There's a difference between a background scrubber and a foreground fsck.

The difference between qcow2 and qed is that qed relies on the file size and qcow2 uses a bitmap.

The bitmap grows synchronously whereas in qed, we're not relying on synchronous file growth. If we did, there would be no need for an fsck.

If you attempt to grow the refcount table in qcow2 without doing a sync(), then you're going to have to have an fsync to avoid corruption.

qcow2 doesn't have an advantage, it's just not trying to be as sophisticated as qed is.

The difference is between preallocation and leaking, on one hand, and uncommitted allocation and later rebuilds, on the other. It isn't a difference between formats, but between implementations.

--
error compiling committee.c: too many arguments to function



