qemu-devel

From: Anthony Liguori
Subject: [Qemu-devel] Re: [RFC][STABLE 0.13] Revert "qcow2: Use bdrv_(p)write_sync for metadata writes"
Date: Wed, 25 Aug 2010 09:14:07 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.11) Gecko/20100713 Lightning/1.0b1 Thunderbird/3.0.6

On 08/25/2010 09:00 AM, Avi Kivity wrote:
> On 08/25/2010 04:42 PM, Anthony Liguori wrote:
>> On 08/25/2010 08:23 AM, Avi Kivity wrote:
>>> On 08/25/2010 03:46 PM, Anthony Liguori wrote:
>>>>
>>>> If we had another disk format that only supported growth and metadata for a backing file, can you think of another failure scenario?
>>>
>>> btw, only supporting growth is a step backwards. Currently file-backed disks keep growing even when the guest-used storage doesn't grow, since once we allocate something we never release it. But eventually guests will start using TRIM or DISCARD or however it's called, and then we can expose it and reclaim unused blocks.
>>
>> You can do this in one of two ways: online compaction or a free list. Online compaction has an advantage because it does not require any operations in the fast path, whereas a free list would require ordered metadata updates (a block must be removed from the free list before the L2 table is updated to point at it), which implies a sync.
>
> DISCARD/TRIM can queue blocks to the same preallocated block list we have for optimizing allocation. New allocations can come from this list; if it grows too large, we sync part of it to disk to avoid losing a lot of free space on power failure.
>
>> At a high level, I don't think online compaction requires any specific support from an image format.
>
> You need to know that the block is free and can be reallocated.
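The idea above of queueing DISCARD/TRIM blocks on the preallocated block list can be sketched as follows. This is a minimal model under stated assumptions, not QEMU code: `PreallocPool`, `max_pending` and the spill-half policy are all hypothetical, clusters are plain integers, and the L2 bookkeeping for discarded clusters is left out. It shows the trade-off described: reusing a queued cluster needs no sync, and spilling part of the queue to disk bounds how much free space a power failure can make the image forget.

```python
class PreallocPool:
    """Hypothetical sketch: DISCARD/TRIM feeds an in-memory list of free
    clusters, and only the excess is written to an on-disk free list.

    max_pending is an assumed tuning knob, not a QEMU parameter.
    """

    def __init__(self, file_clusters, max_pending=4):
        self.file_clusters = file_clusters  # current image size in clusters
        self.max_pending = max_pending
        self.pending = []                   # free clusters known only in memory
        self.on_disk_free = []              # free clusters recorded on disk
        self.syncs = 0                      # number of sync barriers issued

    def discard(self, host_cluster):
        self.pending.append(host_cluster)
        if len(self.pending) > self.max_pending:
            # Spill the oldest half to disk so that a power failure forgets
            # at most max_pending free clusters.
            n = len(self.pending) // 2
            self.on_disk_free.extend(self.pending[:n])
            del self.pending[:n]
            self.syncs += 1

    def allocate(self):
        if self.pending:
            # Cheapest case: reuse a cluster straight from memory, no sync.
            return self.pending.pop()
        if self.on_disk_free:
            # Taking a cluster off the persisted list must be stable before
            # the cluster is reused, hence a sync.
            self.syncs += 1
            return self.on_disk_free.pop()
        # Nothing free: grow the file by one cluster.
        cluster = self.file_clusters
        self.file_clusters += 1
        return cluster
```

For example, with `max_pending=4`, five discards leave three clusters in memory and spill the two oldest to disk with a single sync; the next three allocations are then served from memory without any sync.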

Semantically, TRIM/DISCARD means "I don't care about the contents of the block anymore until I do another write." Behind the scenes, we can keep track of which blocks have been discarded in an in-memory list, and the first write to a block evicts it from the discarded list.
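A minimal sketch of that in-memory discard tracking, assuming cluster granularity; `DiscardTracker` and its methods are hypothetical names. Because the set is never written to disk, DISCARD costs no sync; after a crash the set is simply empty and the clusters remain allocated.

```python
class DiscardTracker:
    """Hypothetical in-memory record of clusters the guest has discarded.

    Nothing here is persisted, so recording a DISCARD needs no sync.
    """

    def __init__(self):
        self.discarded = set()  # guest clusters whose contents are "don't care"
        self.data = {}          # guest cluster -> contents (toy storage)

    def discard(self, cluster):
        # "I don't care about this cluster's contents until I write it again."
        self.discarded.add(cluster)

    def write(self, cluster, payload):
        # The first write after a discard makes the contents meaningful again,
        # so the cluster stops being a reclamation candidate.
        self.discarded.discard(cluster)
        self.data[cluster] = payload

    def is_reclaimable(self, cluster):
        return cluster in self.discarded
```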

A background task would attempt to detect idle I/O and copy a block from the end of the file to a location on the discarded list. When the copy has completed, you can remove the L2 entry for the discarded block (effectively punching a hole in the image), sync, and then update the L2 entry for the block that was at the end of the file to point to its new location. You can then ftruncate to reduce the overall file size.
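A rough sketch of one compaction step, using a hypothetical in-memory model (`ToyImage`, `compact_one`) in place of a real image file: `l2` maps guest clusters to host clusters, `log` records the order of writes, `sync()` stands in for fdatasync() and `truncate()` for ftruncate(). It assumes every discarded guest cluster is still mapped and ignores concurrent guest I/O. The second sync before truncation is an added assumption rather than something the message states; without it, the shrink could become durable before the updated L2 entry.

```python
class ToyImage:
    """Hypothetical in-memory stand-in for an image file plus its L2 table."""

    def __init__(self, l2, data):
        self.l2 = dict(l2)        # guest cluster -> host cluster
        self.data = dict(data)    # host cluster -> cluster contents
        self.discarded = set()    # guest clusters the guest has discarded
        self.log = []             # order in which changes reach the disk

    def sync(self):
        self.log.append("sync")   # stand-in for fdatasync()

    def truncate(self):
        # Stand-in for ftruncate(): drop everything past the last mapped cluster.
        size = max(self.l2.values(), default=-1) + 1
        self.data = {h: d for h, d in self.data.items() if h < size}
        self.log.append(("truncate", size))


def compact_one(img):
    """One background compaction step; returns False when there is nothing to do."""
    if not img.discarded:
        return False
    tail_guest = max(img.l2, key=img.l2.get)  # owner of the last cluster in the file
    if tail_guest in img.discarded:
        # The tail is already garbage: unmap it and shrink the file.
        del img.l2[tail_guest]
        img.discarded.remove(tail_guest)
        img.log.append(("l2_clear", tail_guest))
        img.sync()
        img.truncate()
        return True
    hole_guest = min(img.discarded, key=img.l2.get)  # lowest discarded cluster
    hole_host, tail_host = img.l2[hole_guest], img.l2[tail_guest]
    # 1. Copy the tail cluster's contents into the discarded cluster's slot.
    img.data[hole_host] = img.data[tail_host]
    img.log.append(("copy", tail_host, hole_host))
    # 2. Punch a hole: the discarded guest cluster no longer maps anywhere.
    del img.l2[hole_guest]
    img.discarded.remove(hole_guest)
    img.log.append(("l2_clear", hole_guest))
    # 3. Make the copy and the cleared entry stable before repointing the tail.
    img.sync()
    # 4. Point the moved cluster's L2 entry at its new location.
    img.l2[tail_guest] = hole_host
    img.log.append(("l2_update", tail_guest, hole_host))
    # 5. Assumption: sync again so the new mapping is stable before shrinking.
    img.sync()
    img.truncate()
    return True
```

With three clusters mapped one-to-one and guest cluster 0 discarded, one step copies host cluster 2 into slot 0, repoints guest cluster 2 there, and truncates the file to two clusters.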

If you tried to maintain a free list, then you would need to sync on TRIM/DISCARD, which is potentially a fast path. While a background task may be less efficient in the short term, it's just as efficient in the long term, and it has the advantage of keeping every fast path fast.
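For contrast, here is a toy model of the free-list alternative, showing where the ordered updates force a sync on both the allocation and the discard path. `FreeListImage` is hypothetical, `disk_ops` records the order in which metadata would reach stable storage, and each `"sync"` entry stands in for an fdatasync() barrier.

```python
class FreeListImage:
    """Hypothetical toy model of an image that keeps an on-disk free list."""

    def __init__(self, free_clusters):
        self.free_list = list(free_clusters)  # persisted free clusters
        self.l2 = {}                          # guest cluster -> host cluster
        self.disk_ops = []                    # ordered log of metadata writes

    def _sync(self):
        self.disk_ops.append("sync")

    def allocate(self, guest_cluster):
        host = self.free_list.pop(0)
        # The cluster must be off the free list on disk before any L2 entry
        # points at it; otherwise a crash could leave it mapped *and* free,
        # and a later allocation could hand it out twice.
        self.disk_ops.append(("free_list_remove", host))
        self._sync()
        self.l2[guest_cluster] = host
        self.disk_ops.append(("l2_update", guest_cluster, host))
        return host

    def discard(self, guest_cluster):
        host = self.l2.pop(guest_cluster)
        # Mirror image on DISCARD: the L2 entry must be cleared on disk
        # before the cluster appears on the free list, so the sync lands
        # in what may be a fast path.
        self.disk_ops.append(("l2_clear", guest_cluster))
        self._sync()
        self.free_list.append(host)
        self.disk_ops.append(("free_list_add", host))
        return host

    def sync_count(self):
        return self.disk_ops.count("sync")
```

In this model every allocation and every discard pays one barrier, whereas the compaction approach confines its barriers to the background task.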

Regards,

Anthony Liguori



