Re: [Qemu-devel] Re: Strategic decision: COW format


From: Chunqiang Tang
Subject: Re: [Qemu-devel] Re: Strategic decision: COW format
Date: Mon, 14 Mar 2011 17:32:15 -0400

> > > FVD's novel use of the reference count table reduces the metadata
> > > update overhead to literally zero during normal execution of a VM.
> > > This gets the best of QCOW2's reference count table but without its
> > > overhead. In FVD, the reference count table is only updated when
> > > creating a new snapshot or deleting an existing snapshot. The
> > > reference count table is never updated during normal execution of a
> > > VM.
> > 
> > Do you want to send out a break-down of the steps (and cost) involved
> > in doing:
> > 
> > 1. Snapshot creation.
> > 2. Snapshot deletion.
> > 3. Opening an image with n snapshots.

> Here is a detailed description. Relevant to the discussion of
> snapshots, FVD uses a one-level lookup table and a refcount table.
> FVD’s one-level lookup table is very similar to QCOW2’s two-level
> lookup table, except that it is much smaller in FVD, and is
> preallocated and hence contiguous in the image.
> 
> FVD’s refcount table is almost identical to that of QCOW2, but with a
> key difference. An image consists of an arbitrary number of read-only
> snapshots and a single writeable image front, which is the current
> image state perceived by the VM. Below, I will simply refer to the
> read-only snapshots as snapshots, and to the “writeable image front” as
> “writeable-front.” QCOW2’s refcount table counts clusters that are used
> by either read-only snapshots or writeable-front. Because
> writeable-front changes as the VM runs, QCOW2 needs to update the
> refcount table on the fast path of normal VM execution.
> 
> By contrast, FVD’s refcount table only counts chunks that are used by
> read-only snapshots, and does not count chunks used by writeable-front.
> This is the key that allows FVD to entirely avoid updating the refcount
> table on the fast path of normal VM execution.
> 
> Below are the detailed steps for different operations.
> 
> Operation 1: Open an image with n snapshots.
> 
> Let me introduce some basic concepts first. The storage allocation unit
> in FVD is called a chunk (like a cluster in QCOW2). The default chunk
> size is 1MB, as in VDI (VMDK and Microsoft VHD use 2MB chunks). An FVD
> image file is conceptually divided into chunks, where chunk 0 is the
> first 1MB of the image file, chunk 1 is the second 1MB, … chunk j, …
> and so forth. The size of an image file grows as needed, just like that
> of QCOW2. The refcount table is a linear array “uint16_t refcount[]”.
> If a chunk j is referenced by s different snapshots, then
> refcount[j] = s. If a new snapshot is created and this new snapshot
> also uses chunk j, then refcount[j] is incremented to refcount[j] = s+1.
> 
> If all snapshots together use 1TB of storage space, there are
> 1TB/1MB = 1,000,000 chunks, and the size of the refcount table is 2MB.
> Loading the entire 2MB refcount table from disk into memory takes about
> 15 milliseconds. If the virtual disk size perceived by the VM is also
> 1TB, FVD’s one-level lookup table is 4MB. FVD’s one-level lookup table
> serves the same purpose as QCOW2’s two-level lookup table, but FVD’s
> one-level table is much smaller and is preallocated and hence
> contiguous in the image. Loading the entire 4MB lookup table from disk
> into memory takes about 20 milliseconds. These numbers mean that it is
> quite affordable to scan both tables in their entirety at VM boot time,
> although the scan can also be avoided in FVD. The optimizations will be
> described later.
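> 
> For concreteness, here is a minimal C sketch of these data structures,
> using the round figures above (1TB disk, 1MB chunks). All names are my
> illustration rather than FVD's actual code; the 4-byte lookup entry
> width is inferred from the 4MB table size.
> 
>     /* Illustrative sketch only; names are hypothetical. */
>     #include <stdint.h>
> 
>     enum {
>         CHUNK_SIZE = 1024 * 1024, /* 1MB allocation unit */
>         NUM_CHUNKS = 1000000,     /* 1TB / 1MB, as in the text above */
>     };
> 
>     /* Refcount table, ~2MB: refcount[j] == s means physical chunk j is
>      * used by s read-only snapshots. Chunks used only by writeable-front
>      * are deliberately NOT counted here. */
>     uint16_t refcount[NUM_CHUNKS];
> 
>     /* One-level lookup table, ~4MB: maps each virtual chunk to a
>      * physical chunk in the image file. One bit of each entry is stolen
>      * to mark "shared with a snapshot; copy-on-write needed". */
>     #define COW_BIT   0x80000000u
>     #define ADDR_MASK 0x7fffffffu
>     uint32_t lookup[NUM_CHUNKS];
> 
>     /* In-memory free-chunk-bitmap, ~125KB: bit j set means physical
>      * chunk j is used by a snapshot or writeable-front; clear means the
>      * chunk is free. */
>     uint8_t chunk_used[NUM_CHUNKS / 8];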
> 
> When opening an image with n snapshots, an unoptimized version of FVD
> performs the following steps:
> 
> O1: Load the entire 2MB reference count table from disk into memory.
> This step takes about 15ms.
> 
> O2: Load the entire 4MB lookup table from disk into memory. This step
> takes about 20ms.
> 
> O3: Use the two tables to build an in-memory data structure called the
> “free-chunk-bitmap.” This step takes about 2ms. The free-chunk-bitmap
> identifies free chunks that are used by neither snapshots nor
> writeable-front, and hence can be allocated for future writes. The size
> of the free-chunk-bitmap is only 125KB for a 1TB disk, so the memory
> overhead is negligible. The free-chunk-bitmap also supports trim
> operations. It does not have to be persisted on disk, as it can always
> be rebuilt easily, although as an optimization it can be persisted on
> disk at VM shutdown.
> 
> O4: Compare the refcount table and the lookup table to identify chunks
> that are in both tables (i.e., shared), so that the running VM’s writes
> to those chunks in writeable-front trigger copy-on-write. This step
> takes about 2ms. One bit in each lookup table entry is stolen to mark
> whether a chunk in writeable-front is shared with snapshots and hence
> needs copy-on-write upon a write.
> 
> The whole process above, i.e., opening an image with n (e.g., n=1000)
> snapshots, takes about 39ms, and it is a one-time cost at VM boot.
> Later, I will describe an optimization that can further reduce this
> 39ms by saving the 125KB free-chunk-bitmap to disk at VM shutdown, but
> that optimization is more than likely an over-engineering effort, given
> that 39ms at VM boot is perhaps not a big deal.
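> 
> Steps O3 and O4 are two linear scans over the in-memory tables. A
> minimal sketch, reusing the hypothetical structures above and assuming
> (my simplification) that a zero lookup entry means "virtual chunk not
> yet allocated":
> 
>     /* Sketch of open-time steps O3 and O4; hypothetical code. */
>     static void build_bitmap_and_cow_marks(void)
>     {
>         /* O3: chunks referenced by any snapshot are in use. */
>         for (uint32_t j = 0; j < NUM_CHUNKS; j++) {
>             if (refcount[j] > 0) {
>                 chunk_used[j / 8] |= 1u << (j % 8);
>             }
>         }
>         for (uint32_t v = 0; v < NUM_CHUNKS; v++) {
>             uint32_t entry = lookup[v];
>             if (entry == 0) {
>                 continue;              /* virtual chunk v not allocated */
>             }
>             uint32_t j = entry & ADDR_MASK;
> 
>             /* O3: chunks referenced by writeable-front are in use, too. */
>             chunk_used[j / 8] |= 1u << (j % 8);
> 
>             /* O4: a chunk in both tables is shared with a snapshot, so
>              * the VM's next write to it must trigger copy-on-write. */
>             if (refcount[j] > 0) {
>                 lookup[v] |= COW_BIT;
>             }
>         }
>     }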
> 
> Once the image is opened, the refcount table is discarded and never
> updated during normal execution of the VM. This is how FVD gets the
> best of QCOW2’s refcount table but without its runtime overhead. When
> the VM issues a write to a chunk in writeable-front and this chunk is
> shared with snapshots (which is known from the special marker bit in
> the lookup table entries), FVD uses the free-chunk-bitmap to find a
> free chunk, performs copy-on-write, and updates the lookup table to
> point to that new chunk. This metadata change to the lookup table is
> persisted in FVD’s journal and is not written directly to the
> writeable-front’s on-disk lookup table. When the VM shuts down
> gracefully, the entire lookup table is written back to the
> writeable-front’s on-disk lookup table (this step takes about 20ms),
> and FVD’s image header is updated to indicate that there is no need to
> recover from the journal on the next VM boot.
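> 
> On the fast path, the write logic just described might look roughly as
> follows. This is a hypothetical sketch: alloc_free_chunk(),
> copy_chunk(), journal_append(), and chunk_write() are placeholder
> helpers, and partial-write handling is elided.
> 
>     /* Sketch of the copy-on-write write path; hypothetical code. */
>     static void write_virtual_chunk(uint32_t v, const void *buf)
>     {
>         if (lookup[v] & COW_BIT) {
>             uint32_t old_chunk = lookup[v] & ADDR_MASK;
>             uint32_t new_chunk = alloc_free_chunk(); /* from the bitmap */
> 
>             copy_chunk(old_chunk, new_chunk);  /* copy-on-write */
>             lookup[v] = new_chunk;             /* COW bit now cleared */
>             journal_append(v, new_chunk);      /* persist via the journal
>                                                 * only; neither the on-disk
>                                                 * lookup table nor the
>                                                 * refcount table is touched
>                                                 * on this fast path. */
>         }
>         chunk_write(lookup[v] & ADDR_MASK, buf);
>     }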
> 
> Now let me describe the optimization that can further reduce the 39ms
> needed for opening an image with n snapshots. When the VM shuts down
> gracefully, FVD can save the in-memory 125KB free-chunk-bitmap to disk.
> When the VM reboots, FVD can load the 125KB free-chunk-bitmap from disk
> and skip steps O1, O2, and O3. Since the lookup table is saved to disk
> on a clean shutdown and one bit in its entries marks whether a chunk is
> shared with snapshots, the scanning step O4 can also be avoided. In
> other words, all of steps O1-O4 can be avoided after a clean shutdown.
> They are needed only after a host crash. During the recovery process
> after a host crash, the journal rebuilds the lookup table, and the
> refcount table and the lookup table together rebuild the
> free-chunk-bitmap.
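> 
> In pseudocode, the optimized open path reduces to the following
> (hypothetical sketch with placeholder helpers):
> 
>     /* Sketch of the optimized open path; hypothetical code. */
>     static void open_image_optimized(void)
>     {
>         if (header_indicates_clean_shutdown()) {
>             load_free_chunk_bitmap();  /* one 125KB read replaces O1-O3 */
>             /* COW bits were persisted inside the lookup table on clean
>              * shutdown, so the comparison scan O4 is skipped as well. */
>         } else {
>             replay_journal();          /* rebuild the lookup table */
>             open_steps_o1_to_o4();     /* rebuild bitmap and COW bits */
>         }
>     }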
> 
> This concludes the description of FVD's image open operation.
> 
> (…to be continued… I am running out of time now, and will write about
> FVD’s other snapshot operations in separate emails.)

Let me continue my writeup and describe the other snapshot-related 
operations in FVD.

Operation 2: Snapshot creation.

FVD performs the following operations when creating a new snapshot.

C1: Create a new snapshot by saving a copy of FVD's bitmap and lookup 
table to a new location on disk. FVD's lookup table for a 1TB disk is 
4MB and the bitmap is even smaller. This step takes about 30ms.

C2: Load the 2MB refcount table from disk to memory. This step takes 
about 15ms.

C3: Use the writeable-front's 4MB lookup table to update the in-memory 
2MB refcount table. For each chunk j referenced by the lookup table, 
increment its count: refcount[j]++. This step is an in-memory operation 
and takes about 2ms.

C4: Write the new 2MB refcount table to disk. This step takes about 15ms.

C5: Write the new list of snapshots to disk. This step takes about 10ms.

The snapshot creation process above takes about 72ms. Recall that on the 
fast path during the VM's normal execution, FVD never updates the 
on-disk refcount table and does not even keep the refcount table in 
memory. Conceptually, updating the refcount table is shifted from the 
fast path of VM execution to steps C2, C3, and C4 of the snapshot 
creation process. From another perspective, this also allows FVD to 
batch all updates to the refcount table and perform them in a single, 
efficient, sequential write at snapshot creation time. This is why 
internal snapshots based on the refcount table are so efficient in FVD.
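
Steps C2-C4 amount to one batched pass over writeable-front's lookup 
table followed by one sequential write of the refcount table. A minimal 
sketch, reusing the hypothetical structures and placeholder helpers from 
above:

    /* Sketch of snapshot creation steps C2-C4; hypothetical code. */
    static void snapshot_create_refcount_pass(void)
    {
        load_refcount_table();                      /* C2: 2MB read, ~15ms */
        for (uint32_t v = 0; v < NUM_CHUNKS; v++) { /* C3: in memory, ~2ms */
            if (lookup[v] != 0) {
                /* The new snapshot now also references this chunk. */
                refcount[lookup[v] & ADDR_MASK]++;
            }
        }
        write_refcount_table();             /* C4: sequential write, ~15ms */
    }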

Operation 3: Snapshot deletion.

FVD performs the following operations when deleting a snapshot. Suppose 
snapshot X is to be deleted.

D1: Load the 2MB refcount table from disk to memory. This step takes 
about 15ms.

D2: Load snapshot X's 4MB lookup table from disk. This step takes about 
20ms.

D3: Use snapshot X's 4MB lookup table to update the in-memory 2MB 
refcount table. For each chunk j referenced by snapshot X's lookup 
table, decrement its count: refcount[j]--. This step is an in-memory 
operation and takes about 2ms.

D4: Write the new 2MB refcount table to disk. This step takes about 15ms.

D5: Write the new list of snapshots to disk. This step takes about 10ms.

The snapshot deletion process above takes about 62ms. 
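
Deletion is the symmetric pass over the deleted snapshot's lookup table. 
Another hypothetical sketch; snap_lookup stands for snapshot X's table 
loaded in step D2:

    /* Sketch of snapshot deletion steps D1-D4; hypothetical code. */
    static void snapshot_delete_refcount_pass(int x)
    {
        load_refcount_table();                      /* D1: 2MB read, ~15ms */
        uint32_t *snap_lookup = load_snapshot_lookup_table(x); /* D2: ~20ms */
        for (uint32_t v = 0; v < NUM_CHUNKS; v++) { /* D3: in memory, ~2ms */
            if (snap_lookup[v] != 0) {
                /* Snapshot x no longer references this chunk; a count that
                 * drops to zero makes the chunk reclaimable as free space. */
                refcount[snap_lookup[v] & ADDR_MASK]--;
            }
        }
        write_refcount_table();             /* D4: sequential write, ~15ms */
    }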

Operation 4: Go to a snapshot, i.e., derive a new writeable-front based on 
a snapshot.

FVD performs the following operations when going to a snapshot X.

G1: Load snapshot X's bitmap and lookup table from disk into memory, and 
save a new copy on disk, which will be the on-disk metadata for the new 
writeable-front. FVD's lookup table for a 1TB disk is 4MB and the bitmap 
is even smaller. This step takes about 60ms.

G2: Update the FVD image's header to point to the new bitmap and the new 
lookup table. This step takes 7ms.

G3: Follow the image open operation, which takes about 39ms in total.

The goto-snapshot process above takes about 106ms. Note that when going 
to a snapshot, FVD need not update the refcount table, because the 
refcount table only counts read-only snapshots, and that set does not 
change during a goto-snapshot operation.
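
Put together, the goto operation is just metadata copying plus a reopen. 
A hypothetical sketch with placeholder helpers and offsets:

    /* Sketch of the goto-snapshot operation; hypothetical code. */
    static void goto_snapshot(int x)
    {
        /* G1: the new writeable-front starts as an exact copy of snapshot
         * x's on-disk bitmap and lookup table (~60ms). */
        copy_on_disk(snapshot_bitmap_offset(x), new_front_bitmap_offset);
        copy_on_disk(snapshot_lookup_offset(x), new_front_lookup_offset);

        /* G2: switch the image header to the new metadata (~7ms). */
        update_image_header(new_front_bitmap_offset, new_front_lookup_offset);

        /* G3: reopen the image (steps O1-O4, ~39ms). The refcount table
         * is untouched: the set of read-only snapshots did not change. */
        open_image();
    }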

In summary, snapshot creation, snapshot deletion, and snapshot goto are 
all fast operations in FVD, taking 106ms or less. Most importantly, on 
the fast path during the normal execution of the VM, FVD never updates 
the on-disk refcount table and does not even keep the refcount table in 
memory. The only overhead is the 125KB free-chunk-bitmap that must be 
kept in memory, but the free-chunk-bitmap is already needed to support 
trim and to recover leaked storage space after a host crash, even 
without the snapshot feature. In other words, FVD gets all the benefits 
of QCOW2's internal snapshot function without paying any of its 
overhead. I truly believe this is the ideal solution for snapshots, 
going beyond the current state of the art in QCOW2 and VMware.

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang

