Re: [Qemu-devel] QCOW2 deduplication design
From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] QCOW2 deduplication design
Date: Wed, 9 Jan 2013 17:16:04 +0100
On Wed, Jan 9, 2013 at 4:24 PM, Benoît Canet <address@hidden> wrote:
> Here is a mail to open a discussion on QCOW2 deduplication design and
> performance.
>
> The current deduplication strategy is RAM based.
> One goal of the project is to plan and implement an alternative way to do
> the lookups from disk for bigger images.
>
>
> In the first section I will enumerate the disk overheads of the RAM based
> lookup strategy, and in the second section the additional costs of doing
> lookups in a disk based prefix b-tree.
>
> Comments and suggestions are welcome.
>
> I) RAM based lookups overhead
>
> The qcow2 read path is not modified by the deduplication patchset.
>
> Each cluster written gets its hash computed.
>
> Two GTrees are used to give access to the hashes: one indexed by hash and
> the other indexed by physical offset.
What is the GTree indexed by physical offset used for?
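The two-index scheme described above can be sketched as follows (Python, purely illustrative: `DedupStore` and its method names are invented, and SHA-256 stands in for whatever hash function the patchset actually uses). The offset-keyed index answers exactly this question: it lets the removal path map a physical cluster back to its hash so the hash can be cleared.

```python
import hashlib

class DedupStore:
    """Illustrative stand-in for the two GTrees described above."""

    def __init__(self):
        self.by_hash = {}    # hash -> physical cluster offset
        self.by_offset = {}  # physical cluster offset -> hash

    def insert(self, data, offset):
        # Compute the cluster hash and register it in both indexes.
        h = hashlib.sha256(data).digest()
        self.by_hash[h] = offset
        self.by_offset[offset] = h
        return h

    def lookup(self, data):
        """Return the physical offset of an identical cluster, or None."""
        return self.by_hash.get(hashlib.sha256(data).digest())

    def remove_by_offset(self, offset):
        """Cluster removal path (I.5): find the hash by physical offset."""
        h = self.by_offset.pop(offset, None)
        if h is not None:
            del self.by_hash[h]

store = DedupStore()
store.insert(b"\x00" * 4096, offset=0x10000)
assert store.lookup(b"\x00" * 4096) == 0x10000
store.remove_by_offset(0x10000)
assert store.lookup(b"\x00" * 4096) is None
```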
> I.0) unaligned write
>
> When a write is unaligned or smaller than a 4KB cluster, the deduplication
> code issues one or two reads to get the missing data required to build a
> 4KB*n linear buffer.
> The deduplication metrics code shows that this situation doesn't happen
> with virtio and ext3 as the guest partition.
If the application uses O_DIRECT inside the guest you may see <4 KB
requests even on ext3 guest file systems. But in the buffered I/O
case the file system will use 4 KB blocks or similar.
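The read-modify-write widening described in I.0 can be sketched like this (Python; `align_write` is an invented helper, not code from the patchset):

```python
CLUSTER = 4096  # 4KB dedup cluster size used in the discussion above

def align_write(offset, length):
    """Widen a write to 4KB-aligned boundaries.

    Returns (aligned_offset, aligned_length, needs_head_read, needs_tail_read):
    up to two extra reads are needed to fill the head and tail of the buffer.
    """
    head = offset % CLUSTER          # bytes missing before the write
    start = offset - head
    end = offset + length
    tail = (-end) % CLUSTER          # bytes missing after the write
    return start, head + length + tail, head != 0, tail != 0

# A 512-byte write at offset 1024 needs both a head and a tail read:
assert align_write(1024, 512) == (0, 4096, True, True)
# An aligned full-cluster write needs no extra reads:
assert align_write(4096, 4096) == (4096, 4096, False, False)
```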
>
> I.1) First write overhead
>
> The hash is computed.
>
> The cluster is not duplicated, so the hash is stored in a linked list.
>
> After that, the writev call gets a new 64KB L2 dedup hash block
> corresponding to the physical sector of the written cluster.
> (This can be an allocating write, requiring the offset of the new block to
> be written in the dedup table and flushed.)
>
> The hash is written in the L2 dedup hash block and flushed later by the
> dedup_block_cache.
>
> I.2) Same cluster rewrite at the same place
>
> The hash is computed.
>
> qcow2_get_cluster_offset is called and the result is used to check that
> this is a rewrite.
>
> The cluster is counted as duplicated and not rewritten on disk.
This case is when identical data is rewritten in place? No writes are
required - this is the scenario where online dedup is faster than
non-dedup because we avoid I/O entirely.
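That fast path can be sketched like this (Python, invented names; the real code calls qcow2_get_cluster_offset rather than doing a dict lookup):

```python
def is_inplace_rewrite(by_hash, l2_table, logical_sector, data_hash):
    """True if identical data is rewritten to the same place (case I.2).

    by_hash maps cluster hash -> physical offset; l2_table maps logical
    sector -> physical offset (standing in for qcow2_get_cluster_offset).
    When this returns True, no I/O is needed at all.
    """
    phys = by_hash.get(data_hash)
    return phys is not None and l2_table.get(logical_sector) == phys

by_hash = {b"h1": 0x10000}
l2_table = {7: 0x10000}
assert is_inplace_rewrite(by_hash, l2_table, 7, b"h1")      # no write needed
assert not is_inplace_rewrite(by_hash, l2_table, 8, b"h1")  # duplicate elsewhere (I.3)
```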
>
> I.3) First duplicated cluster write
>
> The hash is computed.
>
> qcow2_get_cluster_offset is called and we see that we are not rewriting
> the same cluster at the same place.
>
> I.3.a) The L2 entry of the first cluster written with this hash is
> overwritten to remove the QCOW_OFLAG_COPIED flag.
>
> I.3.b) The dedup hash block of the hash is overwritten to remember at the
> next startup that QCOW_OFLAG_COPIED has been cleared.
>
> A new L2 entry is created for this logical sector pointing to the physical
> cluster. (potential allocating write)
>
> The refcount of the physical cluster is updated.
>
> I.4) Duplicated clusters further writes
>
> Same as I.3 but without steps I.3.a and I.3.b.
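The difference between I.3 and I.4 can be sketched as follows (Python, invented names; QCOW_OFLAG_COPIED's bit position matches the qcow2 spec, everything else is illustrative):

```python
QCOW_OFLAG_COPIED = 1 << 63  # bit 63 of a qcow2 L2 entry

def dedup_duplicate_write(l2, refcount, logical_sector, phys):
    """Handle a write whose data already exists at physical offset phys."""
    writes = []
    if refcount[phys] == 1:                       # first duplicate (I.3)
        for sec, entry in l2.items():
            if entry & ~QCOW_OFLAG_COPIED == phys:
                l2[sec] = phys                    # I.3.a: clear COPIED flag
        writes.append("dedup hash block update")  # I.3.b: persist the clear
    l2[logical_sector] = phys                     # new L2 entry for this sector
    refcount[phys] += 1
    return writes

l2 = {3: 0x10000 | QCOW_OFLAG_COPIED}
refcount = {0x10000: 1}
# First duplicate write must do the I.3.a/I.3.b metadata updates:
assert dedup_duplicate_write(l2, refcount, 9, 0x10000) == ["dedup hash block update"]
assert l2[3] == 0x10000 and refcount[0x10000] == 2
# Further duplicate writes (I.4) skip the flag work:
assert dedup_duplicate_write(l2, refcount, 12, 0x10000) == []
```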
>
> I.5) cluster removal
> When an L2 entry to a cluster becomes stale, the qcow2 code decrements the
> refcount.
> When the refcount reaches zero, the L2 hash block of the stale cluster
> is written to clear the hash.
> This happens often and requires the second GTree to find the hash by its
> physical sector number.
This happens often? I'm surprised. I thought this only happened when
you delete snapshots or resize the image file? Maybe I misunderstood
this case.
> I.6) max refcount reached
> The L2 hash block of the cluster is written in order to remember at the
> next startup that it must not be used anymore for deduplication. The hash
> is dropped from the GTrees.
Interesting case. This means you can no longer take snapshots
containing this cluster because we cannot track references :(.
Worst case: guest fills the disk with the same 4 KB data (e.g.
zeroes). There is only a single data cluster but the refcount is
maxed out. Now it is not possible to take a snapshot.
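The saturation case can be sketched like this (Python, invented names; the 16-bit maximum matches the default qcow2 refcount width, but the rest is illustrative):

```python
MAX_REFCOUNT = 0xFFFF  # default qcow2 refcount width is 16 bits

def try_dedup(by_hash, refcount, data_hash):
    """Return the physical offset to reuse, or None if a new cluster is needed."""
    phys = by_hash.get(data_hash)
    if phys is None:
        return None                   # no duplicate: allocate normally
    if refcount[phys] >= MAX_REFCOUNT:
        del by_hash[data_hash]        # I.6: stop deduplicating this cluster
        return None                   # next identical write gets a fresh cluster
    refcount[phys] += 1
    return phys

by_hash = {b"h": 0x10000}
refcount = {0x10000: MAX_REFCOUNT - 1}
assert try_dedup(by_hash, refcount, b"h") == 0x10000   # last allowed reference
assert try_dedup(by_hash, refcount, b"h") is None      # saturated: dedup disabled
assert b"h" not in by_hash
```

Once the hash is dropped, a later identical write allocates a second physical copy of the same data, which starts a fresh refcount; this sketches why the guest-full-of-zeroes worst case Stefan describes pins the refcount of one cluster at the maximum.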
Stefan