[Qemu-devel] QCOW2 deduplication design

From: Benoît Canet
Subject: [Qemu-devel] QCOW2 deduplication design
Date: Wed, 9 Jan 2013 16:24:44 +0100
User-agent: Mutt/1.5.21 (2010-09-15)


Here is a mail to open a discussion on the QCOW2 deduplication design.

The current deduplication strategy is RAM based.
One of the goals of the project is to plan and implement an alternative way to do
the lookups from disk for bigger images.

In a first section I will enumerate the disk overheads of the RAM based lookup
strategy, then in a second section the additional costs of doing the lookups in
a disk based prefix b-tree.

Comments and suggestions are welcome.

I) RAM based lookups overhead

The qcow2 read path is not modified by the deduplication patchset.

Each cluster written gets its hash computed.

Two GTrees are used to give access to the hashes: one indexed by hash and
the other indexed by physical offset.
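As a sketch of this dual-index scheme (Python used for brevity; the real code
uses two GLib GTrees in C, and all names below are hypothetical), the two
indexes can be modeled as two maps kept in sync:

```python
# Sketch of the two in-RAM indexes (the real code uses two GLib GTrees).
# All names here are hypothetical.

class DedupIndex:
    def __init__(self):
        self.by_hash = {}    # hash -> entry (used on the write path)
        self.by_offset = {}  # physical offset -> entry (used on cluster removal)

    def insert(self, cluster_hash, phys_offset):
        entry = {"hash": cluster_hash, "offset": phys_offset}
        self.by_hash[cluster_hash] = entry
        self.by_offset[phys_offset] = entry

    def lookup_by_hash(self, cluster_hash):
        return self.by_hash.get(cluster_hash)

    def remove_by_offset(self, phys_offset):
        # Needed when a cluster's refcount drops to zero (see I.5):
        # at that point only the physical offset is known, not the hash.
        entry = self.by_offset.pop(phys_offset, None)
        if entry is not None:
            del self.by_hash[entry["hash"]]
        return entry
```

The second index exists for I.5 below: on cluster removal only the physical
offset is known, so the hash could not be found through the first index alone.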

I.0) unaligned write

When a write is unaligned or smaller than a 4KB cluster the deduplication code
issues one or two reads to get the missing data required to build a 4KB*n linear
buffer.
The deduplication metrics code shows that this situation does not happen with
virtio and an ext3 guest partition.
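A minimal sketch of the alignment arithmetic behind those one or two reads
(Python for brevity; the function name and return layout are hypothetical,
assuming the 4KB clusters described above):

```python
CLUSTER = 4096  # deduplication works on 4KB clusters

def rmw_reads(offset, size):
    """Return the (offset, length) head/tail reads needed to turn an
    unaligned guest write into a 4KB*n linear buffer (0, 1 or 2 reads)."""
    start = (offset // CLUSTER) * CLUSTER                        # round down
    end = ((offset + size + CLUSTER - 1) // CLUSTER) * CLUSTER   # round up
    reads = []
    if offset > start:
        reads.append((start, offset - start))                    # head fill
    if end > offset + size:
        reads.append((offset + size, end - (offset + size)))     # tail fill
    return reads
```

An aligned 4KB*n write needs no reads; a write straddling cluster boundaries
needs both a head and a tail read.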

I.1) First write overhead

The hash is computed

The cluster is not duplicated so the hash is stored in a linked list

After that the writev call gets a new 64KB L2 dedup hash block corresponding to
the physical sector of the written cluster.
(This can be an allocating write requiring to write the offset of the new block
in the dedup table and flush)

The hash is written in the L2 dedup hash block and flushed later by the qcow2
metadata cache.
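The first-write path above can be sketched as follows (Python for brevity;
SHA-256 is only an assumed stand-in for whatever hash the patchset actually
uses, and the on-disk L2 dedup hash block update is elided):

```python
import hashlib

def first_write(index, cluster_data, phys_offset):
    """Sketch of the first-write path (I.1), all names hypothetical:
    compute the hash, miss in the index, record the new mapping.
    Returns (physical offset to use, whether it was a duplicate)."""
    h = hashlib.sha256(cluster_data).digest()  # assumed hash function
    if h in index:
        return index[h], True      # duplicate: handled by I.3/I.4 instead
    index[h] = phys_offset         # unseen hash: remember where it lives
    return phys_offset, False
```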

I.2) Same cluster rewrite at the same place

The hash is computed

qcow2_get_cluster_offset is called and the result is used to check that it is a
rewrite of the same cluster at the same place.

The cluster is counted as duplicated and not rewritten on disk

I.3) First duplicated cluster write

The hash is computed

qcow2_get_cluster_offset is called and we see that we are not rewriting the same
cluster at the same place

I.3.a) The L2 entry of the first cluster written with this hash is overwritten
to remove the QCOW_OFLAG_COPIED flag.

I.3.b) The dedup hash block of the hash is overwritten to remember at the next
startup that QCOW_OFLAG_COPIED has been cleared.

A new L2 entry is created for this logical sector pointing to the physical
cluster. (potential allocating write)

The refcount of the physical cluster is updated.

I.4) Duplicated clusters further writes

Same as I.2 without I.3.a and I.3.b
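The decision logic of cases I.2 to I.4 can be summarized in a small sketch
(Python for brevity; all names are hypothetical and the real code goes through
qcow2_get_cluster_offset and the qcow2 metadata structures):

```python
def handle_write(dedup_offset, current_l2_offset, copied_flag_set):
    """Classify a write whose hash already exists in the dedup index.
    dedup_offset: physical offset recorded for this hash.
    current_l2_offset: what the guest's L2 entry currently points to.
    Returns the list of actions to perform (sketch of cases I.2-I.4)."""
    if current_l2_offset == dedup_offset:
        # I.2: same cluster rewritten at the same place.
        return ["count as duplicated, skip disk write"]
    actions = []
    if copied_flag_set:
        # I.3: first duplicated write; the flag must be cleared once.
        actions.append("clear QCOW_OFLAG_COPIED in first L2 entry")
        actions.append("persist cleared flag in dedup hash block")
    # I.3/I.4: remap the logical cluster and bump the refcount.
    actions.append("point L2 entry at dedup_offset")
    actions.append("increment refcount of physical cluster")
    return actions
```

Further duplicated writes (I.4) take the same path with the flag already
cleared, so only the remap and refcount actions remain.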

I.5) cluster removal
When an L2 entry to a cluster becomes stale the qcow2 code decrements the
refcount of the cluster.
When the refcount reaches zero the L2 hash block of the stale cluster
is written to clear the hash.
This happens often and requires the second GTree to find the hash by its
physical sector number.

I.6) max refcount reached
The L2 hash block of the cluster is written in order to remember at next startup
that it must not be used anymore for deduplication. The hash is dropped from the
GTree.

II) Disk based lookup additional overhead

My initial idea is to replace the RAM based GTree indexed by hash
by a prefix b-tree stored on disk.
As hashes are mostly random the prefix tree should work well.

An additional data structure would be needed to do the lookups by
physical offsets.
This additional data structure is clearly a bottleneck in this design.

Each lookup will cost O(log n) disk IOs.
Insertion and deletion can cost double that (leaves and nodes may need to be
split and merged).
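A back-of-the-envelope check of that cost, under assumed parameters (a 1 TB
image with 4KB clusters, i.e. 2**28 hashes, and 4KB tree nodes holding about
100 entries each; both figures are assumptions, not taken from the patchset):

```python
import math

def btree_depth(n_keys, fanout):
    """Minimum b-tree height for n_keys with the given fanout, i.e. the
    worst-case number of node reads per lookup (assuming no node cache)."""
    return max(1, math.ceil(math.log(n_keys, fanout)))

# Assumed example: 1 TB image / 4KB clusters => 2**28 hashes,
# with ~100 entries per 4KB tree node.
print(btree_depth(2**28, 100))  # -> 5
```

So a lookup would cost on the order of five node reads in the worst case,
fewer if the upper levels of the tree stay cached in RAM.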

II.1) First write overhead

An O(log n) lookup in the b-tree to find the hash.

Disk lookups by physical sector to remove the stale hash node.
(What structure to use for this?)

When the hash is written the leaf of the failed lookup can be reused.
If the leaf is full, splitting the leaf and restructuring the tree must be done
in O(log n).

Update of a not yet defined way to look up b-tree leaves by their offset on disk
(potential allocating write)

II.2) Same cluster rewrite at the same place
lookup in the b-tree to find the hash

II.3) First duplicated cluster write.
lookup in the b-tree to find the hash

The leaf of the b-tree must be overwritten to remember that we have cleared
QCOW_OFLAG_COPIED.
II.4) Duplicated cluster further writes
lookup in the b-tree to find the hash

II.5) cluster removal
A query to find the hash b-tree leaf by sector must be done on a not yet
defined data structure.

The hash must be removed from the b-tree leaf

Two methods can be used to bypass the need for this additional data structure:
-Read the cluster, then compute its hash and use the hash for the removal.
-Store hashes and clusters forever while setting their refcount to a special
value meaning "reserved for future reuse".

II.6) max refcount reached

The RAM implementation just drops the hash corresponding to the written cluster
and marks on disk that this hash must not be reloaded. A duplicate hash is
then created.

I have not yet found how to handle this with a b-tree, as it is not supposed
to contain duplicate keys.


