Re: [Qemu-devel] QCOW2 deduplication


From: Kevin Wolf
Subject: Re: [Qemu-devel] QCOW2 deduplication
Date: Wed, 27 Feb 2013 13:08:20 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

On 26.02.2013 at 18:14, Benoît Canet wrote:
> 
> Hello Kevin,
> 
> As you are the best person to discuss QCOW2 implementation issues with, I am
> writing this mail so you know what has been done on deduplication and what
> I am planning to do next.
> 
> In short, I need your feedback before going into another code sprint and
> another round of code review.
> 
> Current status
> --------------
> 
> Also see (http://lists.gnu.org/archive/html/qemu-devel/2013-02/msg01260.html)
> 
> The current prototype of the QCOW2 deduplication uses 32-byte SHA256 or SKEIN
> hashes to identify each 4KB cluster with a very low probability of collisions.

How do you handle the rare collision cases? Do you read the original
cluster and compare the exact contents when the hashes match?
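
For illustration only, a minimal hash-then-verify sketch; dedup_find_by_hash()
and read_cluster() are hypothetical placeholders, and the SHA256 computation
simply uses glib's GChecksum:

    /* Sketch only: identify a 4KB cluster by its 32-byte hash and verify the
     * actual contents when a candidate with the same hash already exists. */
    #include <glib.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define DEDUP_CLUSTER_SIZE 4096
    #define DEDUP_HASH_SIZE    32

    /* Hypothetical helpers, not real qcow2 functions: */
    bool dedup_find_by_hash(const uint8_t hash[], uint64_t *cluster_offset);
    int read_cluster(uint64_t cluster_offset, uint8_t *buf);

    static void dedup_hash_cluster(const uint8_t *data,
                                   uint8_t hash[DEDUP_HASH_SIZE])
    {
        GChecksum *cs = g_checksum_new(G_CHECKSUM_SHA256);
        gsize len = DEDUP_HASH_SIZE;

        g_checksum_update(cs, data, DEDUP_CLUSTER_SIZE);
        g_checksum_get_digest(cs, hash, &len);
        g_checksum_free(cs);
    }

    /* True only if 'data' really duplicates the cluster at *cluster_offset. */
    static bool dedup_is_duplicate(const uint8_t *data, uint64_t *cluster_offset)
    {
        uint8_t hash[DEDUP_HASH_SIZE];
        uint8_t existing[DEDUP_CLUSTER_SIZE];

        dedup_hash_cluster(data, hash);
        if (!dedup_find_by_hash(hash, cluster_offset)) {
            return false;                     /* no candidate at all */
        }
        if (read_cluster(*cluster_offset, existing) < 0) {
            return false;                     /* be conservative on error */
        }
        return memcmp(data, existing, DEDUP_CLUSTER_SIZE) == 0;
    }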

> The 4KB cluster size was chosen because it's the block size of most of the
> filesystems that will run on guests.
> The current prototype shows that very high deduplication ratios are achievable
> with 4KB clusters.

Is this the qcow2 cluster size or is it independent from it? Is 4k a
must or is it just what you think is a good default?

> The SKEIN implementation I am linking to runs at 400MB/s on a modern machine,
> so it's not the bottleneck.
> 
> The deduplication code does not modify the read path.
> 
> The deduplication code is hooked in qcow2_co_writev in three places.
> 
> qcow2_dedup_read_missing_and_concatenate:
> At this first hook point, at most two reads are issued to build a buffer whose
> size is a multiple of the cluster size.
> These two reads cover the beginning and the end of the buffer.
> In practice guest filesystems use 4KB clusters, so no unaligned writes are done
> and these two reads are skipped.

If you do read the cluster, how do you protect against concurrent write
requests to the same cluster? Just hold s->lock or something finer grained?
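
For illustration, a sketch of the head/tail alignment arithmetic described
above; it only shows which extra byte ranges would have to be read to round an
unaligned request out to whole clusters (names are made up):

    /* Sketch: round a guest write [offset, offset + len) out to cluster
     * boundaries; the head and tail gaps are what would have to be read from
     * the image before whole clusters can be hashed. */
    #include <stdint.h>

    #define CLUSTER_SIZE 4096ULL

    static void dedup_align_request(uint64_t offset, uint64_t len,
                                    uint64_t *head_read_len,
                                    uint64_t *tail_read_len)
    {
        uint64_t start = offset & ~(CLUSTER_SIZE - 1);             /* round down */
        uint64_t end   = (offset + len + CLUSTER_SIZE - 1)
                         & ~(CLUSTER_SIZE - 1);                    /* round up */

        *head_read_len = offset - start;          /* 0 for aligned requests */
        *tail_read_len = end - (offset + len);    /* 0 for aligned requests */
    }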

> qcow2_dedup:
> As we discussed on IRC a few months ago, this function tries to deduplicate
> as many consecutive clusters as it can.
> It then skips all the non-deduplicable clusters and locates the next cluster
> that can be deduplicated.
> This way qcow2_co_writev knows exactly how much data it has to write.
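
A sketch of that "longest deduplicable run" scan; dedup_find_by_hash() is a
hypothetical placeholder and this is not the real qcow2_dedup() signature:

    /* Sketch: count how many consecutive clusters of a write request can be
     * deduplicated, so the caller knows exactly how much data it still has to
     * write out itself. */
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical lookup helper: */
    bool dedup_find_by_hash(const uint8_t hash[], uint64_t *cluster_offset);

    static int dedup_count_run(uint8_t hashes[][32], int first, int nb_clusters)
    {
        int i;

        for (i = first; i < nb_clusters; i++) {
            uint64_t existing_offset;

            if (!dedup_find_by_hash(hashes[i], &existing_offset)) {
                break;      /* first non-deduplicable cluster ends the run */
            }
            /* the real code would also add a reference and update the L2
             * entry for the deduplicated cluster here */
        }
        return i - first;
    }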
> 
> qcow2_dedup_store_new_hashes:
> This function is to be called after the actual new data is written to disk in
> order to make the hashes persistent.

What happens if the host crashes between writing the data and updating
the hashes?
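
For illustration, one conservative ordering (not necessarily what the
prototype does, and with hypothetical helper names): persist the hash only
after the data it describes is stable, so a crash in between costs at most a
lost deduplication opportunity:

    /* Sketch: keep the ordering "data first, then its hash", so that a crash
     * in between never leaves a hash pointing at data that was not written. */
    #include <stdint.h>

    /* Hypothetical helpers: */
    int write_cluster(uint64_t cluster_offset, const uint8_t *data);
    int flush_to_disk(void);
    int dedup_store_hash(const uint8_t hash[], uint64_t cluster_offset);

    static int dedup_write_then_store(uint64_t cluster_offset,
                                      const uint8_t *data,
                                      const uint8_t hash[32])
    {
        int ret = write_cluster(cluster_offset, data);
        if (ret < 0) {
            return ret;
        }
        ret = flush_to_disk();          /* make the data stable first */
        if (ret < 0) {
            return ret;
        }
        return dedup_store_hash(hash, cluster_offset);
    }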

> When qcow2_alloc_cluster_link_l2 frees a block, its hash is deleted from
> disk.

qcow2_alloc_cluster_link_l2() frees clusters only if they had a
refcount > 1 and COW was needed. The cluster still has a refcount > 0
after one reference was dropped, so there's no reason to delete the
hash.

> When the refcount of a block exceeds 2^16/2, the hash of this block is tagged
> and forgotten so there will be room to take qcow2 snapshots.
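
A sketch of that half-max-refcount guard, assuming qcow2's 16-bit refcounts
and a hypothetical dedup_forget_hash():

    /* Sketch: refuse further deduplication into a cluster once its refcount
     * reaches half of the 16-bit maximum, so snapshots can still add refs. */
    #include <stdbool.h>
    #include <stdint.h>

    #define DEDUP_REFCOUNT_LIMIT (1u << 15)     /* 2^16 / 2 */

    /* Hypothetical: tag the hash and drop it from the store. */
    void dedup_forget_hash(const uint8_t hash[]);

    static bool dedup_refcount_exhausted(uint16_t refcount,
                                         const uint8_t hash[32])
    {
        if (refcount >= DEDUP_REFCOUNT_LIMIT) {
            dedup_forget_hash(hash);
            return true;
        }
        return false;
    }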
> 
> Most of the code is done in qcow2-dedup.c
> 
> The current implementation stores the hashes in RAM in two glib GTrees.
> The first one is indexed by hash and the second one by cluster index.
> 
> On disk the hashes are stored in a sparse L1/L2 table indexed by physical
> cluster sectors.
> 
> This organization works well for images of a few GB, but as Stefan says it
> will not be sufficient for TB-sized images because the GTrees eat too much
> RAM.

Yes, we can't keep everything in RAM for large images.

What is the physical offset -> hash mapping actually used for?
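
For concreteness, a sketch of such a double index with glib GTrees, using a
hypothetical QCowHashNode rather than the prototype's actual structures:

    /* Sketch: one node carries both keys and is inserted into two GTrees, so
     * it can be found either by hash or by physical cluster index. */
    #include <glib.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct QCowHashNode {
        uint8_t  hash[32];
        uint64_t cluster_index;         /* physical location of the data */
    } QCowHashNode;

    static gint compare_hash(gconstpointer a, gconstpointer b)
    {
        return memcmp(a, b, 32);
    }

    static gint compare_cluster_index(gconstpointer a, gconstpointer b)
    {
        const uint64_t *ia = a, *ib = b;
        return (*ia > *ib) - (*ia < *ib);
    }

    static void dedup_index_insert(GTree *by_hash, GTree *by_cluster,
                                   QCowHashNode *node)
    {
        g_tree_insert(by_hash, node->hash, node);
        g_tree_insert(by_cluster, &node->cluster_index, node);
    }

    /* by_hash    = g_tree_new(compare_hash);
     * by_cluster = g_tree_new(compare_cluster_index); */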

> Alternative disk data structures
> --------------------------------
> 
> I studied a bit the classic data structures that could be used and even
> prototyped one to see how it behaves.
> 
> The first class of data structures that could be used to store the hashes
> is the B-tree family.
> It's well described in textbooks but needs at most O(log(n)) 4KB reads to
> look up a single hash.
> The implementation seems a bit complex, and writing a new hash would also
> need O(log(n)) I/O.
> It would make the deduplication perform badly.

n being the number of allocated clusters in the image, right? And log
the logarithm with base m, with m being the number of elements in one
tree node?

This doesn't really sound _that_ bad. In practice log(n) would be like 3
or 4 then. The first two levels could probably be in memory without
problems.

The problem that I really see is that with hashes you don't really have
any locality that caches for all levels could use, so you don't get
around at least one disk read for each request.
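
A quick back-of-the-envelope check of that depth estimate, assuming 4KB tree
nodes holding ~40-byte entries (32-byte hash plus an offset):

    /* Rough B-tree depth estimate for a 1TB image with 4KB clusters. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double n = (double)(1ULL << 40) / 4096.0;   /* clusters: 2^28 */
        double m = 4096.0 / 40.0;                   /* entries per node: ~100 */

        printf("estimated depth: %.1f levels\n", log(n) / log(m));  /* ~4.2 */
        return 0;
    }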

> The second class of data structures that could be used are hash tables.
> The promise is O(1) lookup and insertion.
> The first shortcoming is that they don't grow well.
> Extendible hashing (http://goo.gl/GiB7A) tries to address the growth issues.
> The second shortcoming is that every new 4KB block stored on disk would
> generate a new hash that would need 1 random 4KB read + 1 random 4KB write to
> be stored.
> With SSD storage being more and more used and not being good at random writes,
> this makes the solution less attractive.

O(1)? What about collisions?
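
For reference, a minimal sketch of the extendible-hashing lookup being
discussed (hypothetical structure, lookup path only); it mainly shows why
every insert ends up touching one random 4KB bucket:

    /* Sketch of extendible hashing: a directory of 2^global_depth entries maps
     * the leading bits of the hash to a 4KB on-disk bucket.  Every insert
     * therefore costs one random bucket read plus one random bucket write
     * (and occasionally a bucket split that doubles the directory). */
    #include <stdint.h>

    typedef struct ExtHashDir {
        unsigned  global_depth;      /* assume 1 <= global_depth <= 64 */
        uint64_t *bucket_offsets;    /* 2^global_depth disk offsets */
    } ExtHashDir;

    static uint64_t ext_hash_bucket_offset(const ExtHashDir *dir,
                                           const uint8_t hash[32])
    {
        uint64_t prefix = 0;
        int i;

        /* The hash is already uniform, so its first 8 bytes make a fine
         * directory index. */
        for (i = 0; i < 8; i++) {
            prefix = (prefix << 8) | hash[i];
        }
        return dir->bucket_offsets[prefix >> (64 - dir->global_depth)];
    }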

> I finally found this patent-free paper: SILT (http://goo.gl/AG493).
> 
> In a first step it stores new hashes in a log, making hash writes sequential
> and fast at the cost of some RAM.
> In a second step it reorganizes the latest log into a hash table consuming
> 2 bytes of RAM per hash.
> Then in a third step it merges the second-step hash tables into a very
> efficient store.
> 
> The full SILT implementation available on GitHub is 15k SLOC of C++.
> I think that most of the complexity comes from the third step.
> 
> I am proposing to build the QCOW2 deduplication hash store on the first two
> steps of SILT to get most of the benefits without the complexity.

This makes me wonder if we should look into more general journaling for
all parts of qcow2.
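
For illustration, a sketch of the log stage as described, with a hypothetical
write_at() helper; the in-RAM index is only hinted at in a comment:

    /* Sketch of the log stage: 40-byte hash records are only ever appended,
     * so persisting a new hash is a sequential write instead of a random one. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct __attribute__((packed)) HashLogEntry {
        uint8_t  hash[32];          /* SHA256/SKEIN of the cluster data   */
        uint64_t cluster_offset;    /* where that data lives in the image */
    } HashLogEntry;

    /* Hypothetical sequential-write helper: */
    int write_at(uint64_t offset, const void *buf, size_t size);

    static int dedup_log_append(uint64_t *log_end, const HashLogEntry *entry)
    {
        int ret = write_at(*log_end, entry, sizeof(*entry));
        if (ret < 0) {
            return ret;
        }
        /* A small in-RAM index (hash -> log position) is updated here so the
         * entry can still be found before the log is compacted into the
         * second-step hash table. */
        *log_end += sizeof(*entry);
        return 0;
    }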

> The RAM requirement would be approx 4 bytes per 4KB of data stored on disk.

What exactly are these 4k? Keep in mind that we could use data
structures that occupy more than one cluster if it helps - like QED used
four clusters for each L2 table by default.

> Each of the second-step hash tables would have a small hash filter stored in
> RAM.
> 
> These hash filters would be interleaved in order to make their traversal
> cache-friendly.
> 
> Lookups would linearly scan the hash filters at RAM speed in order to know
> which hash table must be queried.
> 
> The cost to look up a hash would be almost one random read (O(1)).
> The cost to write a hash is amortized as it's written in a log.
> 
> To keep the risk of false positives low with the growing hash filter list
> I would use 4-byte hash filter entries instead of 2-byte ones.
> This would use 1GB of RAM for 1TB worth of stored hashes.
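
Checking those numbers: 1TB of 4KB clusters is 2^28 hashes, and at 4 bytes of
filter entry per hash that is exactly 1GiB:

    /* RAM needed for the in-memory hash filters. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long clusters = (1ULL << 40) / 4096;   /* 2^28 */
        unsigned long long bytes    = clusters * 4;          /* filter RAM */

        printf("%llu clusters -> %llu MiB of filter RAM\n",
               clusters, bytes >> 20);                        /* 1024 MiB */
        return 0;
    }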
> 
> As the first two steps of SILT are the best data structures I found while
> digging into this topic, could you give your feelings about it?

I haven't really looked at it yet, but as you describe it, the log seems
to be the crucial element. I think it can be combined with every on-disk
data structure, in fact. So if what they use fits well, sure, use it,
but if it doesn't, we can still get the write cost down this way.

> Removal of the second lookup data structure
> -------------------------------------------
> 
> Needing two data structures on disk for deduplication is annoying because they
> need to be synchronized, and it may require a third structure (journal).
> 
> The second data structure is used to look up a hash by its physical cluster
> sectors.
> 
> It is used in a few places in the code:
> 
> when a cluster is freed (by qcow2_alloc_cluster_link_l2),
> when a cluster is overwritten,
> when a cluster has reached half of the max refcount.
> 
> The half-max-refcount case already knows the hash value, so the by-sector
> lookup can be avoided.
> 
> The cluster-freed case could read the cluster and recompute the hash, solving
> the issue at the cost of an extra 4KB read.
> 
> The overwritten case can probably be solved if overwriting writes become
> allocating writes. This way the existing cluster is left in place and will be
> freed later.
> 
> Would it be acceptable to change overwriting writes into allocating writes in
> the deduplication case?

Allocating writes come with extra overhead and complexity. I'm not sure
if it's really advisable to do this. But if you think it's the best way
to solve the problem, I'm not opposed to it.
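
For the cluster-freed case discussed above, a sketch of recomputing the hash
from the cluster contents instead of keeping a by-offset index; all helper
names are hypothetical:

    /* Sketch: on free, re-read the cluster and recompute its hash instead of
     * consulting a second (by-offset) on-disk index. */
    #include <stdint.h>

    #define DEDUP_CLUSTER_SIZE 4096
    #define DEDUP_HASH_SIZE    32

    /* Hypothetical helpers (the hash function is the one sketched earlier): */
    int read_cluster(uint64_t cluster_offset, uint8_t *buf);
    void dedup_hash_cluster(const uint8_t *data, uint8_t hash[]);
    int dedup_delete_hash(const uint8_t hash[]);

    static int dedup_drop_hash_for_cluster(uint64_t cluster_offset)
    {
        uint8_t data[DEDUP_CLUSTER_SIZE];
        uint8_t hash[DEDUP_HASH_SIZE];
        int ret;

        ret = read_cluster(cluster_offset, data);   /* the extra 4KB read */
        if (ret < 0) {
            return ret;
        }
        dedup_hash_cluster(data, hash);
        return dedup_delete_hash(hash);
    }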

Kevin


