Re: [Qemu-devel] QCOW2 deduplication

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] QCOW2 deduplication

From:	Kevin Wolf
Subject:	Re: [Qemu-devel] QCOW2 deduplication
Date:	Wed, 27 Feb 2013 17:40:53 +0100
User-agent:	Mutt/1.5.21 (2010-09-15)

Am 27.02.2013 um 16:58 hat Benoît Canet geschrieben:
> > > The current prototype of the QCOW2 deduplication uses 32 bytes SHA256 or 
> > > SKEIN
> > > hashes to identify each 4KB clusters with a very low probability of 
> > > collisions.
> > 
> > How do you handle the rare collision cases? Do you read the original
> > cluster and compare the exact contents when the hashes match?
> 
> Stefan found a paper with the math required to compute the collision
> probability: http://http://plan9.bell-labs.com/sys/doc/venti/venti.html
>              (Section 3.1)
> Doing the math for 1 Exabyte of stored data with 4KB clusters and 256 bits
> hashes gives a probability of 2.57E-49.
> The probability being low enough I plan to code the read/compare as an
> option that the users would toggle.
> The people who wrote the deduplication in ZFS have done it this way.

Fair enough. If you want to gamble with your data for some more
performance, you can turn it off. Should we add some comptaible taint
flag after the image has been used without collision detection?

> > > The 4KB cluster size was choosen because it's the block size of most of 
> > > the
> > > filesytems that will run on guests.
> > > The current prototype shows that very high deduplication ratios are doable
> > > with 4KB clusters.
> > 
> > Is this the qcow2 cluster size or is it independent from it? Is 4k a
> > must or is it just what you think is a good default?
> 
> It's the qcow2 cluster size.
> It's a must because most guest filesystem use 4KB blocks.
> Using 4KB cluster size avoid to be forced to to some extra data read to 
> complete
> unaligned/partial writes from the guest.
> (Imagine a guest doing 4KB writes while the dedup need 64KB of data to compute
> a single hash).

You can't bake assumptions about guests into the file format. If this is
just a good default for most cases, then it needs to be configurable.

> > If you do read the cluster, how do you protect against concurrent write
> > requests to same cluster? Just hold s->lock or something finer grained?
> 
> I have s->dedup_lock which is took at the begining of qcow2_co_writev in the
> deduplication case in order to serialize writes and avoid the concurent write
> case.
> 
> An alternative would be to use something like wait_for_overlapping_request in
> the block layer to deal with this.
> Thus I don't know if this is a better solution.

s->dedup_lock is global for the whole image, whereas a mechanism like
wait_for_overlapping_request would only stall requests that really touch
the same clusters. I guess it's a tradeoff between simple code and
better performance. Starting with simple code is usually not a bad idea.

> > What happens if the host crashes between writing the data and updating
> > the hashes?
> 
> There is three cases : it's a write or the dedup kick in or
> it's a rewrite of a physical block.
> 
> If it's a write it's fine this cluster content will not be deduplicated next 
> time
> the dedup see it.
> 
> If it trigger the dedup it will just increase the refcount and write the l2
> entry.
> 
> If it's a rewrite it's more annoying and I think that doing allocating writes
> on the rewrite case solves this because the hash doesn't need to be updated
> anymore.

Yes, rewrite is the case that I was interested in. I think I start to
understand why you're looking into copy on rewrite.

> > > Each of the second step hash table would have a small hash filter stored 
> > > in ram.
> > > 
> > > These hash filters would be interleaved in order to make their traversal 
> > > cache
> > > friendly.
> > > 
> > > This would need to linearly scan the hash filters at the speed of RAM
> > > in order to know which hash table must be queried.
> > > 
> > > The cost to lookup a hash would almost O(1) random read.
> > > The cost to write a hash is amortized as it's written in a log.
> > > 
> > > To keep the risk of false positive low with the growing hash filter list
> > > I would use 4 bytes hash filter entries instead of 2 bytes ones.
> > > This would use 1GB of RAM for 1TB worth of stored hash.
> > > 
> > > As the two first step of SILT are the best data structures I found while
> > > digging this topic could you give your feelings about it ?
> > 
> > I haven't really looked at it yet, but as you describe it, the log seems
> > to be the crucial element. I think it can be combined with every on-disk
> > data structure, in fact. So if what they use fits well, sure, use it,
> > but if it doesn't, we can still get the write cost down this way.
> 
> Yes the log totally get rid of the random writes and make the data IO/metadata
> IO ratio fine when the guest write new data.

If we decide to go this way, I would vote for introducing a general
qcow2 journal that can contain this, but also information about normal
cluster allocation and things.

> The only real extra cost would be a 4KB read in one of the hash table.
> A small extra cost would be the linear scanning of the hash table filter 
> stored in RAM.

Yes, and unfortunately I don't see any way to get rid of it in the
average case.

Kevin

[Prev in Thread]

Current Thread

[Next in Thread]

[Qemu-devel] QCOW2 deduplication, Benoît Canet, 2013/02/26
- Re: [Qemu-devel] QCOW2 deduplication, Kevin Wolf, 2013/02/27
  - Re: [Qemu-devel] QCOW2 deduplication, Benoît Canet, 2013/02/27
    - Re: [Qemu-devel] QCOW2 deduplication, Kevin Wolf <=
    - Re: [Qemu-devel] QCOW2 deduplication, Stefan Hajnoczi, 2013/02/28
    - Re: [Qemu-devel] QCOW2 deduplication, Kevin Wolf, 2013/02/28

Prev by Date: [Qemu-devel] [ARM] Cortex-R4F and VFP3-D16
Next by Date: Re: [Qemu-devel] [PATCH] net: use socket_set_nodelay() for -netdev socket
Previous by thread: Re: [Qemu-devel] QCOW2 deduplication
Next by thread: Re: [Qemu-devel] QCOW2 deduplication
Index(es):
- Date
- Thread