From: Troy Benjegerdes
Subject: Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
Date: Thu, 3 Jan 2013 13:51:02 -0600
User-agent: Mutt/1.5.20 (2009-06-14)

On Thu, Jan 03, 2013 at 01:39:48PM +0100, Stefan Hajnoczi wrote:
> On Wed, Jan 02, 2013 at 12:26:37PM -0600, Troy Benjegerdes wrote:
> > The probability may be 'low', but it is not zero. Just because it's
> > hard to compute a colliding input doesn't mean you can't do it. If
> > your input data is not random, the probability of a hash collision
> > is going to get skewed.
> 
> The cost of catching hash collisions is an extra read for every write.
> It's possible to reduce this with a 2nd hash function and/or caching.
> 
> I'm not sure it's worth it given the extremely low probability of a hash
> collision.
> 
> Venti is an example of an existing system where hash collisions were
> ignored because the probability is so low.  See Section 3.1, "Choice
> of Hash Function":
> 
> http://plan9.bell-labs.com/sys/doc/venti/venti.html

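(For illustration, a minimal sketch of that verify-on-write path; the
function name and the flat-file POSIX I/O are hypothetical, not the
actual QCOW2 dedup patches. A hash match only nominates a candidate
cluster; the byte-for-byte compare, i.e. the extra read per write
mentioned above, makes the final call.)

    /* Hypothetical sketch, assuming a flat image file and POSIX I/O. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    static bool is_true_duplicate(int fd, off_t candidate_offset,
                                  const uint8_t *buf, size_t cluster_size,
                                  uint8_t *scratch)
    {
        /* The "extra read for every write": fetch the cluster whose
         * hash matched and compare it byte for byte. */
        if (pread(fd, scratch, cluster_size, candidate_offset)
                != (ssize_t)cluster_size) {
            return false;   /* short read or error: just write normally */
        }
        return memcmp(scratch, buf, cluster_size) == 0;
    }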

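For reference, the theoretical figure being leaned on is the birthday
bound: if a k-bit hash behaves like a uniform random function, the
collision probability over n distinct blocks is roughly

    p ~= n^2 / 2^(k+1)

e.g. a 256-bit hash over 2^40 clusters (a number picked purely for
illustration) gives p ~= 2^80 / 2^257 = 2^-177. That is the "extremely
low" being claimed; whether real, non-random inputs and real hardware
honour that model is exactly the question.
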
If you believe that it's 'extremely low', then please provide either:

* experimental evidence to prove your claim, or
* an insurance underwriter who will pay out if data is lost due to
  a hash collision.

What I have heard so far is a lot of theoretical posturing and no
experimental evidence.

Please google for "When the CRC and TCP checksum disagree" (Stone and
Partridge, SIGCOMM 2000) for experimental evidence of the problems with
assuming that the probability is low. This is the abstract:

"Traces of Internet packets from the past two years show that between 1 packet 
in 1,100 and 1 packet in 32,000 fails the TCP checksum, even on links where 
link-level CRCs should catch all but 1 in 4 billion errors. For certain 
situations, the rate of checksum failures can be even higher: in one hour-long 
test we observed a checksum failure of 1 packet in 400. We investigate why so 
many errors are observed, when link-level CRCs should catch nearly all of 
them.We have collected nearly 500,000 packets which failed the TCP or UDP or IP 
checksum. This dataset shows the Internet has a wide variety of error sources 
which can not be detected by link-level checks. We describe analysis tools that 
have identified nearly 100 different error patterns. Categorizing packet 
errors, we can infer likely causes which explain roughly half the observed 
errors. The causes span the entire spectrum of a network stack, from memory 
errors to bugs in TCP.After an analysis we conclude that the checksum will fail 
to detect errors for roughly 1 in 16 million to 10 billion packets. From our 
analysis of the cause of errors, we propose simple changes to several protocols 
which will decrease the rate of undetected error. Even so, the highly 
non-random distribution of errors strongly suggests some applications should 
employ application-level checksums or equivalents."
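
One way to see how a 16-bit ones'-complement checksum misses whole
classes of errors (a self-contained demo, not taken from the paper):
the sum is commutative, so any corruption that merely reorders aligned
16-bit words leaves the checksum unchanged.

    /* Demo: the Internet checksum is invariant under reordering of
     * 16-bit words, so word-swap corruption is undetectable by it. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint16_t inet_checksum(const uint16_t *words, size_t nwords)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < nwords; i++) {
            sum += words[i];
        }
        while (sum >> 16) {                /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        }
        return (uint16_t)~sum;
    }

    int main(void)
    {
        uint16_t pkt[4] = { 0x1234, 0xabcd, 0x0f0f, 0x5678 };
        uint16_t before = inet_checksum(pkt, 4);

        /* Swap two aligned 16-bit words, as a buggy NIC or driver might. */
        uint16_t tmp = pkt[0]; pkt[0] = pkt[3]; pkt[3] = tmp;

        printf("before=0x%04x after=0x%04x\n", before, inet_checksum(pkt, 4));
        return 0;
    }

It prints the same value both times: exactly the kind of error pattern
that only an application-level check can catch.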


