
From: Antonio Diaz Diaz
Subject: Re: Section "2.10.4 The 'Block Check' field" in your paper: Xz format inadequate for long-term archiving
Date: Tue, 28 Mar 2023 23:04:38 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.9.1.19) Gecko/20110420 SeaMonkey/2.0.14

Dear Wolfgang,

Thank you for your message. I'll try to clarify things and maybe improve the wording of Section 2.10.4.

Wolfgang Liessmann wrote:
> in section "2.10.4 The 'Block Check' field" you present a formula
>   Inaccuracy = (compressed_size * Pudc + CS_size) / (compressed_size + CS_size)
> which is not explained (derived in a way so one sees its necessity).

That formula calculates inaccuracy by plain counting of cases. Let's see:

Inaccuracy is defined in the Glossary (http://www.nongnu.org/lzip/xz_inadequate.html#glossary) as:

Inaccuracy: Ratio of error detection mistakes, defined as (false_negatives + false_positives) / total_cases.

For the single-byte error model (one corrupt byte per test), the inaccuracy for each compressed size and check sequence (CS) size is calculated by the following formula (all sizes in bytes):

Inaccuracy = (compressed_size * Pudc + CS_size) / (compressed_size + CS_size)

Where:
  - 'compressed_size * Pudc' counts the false negatives (cases where the check sequence fails to detect corruption in the compressed payload),
  - 'CS_size' counts the false positives (cases where the corrupt byte falls in the check sequence itself, not in the compressed payload), and
  - 'compressed_size + CS_size' is the total size of the file in bytes (the total number of cases).
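Expressed as code, the counting argument looks like this (a minimal sketch; the function and parameter names are mine, not from the article):

  # Inaccuracy by plain counting of cases, for the single-byte error
  # model.  All sizes are in bytes; 'pudc' is the probability that the
  # check sequence fails to detect a corrupt byte in the payload.
  def inaccuracy(compressed_size, cs_size, pudc):
      false_negatives = compressed_size * pudc  # corruption in the payload, missed by the CS
      false_positives = cs_size                 # corruption hitting the CS itself
      total_cases = compressed_size + cs_size   # one corrupt byte anywhere in the file
      return (false_negatives + false_positives) / total_cases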

> In particular, the statement
>   It should be noted that SHA-256 provides worse accuracy than CRC64 for
>   all possible block sizes.
> and the inaccuracy 3e-08 at 1e+09 bytes (= 1 GB) compressed size seem to
> contradict simple cryptographic standards at first glance.

This may be because "cryptographic standards" do not take into account mismatches caused by the corruption of the hash value itself.

If we take Pudc (the probability of a single-byte error going undetected by the check sequence) as 2^-n (see [Koopman], p. 5), where 'n' is the size of the check sequence in bits, then the inaccuracy for the three types of check sequences considered can be calculated for a compressed size 'x' as:

( x * 2^-32  +  4 ) / ( x +  4 ) for CRC32
( x * 2^-64  +  8 ) / ( x +  8 ) for CRC64
( x * 2^-256 + 32 ) / ( x + 32 ) for SHA-256

The largest block size that an xz file can contain is 2^63 bytes. So the inaccuracies for the largest size are:

( 2^63 * 2^-32  +  4 ) / ( 2^63 +  4 ) ~= 2.3e-10 for CRC32
( 2^63 * 2^-64  +  8 ) / ( 2^63 +  8 ) ~= 9.2e-19 for CRC64
( 2^63 * 2^-256 + 32 ) / ( 2^63 + 32 ) ~= 3.5e-18 for SHA-256

When we apply the formula '(false_negatives + false_positives) / total_cases' to a compressed size of 1 GB we get the following inaccuracies:

( 1e9 * 2^-32  +  4 ) / ( 1e9 +  4 ) ~= 4e-9   for CRC32
( 1e9 * 2^-64  +  8 ) / ( 1e9 +  8 ) ~= 8e-9   for CRC64
( 1e9 * 2^-256 + 32 ) / ( 1e9 + 32 ) ~= 3.2e-8 for SHA-256

These values match figure 3 in the article.
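For the curious, a few lines of Python (my sketch, not taken from the article) reproduce both sets of inaccuracies above:

  # Reproduces the inaccuracies above, assuming Pudc = 2^-n with 'n' the
  # size of the check sequence in bits ([Koopman], p. 5).
  cs_sizes = {'CRC32': 4, 'CRC64': 8, 'SHA-256': 32}  # CS sizes in bytes

  for x in (2**63, 10**9):  # largest xz block size, then 1 GB
      for name, cs in cs_sizes.items():
          pudc = 2.0 ** (-8 * cs)
          print(f'{name:8} x = {x:.1e}  inaccuracy = {(x * pudc + cs) / (x + cs):.1e}')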

> The rule of thumb (the mathematical approximation) is that the security,
> measured in bits of security, is half of the hash length (if there is no
> attack on the algorithm, and there is no known effective attack on
> SHA-256 as of now), which means that the probability of a collision of
> two SHA-256 hashes is less than [...]

With respect to the check sequence, the inaccuracy described in the article has nothing to do with cryptography, collisions, security, or attacks. It just relates to the size of the check sequence; the larger the check sequence, the more probable that corruption affects it (causing a false positive). Just to illustrate: if you use a check sequence as large as the compressed payload, half of the bit flips occurring in the resulting file will affect the payload and the other half will affect the check sequence (on average). It is immaterial whether the check sequence is a CRC or a cryptographic hash.

As a SHA-256 hash is 8 times larger than a 32-bit CRC, it is 8 times more probable that a random bit corruption affects it, making the decompressor believe that the payload is corrupt because it does not match the check sequence. Similarly, it is 4 times more probable that a random bit corruption affects a SHA-256 hash than a CRC64. Choosing a check sequence larger than needed is just a bad idea.
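A quick simulation makes this concrete (again a sketch of mine, with an arbitrary 1000-byte payload): corrupt one random byte per trial and count how often it lands in the check sequence:

  # Corrupt one random byte per trial; a byte index >= payload_size
  # falls in the check sequence, causing a false positive.
  import random

  payload_size = 1000
  trials = 1_000_000
  for cs_size in (4, 8, 32):  # CRC32, CRC64, SHA-256
      hits = sum(random.randrange(payload_size + cs_size) >= payload_size
                 for _ in range(trials))
      print(f'{cs_size:2}-byte CS hit in {hits / trials:.3%} of trials')

The 32-byte check sequence is hit about 8 times as often as the 4-byte one, matching the ratios above.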

> While at first glance lzip indeed seems to have a better design than
> xz, and the compression ratio is better comparing the maximum
> compression levels, it is not clear to me why a secure cryptographic
> hash function should perform that badly.

This is explained in [Koopman], p. 33:

"there can be safety tradeoffs with the addition of an error-detection scheme. As with almost all fault tolerance mechanisms, there is a tradeoff between availability and integrity. That is, techniques that increase integrity tend to reduce availability and vice versa. Employing error detection by adding a check sequence to a dataword increases integrity, but decreases availability. The decrease in availability happens through false-positive detections. These failures preclude the use of some data that otherwise would not have been rejected had it not been for the addition of error-detection coding.

False-positive detections can occur in two primary ways. The first occurs because the check sequence adds more bits to the dataword and each bit that is added is another bit that could fail. For checksums and CRCs, a failure in the check sequence bits is indistinguishable from failures in the dataword. Thus, a failure in the check sequence will cause the data to be flagged as failed, which takes place because of the addition of the check sequence. Thus, the addition of the check sequence bits can cause the lack of availability of the dataword to which it is attached."

Best regards,
Antonio.


