
Re: [Lzip-bug] Selection of CRC32 Polynomial for lzip

From: Antonio Diaz Diaz
Subject: Re: [Lzip-bug] Selection of CRC32 Polynomial for lzip
Date: Thu, 18 May 2017 22:07:54 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv: Gecko/20110420 SeaMonkey/2.0.14

Damir wrote:
But calculating the CRC is just a small part of the total decompression
time. So, even if you accelerate it, the total speed gain is small.
(Probably smaller than 5%). For compression the speed gain is even smaller.

I can cite your own lzip benchmark: when comparing the decompression
performance of lunzip vs busybox unxz, enabling the CRC in unxz (by using
xz with CRC32) gives a performance penalty of 16.7% (9.723 s vs 8.331 s).
That's a more convincing number than 5%.

Take into account that busybox, gzip and lzip use the classical Sarwate algorithm because it is simple, safe, and runs on all architectures. In the case of lzip, the CRC calculation takes about 10-19% of the total decompression time (depending on dictionary size and compression ratio).

Maybe the fair thing to do is to compare hardware acceleration with software acceleration. This paper[1] states that:

"First, a 'slicing-by-4' algorithm doubles the performance of existing software-based, table-driven CRC implementations based on the Sarwate (August 1988) algorithm while using a 4K cache footprint. Second, a 'slicing-by-8' algorithm triples the performance of existing software-based CRC implementations while using an 8K cache footprint."

[1] http://www.researchgate.net/publication/4167696_A_systematic_approach_to_building_high_performance_software-based_CRC_generators

I do not have any figures about how much hardware acceleration accelerates the calculation of the CRC, but if the paper above is right, slicing-by-8 can reduce the software calculation time to a 3.5-7% of the total decompression time, leaving little margin for further improvements.

CRC32C has a slightly larger Hamming distance than ethernet CRC32 for
"small" packet sizes (see pages 3-4 of [1]). But beyond some size perhaps
not much larger than 128 KiB, both have the same HD of 2. For files
larger than that (uncompressed) size, there is little difference between
the two CRCs.

That's not accurate at all. According to Koopman's CRC32 zoo, CRC32C
keeps HD=4 up to 2 gigabits, while the ethernet CRC does so only up to
92 kilobits.

You are right; the paper I cited was old and incomplete. But if I understand the CRC Zoo correctly as explained at [1], from 2 gigabits to 4 gigabits ethernet CRC32 has a HD of 3 while CRC32C has a HD of 2. These sizes (268 MB to 537 MB) are also well within typical lzip usage.

Also, at a size of 4294967263 bits (537 MB), ethernet CRC32 has been confirmed to have the longest dataword at HD=3 of all possible CRC32 polynomials; twice that of CRC32C.

[1] http://users.ece.cmu.edu/~koopman/crc/notes.html

That's a valid point. There is no good error model describing corruption in
the uncompressed file resulting from typical errors in the compressed file
(BER, burst errors, NAND page errors, HDD sector errors). Still, a better
HD at typical sizes is preferable.

In my experience with LZMA compression, there are two kinds of errors not caught by the decoder:

1) Without error multiplication: the error affects a single byte and is guaranteed to be detected by both CRCs (ethernet and CRC32C).

2) With error multiplication: the total number of bits flipped is larger than 4, which both CRCs detect about equally well.

I have performed many millions of trial decompressions and never found any indication that the current CRC is not perfectly adequate.

  A new decompressor can decompress both old files and new ones, and an
old decompressor can decompress new ones but can't check their integrity.
A large downside, but not very large.

It does not work like that:

$ lzip -t v2.lz
  v2.lz: Version 2 member format not supported.

Best regards,
