bug#23113: parallel gzip processes trash hard disks, need larger buffers
From: Chevreux, Bastien
Subject: bug#23113: parallel gzip processes trash hard disks, need larger buffers
Date: Tue, 29 Mar 2016 23:03:44 +0000
> From: address@hidden [mailto:address@hidden On Behalf Of Jim Meyering
> [...]
> However, I suggest that you consider using xz in place of gzip.
> Not only can it compress better, it also works faster for comparable
> compression ratios.
xz is not a viable alternative in this case: the use case is not archiving.
There is a plethora of programs out there with zlib support compiled in, and
these won't work on xz-packed data. Furthermore, gzip -1 is approximately 4
times faster than xz -1 on FASTQ files (sequencing data), and the use case
here is "temporary results, so ok-ish compression in a comparatively short
amount of time". Gzip is ideal in that respect: even at -1 it compresses down
to ~25-35%, and that already helps a lot when you need only ~350 GiB of hard
disk instead of 1 TiB. Gzip -1 takes ~4.5 hrs, xz -1 almost a day.
> That said, if you find that setting gzip.h's INBUFSIZ or OUTBUFSIZ to larger
> values makes a significant difference, we'd like to hear about the results
> and how you measured.
Changing INBUFSIZ did not have the hoped-for effect, as this is just the
buffer size allocated by gzip ... in the end it uses at most 64k, and the
read() calls to the file system end up requesting only 32k each. I traced
this through multiple layers down to the function fill_window() in deflate.c,
where things get really intricate, with multiple pre-set variables, defines
and memcpy()s. It became clear that the code is geared towards a 64k buffer
with a rolling window of 32k; optimised for 16-bit machines, that is.
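For what it's worth, here is roughly how I picture that mechanism; a
simplified sketch only, not gzip's actual fill_window():

/* Rough sketch of the rolling-window idea (not gzip's real code):
 * a 64 KiB buffer whose upper 32 KiB half is slid down into the
 * lower half once the scan position has advanced past it, so the
 * underlying read() is only ever asked for small amounts at a time. */
#include <string.h>
#include <unistd.h>

#define WSIZE 0x8000u                     /* 32 KiB window */

static unsigned char window[2 * WSIZE];   /* 64 KiB total buffer */
static unsigned strstart, lookahead;      /* scan position, bytes ahead */

static void fill_window_sketch(int fd)
{
    /* Slide: keep offsets for back-references within 16-bit range.
     * (The real code also has to fix up hash chains and match pointers.) */
    if (strstart >= WSIZE) {
        memcpy(window, window + WSIZE, WSIZE);
        strstart -= WSIZE;
    }
    /* Refill whatever free space remains at the end of the buffer. */
    ssize_t n = read(fd, window + strstart + lookahead,
                     2 * WSIZE - (strstart + lookahead));
    if (n > 0)
        lookahead += (unsigned)n;
}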
There are a few mentions of SMALL_MEM, MEDIUM_MEM and BIG_MEM variants via
defines. However, the code comments say that BIG_MEM would operate on a
complete file loaded into memory ... which is a no-go for files in the range
of 15 to 30 GiB. I'm not even sure the code does what the comments say.
Long story short: I do not feel expert enough to touch those functions and
change them to provide larger input buffering. If I were forced to implement
something, I'd try an outer buffering layer, but I'm not sure it would be
elegant or even efficient.
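To make the idea a bit more concrete, something along these lines; purely
hypothetical, the names buffered_read/BIGBUF_SIZE and the 4 MiB size are
placeholders, and gzip's read path would still need to be rerouted through it:

/* Hypothetical "outer buffering layer": a wrapper around read() that
 * does large physical reads (4 MiB here, an arbitrary choice) and
 * serves the caller's small requests from memory, so many parallel
 * processes issue far fewer, larger disk reads. */
#include <string.h>
#include <unistd.h>

#define BIGBUF_SIZE (4 * 1024 * 1024)

static unsigned char bigbuf[BIGBUF_SIZE];
static size_t buf_len, buf_pos;           /* bytes buffered / consumed */

ssize_t buffered_read(int fd, void *dst, size_t want)
{
    if (buf_pos == buf_len) {             /* buffer drained: refill */
        ssize_t n = read(fd, bigbuf, BIGBUF_SIZE);
        if (n <= 0)
            return n;                     /* EOF or error, pass through */
        buf_len = (size_t)n;
        buf_pos = 0;
    }
    size_t avail = buf_len - buf_pos;
    size_t take = want < avail ? want : avail;
    memcpy(dst, bigbuf + buf_pos, take);
    buf_pos += take;
    return (ssize_t)take;
}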
Best,
Bastien
PS: then again, I'm toying with the idea of writing a simple gzip-packer
replacement which simply buffers data and passes it to zlib.
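Something along the lines of zlib's zpipe example, just with big buffers; a
minimal sketch assuming an 8 MiB chunk size and compression level 1 (both
arbitrary choices):

/* Minimal sketch of such a packer: read stdin in 8 MiB chunks,
 * compress at level 1 (comparable to gzip -1), write the gzip
 * stream to stdout.  Error handling kept to a minimum.
 * Build with: cc -O2 packer.c -lz */
#include <stdio.h>
#include <zlib.h>

#define CHUNK (8 * 1024 * 1024)

int main(void)
{
    static unsigned char in[CHUNK], out[CHUNK];
    z_stream strm = {0};                  /* zalloc/zfree/opaque = Z_NULL */
    int flush;

    /* windowBits = 15 + 16 tells zlib to emit a gzip header/trailer. */
    if (deflateInit2(&strm, 1, Z_DEFLATED, 15 + 16, 8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;

    do {
        strm.avail_in = (uInt)fread(in, 1, CHUNK, stdin);
        strm.next_in  = in;
        flush = feof(stdin) ? Z_FINISH : Z_NO_FLUSH;
        do {                              /* drain all pending output */
            strm.avail_out = CHUNK;
            strm.next_out  = out;
            deflate(&strm, flush);
            fwrite(out, 1, CHUNK - strm.avail_out, stdout);
        } while (strm.avail_out == 0);
    } while (flush != Z_FINISH);

    deflateEnd(&strm);
    return 0;
}

Whether the larger read()/write() granularity alone is enough to stop several
parallel processes from trashing the disks would of course need measuring.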
--
DSM Nutritional Products Microbia Inc | Bioinformatics
60 Westview Street | Lexington, MA 02421 | United States
Phone +1 781 259 7613 | Fax +1 781 259 0615