bug#23113: parallel gzip processes trash hard disks, need larger buffers
From: Mark Adler
Subject: bug#23113: parallel gzip processes trash hard disks, need larger buffers
Date: Sun, 10 Apr 2016 00:49:17 -0700
Bastien,
pigz (a parallel version of gzip) has a variable buffer size. The -b or
--blocksize option allows up to 512 MB buffers, defaulting to 128K. See
http://zlib.net/pigz/
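For example, eight threads with 4 MiB blocks (illustrative values; -b takes
the block size in KiB, and the file name is hypothetical):
    pigz -p 8 -b 4096 reads.fastq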
Mark
> On Mar 29, 2016, at 4:03 PM, Chevreux, Bastien <address@hidden> wrote:
>
>> From: address@hidden [mailto:address@hidden On Behalf Of Jim Meyering
>> [...]
>> However, I suggest that you consider using xz in place of gzip.
>> Not only can it compress better, it also works faster for comparable
>> compression ratios.
>
> xz is not a viable alternative in this case: the use case is not archiving.
> There is a plethora of programs out there with zlib support compiled in, and
> these won't work on xz-packed data. Furthermore, gzip -1 is approximately 4
> times faster than xz -1 on FASTQ files (sequencing data), and the use case
> here is "temporary results, so ok-ish compression in a comparatively short
> amount of time". Gzip is ideal in that respect: even at -1 it compresses the
> data down to ~25-35% of its original size, and that already helps a lot when
> you then need only ~350 GiB of hard disk instead of 1 TiB. Gzip -1 takes
> ~4.5 hrs, xz -1 almost a day.
>
>> That said, if you find that setting gzip.h's INBUFSIZ or OUTBUFSIZ to larger
>> values makes a significant difference, we'd like to hear about the results
>> and how you measured.
>
> Changing INBUFSIZ did not have the hoped-for effect: it is just the buffer
> size allocated by gzip, which in the end uses only 64k of it at most, and
> the read() calls to the file system even end up requesting only 32k each.
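>
> (This is easy to observe: running the compressor under strace, e.g.
> "strace -e trace=read gzip -1 < somefile > /dev/null", should show the
> read() requests capped at 32k.)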
>
> I traced this down through multiple layers to the function fill_window() in
> deflate.c, where things get really intricate, with multiple pre-set
> variables, defines and memcpy()s. It became clear that the code is geared
> towards a 64k buffer with a 32k rolling window; optimised for 16-bit
> machines, that is.
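>
> For reference, the sliding step in fill_window() works roughly like this (a
> heavily simplified sketch; the names follow deflate.c, but the hash-chain
> rebasing and end-of-input handling are omitted):
>
> #include <string.h>
>
> #define WSIZE         0x8000u  /* 32 KiB rolling window */
> #define MIN_LOOKAHEAD 262      /* MAX_MATCH + MIN_MATCH + 1 in gzip */
>
> static unsigned char window[2 * WSIZE]; /* the whole 64 KiB buffer */
> static unsigned strstart;               /* current scan position */
>
> /* Once the scan position nears the end of the 64 KiB buffer, drop the
>  * oldest 32 KiB and shift the newer half down, so that match distances
>  * keep fitting into the 32 KiB window. */
> static void slide_window(void)
> {
>     if (strstart >= 2 * WSIZE - MIN_LOOKAHEAD) {
>         memcpy(window, window + WSIZE, WSIZE);
>         strstart -= WSIZE;
>         /* the real code also rebases match_start, block_start and
>          * every entry of the hash head/prev chains by WSIZE */
>     }
> }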
>
> There are a few mentions of SMALL_MEM, MEDIUM_MEM and BIG_MEM variants via
> defines. However, the code comments say that BIG_MEM operates on the
> complete file loaded into memory, which is a no-go for files in the range
> of 15 to 30 GiB. I'm not even sure the code would do what the comments say.
>
> Long story short: I do not feel expert enough to touch those functions and
> change them to provide larger input buffering. If I were forced to
> implement something, I'd try an outer buffering layer, but I'm not sure it
> would be elegant or even efficient.
>
> Best,
> Bastien
>
> PS: then again, I'm toying with the idea of writing a simple gzip-packer
> replacement that simply buffers data and passes it to zlib.
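>
> A minimal sketch of that idea, using the documented zlib deflate API (the
> 16 MiB buffer size and the minimal error handling are illustrative
> assumptions, not tuned values):
>
> #include <stdio.h>
> #include <zlib.h>
>
> #define CHUNK (16 * 1024 * 1024)  /* 16 MiB I/O buffers: an assumed size */
>
> int main(void)
> {
>     static unsigned char in[CHUNK], out[CHUNK];
>     z_stream s = {0};  /* zalloc/zfree/opaque left as Z_NULL */
>     int flush;
>
>     /* windowBits 15+16 makes zlib emit a gzip wrapper; level 1 matches
>      * the speed/ratio trade-off of gzip -1 */
>     if (deflateInit2(&s, 1, Z_DEFLATED, 15 + 16, 8,
>                      Z_DEFAULT_STRATEGY) != Z_OK)
>         return 1;
>
>     do {
>         s.avail_in = fread(in, 1, CHUNK, stdin);
>         if (ferror(stdin)) { deflateEnd(&s); return 1; }
>         flush = feof(stdin) ? Z_FINISH : Z_NO_FLUSH;
>         s.next_in = in;
>         do {  /* drain all pending compressed output */
>             s.avail_out = CHUNK;
>             s.next_out = out;
>             deflate(&s, flush);
>             fwrite(out, 1, CHUNK - s.avail_out, stdout);
>         } while (s.avail_out == 0);
>     } while (flush != Z_FINISH);
>
>     deflateEnd(&s);
>     return 0;
> }
>
> Run as e.g. "./bufgzip < input.fastq > input.fastq.gz" (binary and file
> names hypothetical); the read() size is then entirely under the caller's
> control.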
>
> --
> DSM Nutritional Products Microbia Inc | Bioinformatics
> 60 Westview Street | Lexington, MA 02421 | United States
> Phone +1 781 259 7613 | Fax +1 781 259 0615