[debbugs-tracker] bug#23113: closed (parallel gzip processes trash hard disks, need larger buffers)


From: GNU bug Tracking System
Subject: [debbugs-tracker] bug#23113: closed (parallel gzip processes trash hard disks, need larger buffers)
Date: Tue, 12 Apr 2016 20:19:01 +0000

Your message dated Tue, 12 Apr 2016 13:18:18 -0700
with message-id <address@hidden>
and subject line Re: bug#23113: parallel gzip processes trash hard disks, need 
larger buffers
has caused the debbugs.gnu.org bug report #23113,
regarding parallel gzip processes trash hard disks, need larger buffers
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
23113: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=23113
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: parallel gzip processes trash hard disks, need larger buffers
Date: Fri, 25 Mar 2016 16:57:12 +0000

Hi there,

 

I am using gzip 1.6 to compress large files >10 GiB in parallel (Kubuntu 14.04, 12 cores). The underlying disk system (RAID 10) is able to deliver read speeds >1 GB/s (measured with flushed file caches, iostat -mx 1 100).

Here are some numbers when running gzip in parallel:

1 gzip process: the CPU is the bottleneck in compressing things and utilisation is 100%.

2 gzips in parallel: the disk throughput drops to a meagre 70 MB/s and the CPU utilisation per process is at ~60%.

6 gzips in parallel: the disk throughput fluctuates between 50 and 60 MB/s and the CPU utilisation per process is at ~18-20%.

 

Running 6 gzips in parallel on the same data residing on an SSD: 100% CPU utilisation per process.
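
For illustration, the kind of test described above can be reproduced along these lines (a sketch only; the file names are placeholders, and gzip's -k simply keeps the originals):

  # start 6 gzip jobs in parallel, each compressing its own large file
  for f in file1.dat file2.dat file3.dat file4.dat file5.dat file6.dat; do
      gzip -k "$f" &
  done
  wait

  # in another terminal, watch per-device throughput and utilisation
  iostat -mx 1 100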

 

Googling a bit, I found a thread on SuperUser where someone saw the same behaviour with a single disk; a drive that normally does 125 MB/s drops to 25 MB/s when running 4 gzips:

http://superuser.com/questions/599329/why-is-gzip-slow-despite-cpu-and-hard-drive-performance-not-being-maxed-out

The posts there propose a workaround like this:

  buffer -s 100000 -m 10000000 -p 100 < bigfile.dat | gzip > bigfile.dat.gz

 

And indeed, using "buffer" resolves the thrashing problems when working on a disk system. However, "buffer" is pretty arcane (it isn't even installed by default on most Unix/Linux installations) and pretty counterintuitive.
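
As I understand buffer's flags, the workaround above boils down to reading in roughly 100 kB blocks into a 10 MB in-memory buffer that only starts draining into gzip once it is full:

  # -s: read/write block size, -m: total buffer memory,
  # -p: how full the buffer must be before it starts writing
  buffer -s 100000 -m 10000000 -p 100 < bigfile.dat | gzip > bigfile.dat.gz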

 

Would it be possible to have bigger buffers by default (1 MB? 10 MB?), or to have a heuristic in gzip along the lines of "if the file to compress is >10 MB and free RAM is >500 MB, set the file buffer to 1 (10?) MB"?

 

Alternatively, could there be a command-line option to set the buffer size manually?
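
To illustrate the idea, the heuristic could be approximated today with a small wrapper around the workaround above (a sketch only; the thresholds, buffer sizes and script name are just examples):

  #!/bin/sh
  # gzip-big: use a large input buffer only for big files when RAM is free
  f=$1
  size=$(stat -c %s "$f")                               # file size in bytes
  free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo) # free RAM in kB
  if [ "$size" -gt 10000000 ] && [ "$free_kb" -gt 512000 ]; then
      buffer -s 1000000 -m 10000000 -p 100 < "$f" | gzip > "$f.gz"
  else
      gzip < "$f" > "$f.gz"
  fi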

 

Best,

  Bastien

 

--
DSM Nutritional Products Microbia Inc | Bioinformatics
60 Westview Street | Lexington, MA 02421 | United States
Phone +1 781 259 7613
| Fax +1 781 259 0615

--- End Message ---
--- Begin Message ---
Subject: Re: bug#23113: parallel gzip processes trash hard disks, need larger buffers
Date: Tue, 12 Apr 2016 13:18:18 -0700
On Tue, Apr 12, 2016 at 9:55 AM, Chevreux, Bastien
<address@hidden> wrote:
> Mark,
>
> I knew about pigz, albeit not about -b, thank you for that. Together with -p 
> 1 that would replicate gzip and implement input buffering well enough to be 
> used in parallel pipelines (where you do not want, e.g., 40 pipelines running 
> 40 pigz with 40 threads each).
>
> Questions: how stable / error proof is pigz compared to gzip? I always shied 
> away from it as gzip is so much tried and tested that errors are unlikely ... 
> and the zlib.net homepage does not make an "official" statement like "you 
> should all now move to pigz, it's good and tested enough." Additional 
> question: is there a pigzlib planned? :-)
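
For reference, a single-threaded, block-buffered pigz run along those lines might look like this (the 4 MiB block size, i.e. -b 4096, is just an example):

  pigz -p 1 -b 4096 < bigfile.dat > bigfile.dat.gz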

I expect pigz is stable enough to use with very high confidence.
Paul and I are notoriously picky about such things, and would not be
considering how to deprecate gzip in favor of pigz or to make gzip a
wrapper around pigz if we did not have that level of confidence.

One question for Mark: do you know if pigz has been subjected to AFL's
coverage-adaptive fuzzing? If not, it'd be great if someone could find
the time to do that. If someone does that, please also test an
ASAN-enabled binary and tell us how long the tests ran with no trace
of failure.
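
One possible way to set such a run up (a sketch; it assumes the pigz Makefile honours CC, that afl is installed, and that the corpus directories are placeholders):

  # instrumented build, fuzzing pigz's decompressor via stdin
  make clean && make CC=afl-gcc
  afl-fuzz -i testcases -o findings ./pigz -d

  # ASAN-enabled variant; afl's default memory limit must be lifted for ASan
  make clean && AFL_USE_ASAN=1 make CC=afl-gcc
  afl-fuzz -m none -i testcases -o findings ./pigz -d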

For reference, here's what happened when AFL was first applied to
linux file system driver code:
https://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing,%20Vault%202016.pdf.
If you read nothing else, look at slide 3, with its table of file
system type vs. the amount of time each driver withstood AFL-driven
abuse before first failure.

FYI, anyone can close one of these "issues," and I'm doing so simply
by replying to the usual address@hidden address, but with an
inserted "-done" before the "@": address@hidden


--- End Message ---
