bug#23113: parallel gzip processes trash hard disks, need larger buffers
From: Chevreux, Bastien
Subject: bug#23113: parallel gzip processes trash hard disks, need larger buffers
Date: Fri, 25 Mar 2016 16:57:12 +0000
Hi there,
I am using gzip 1.6 to compress large files >10 GiB in parallel (Kubuntu 14.04,
12 cores). The underlying disk system (RAID 10) is able to deliver read speeds
>1 GB/s (measured with flushed file caches, iostat -mx 1 100).
Here are some numbers when running gzip in parallel:
- 1 gzip process: the CPU is the bottleneck in compressing things and
  utilisation is 100%.
- 2 gzips in parallel: the disk throughput drops to a meagre 70 MB/s and the
  CPU utilisation per process is at ~60%.
- 6 gzips in parallel: the disk throughput fluctuates between 50 and 60 MB/s
  and the CPU utilisation per process is at ~18-20%.
- 6 gzips in parallel on the same data residing on an SSD: 100% CPU
  utilisation per process.
Googling a bit, I found this thread on SuperUser where someone saw the same
behaviour with just a single disk: it normally does 125 MB/s, and running 4
gzips drops it to 25 MB/s:
http://superuser.com/questions/599329/why-is-gzip-slow-despite-cpu-and-hard-drive-performance-not-being-maxed-out
The posts there propose a workaround like this:
buffer -s 100000 -m 10000000 -p 100 < bigfile.dat | gzip > bigfile.dat.gz
And indeed, using "buffer" resolves the thrashing problems when working on a
disk system. However, "buffer" is pretty arcane (it isn't even installed by
default on most Unix/Linux installations) and pretty counterintuitive to use.
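For comparison, a rough sketch of the same idea with more widely installed
tools: let dd read the input in large blocks and pipe the result into gzip.
This assumes GNU dd (for the "16M" size suffix) and has not been measured on
the RAID setup described above. The function name and the 16M default are
made up for illustration.

```shell
# Hypothetical helper, not part of gzip: read FILE in large blocks with dd
# and compress the stream with gzip, writing FILE.gz next to the input.
# Usage: compress_with_large_reads FILE [BLOCK_SIZE]
compress_with_large_reads() {
    input=$1
    block_size=${2:-16M}   # assumed default; tune for the disk system at hand
    # dd prints transfer statistics on stderr; they are discarded here
    # (note that this also hides dd's error messages).
    dd if="$input" bs="$block_size" 2>/dev/null | gzip > "$input.gz"
}
```

Called as "compress_with_large_reads bigfile.dat", this leaves bigfile.dat
untouched and writes bigfile.dat.gz. Whether large dd reads help as much as
"buffer" does would have to be checked with iostat on the actual hardware.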
Would it be possible to have bigger buffers by default (1 MB? 10 MB?), or to
have a heuristic in gzip like "if the file to compress is >10 MB and free RAM
is >500 MB, set the file buffer to 1 (10?) MB"?
Alternatively, a command line option to manually set the buffer size?
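To make the proposed rule concrete, here is a minimal sketch of it as a shell
function. It is only an illustration of the heuristic, not anything gzip
implements: the function name, the 64K fallback and the reading of "MB" as
MiB are assumptions, and the caller has to supply the file size and the free
memory.

```shell
# Hypothetical sketch of the proposed rule: use a large buffer only when the
# file is larger than 10 MiB and more than 500 MiB of RAM is free.
# Usage: pick_block_size FILE_BYTES FREE_KB
pick_block_size() {
    file_bytes=$1
    free_kb=$2
    if [ "$file_bytes" -gt $((10 * 1024 * 1024)) ] &&
       [ "$free_kb" -gt $((500 * 1024)) ]; then
        echo 10M    # large buffer, upper value from the proposal
    else
        echo 64K    # assumed small fallback, chosen for illustration
    fi
}

# Example with made-up numbers: a 20 MB file and about 1 GB of free RAM.
pick_block_size 20000000 1000000    # prints 10M
```

The printed size could then be handed to whatever tool does the buffered
read, for example as the bs= argument of dd.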
Best,
Bastien
--
DSM Nutritional Products Microbia Inc | Bioinformatics
60 Westview Street | Lexington, MA 02421 | United States
Phone +1 781 259 7613 | Fax +1 781 259 0615