Re: [Bug-tar] use optimal file system block size


From: Christian Krause
Subject: Re: [Bug-tar] use optimal file system block size
Date: Thu, 19 Jul 2018 11:05:47 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

Dear all,

First, I would like to thank you all for your prompt replies.


To clarify: I do not mean to change the **record size**, which would result in 
an incompatible tar file. I am only interested in the buffer sizes that are 
used to read from and write to block devices.


As far as I understand it (please correct me if I got it wrong), when creating 
a tarball with `tar cf data.tar data`, the following happens:

1.  tar reads the input files from the data directory using the **input buffer 
size**
2.  tar creates records using the **record size**, which depends e.g. on 
command line arguments like `-b`
3.  tar writes the records to the output (block device file, STDOUT, character 
device / tape drive) using the **output buffer size**

There are three different **sizes** at work here: **input buffer**, **record**, 
and **output buffer**. The input buffer and output buffer sizes are the same as 
the record size, which can be verified using Ralph's command line with the `-b` 
option (a blocking factor of 4096 means a record size of 4096 × 512 B = 2 MiB, 
matching the 2M request size reported below):

```
$ strace -T -ttt -ff -o tar-1.30-factor-4k.strace tar cbf 4096 data4k.tar data

$ strace-analyzer io tar-1.30-factor-4k.strace.72464 | grep data | column -t
read   84M  in  1.520   s   (~  55M  /  s)  with  43  ops  (~  2M  /  op,  ~  2M  request  size)  data/blob
write  86M  in  61.316  ms  (~  1G   /  s)  with  43  ops  (~  2M  /  op,  ~  2M  request  size)  data4k.tar
```

Because this changes the **record size**, it produces a different, incompatible 
tar file:

```
$ stat -c %s data.tar data4k.tar
88084480
90177536

$ md5sum data.tar data4k.tar
4477dca65dee41609d43147cd15eea68  data.tar
6f4ce17db2bf7beca3665e857cbc2d69  data4k.tar
```


Please verify: the fact that the input buffer and output buffer sizes are the 
same as the record size is an implementation detail. Both could be decoupled 
from the record size to improve I/O performance without changing the resulting 
tar file, although such a decoupling would entail a substantial refactoring, as 
Jörg suggests.
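
To illustrate the idea (just a rough sketch, not GNU tar's actual buffer code; 
the names `put_record`/`flush_output` and the 2 MiB write buffer are made up 
for this example): the records keep their size, so the archive bytes stay 
identical, but many records are handed to the kernel with a single write(2):

```
/* Rough sketch of a decoupled output buffer, not GNU tar source.
 * Records stay RECORD_SIZE bytes, so the resulting archive is
 * byte-identical; only the write(2) pattern changes. */
#include <string.h>
#include <unistd.h>

#define RECORD_SIZE   (20 * 512)          /* default blocking factor x 512 B */
#define WRITE_BUFFER  (2 * 1024 * 1024)   /* e.g. the file system block size */

static char   out_buf[WRITE_BUFFER];
static size_t out_fill;                   /* bytes currently buffered */

/* Flush everything buffered so far with as few write(2) calls as possible. */
static int flush_output(int fd)
{
    size_t done = 0;
    while (done < out_fill) {
        ssize_t n = write(fd, out_buf + done, out_fill - done);
        if (n < 0)
            return -1;
        done += (size_t)n;
    }
    out_fill = 0;
    return 0;
}

/* Queue one record; the kernel is only entered when the big buffer fills up.
 * The caller flushes once more after the last record. */
static int put_record(int fd, const char record[RECORD_SIZE])
{
    if (out_fill + RECORD_SIZE > sizeof out_buf && flush_output(fd) < 0)
        return -1;
    memcpy(out_buf + out_fill, record, RECORD_SIZE);
    out_fill += RECORD_SIZE;
    return 0;
}
```

With a 2 MiB buffer this turns roughly 200 write syscalls of 10 KiB each into 
one large write, without touching the record format.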


> What network filesystem are you using? Typically, such small IOPS
> should be hidden from the filesystem with readahead and writeback
> cache, though of course there is still more overhead from having
> lots of system calls.

We are using IBM Spectrum Scale (previously known as GPFS). From the Spectrum 
Scale documentation I can see that it uses read-ahead and write-back techniques 
(I don't know much about the internals, though). Still, the performance gain 
from issuing fewer syscalls, and the correspondingly reduced overhead in both 
the OS kernel and the Spectrum Scale software stack, should be measurable.
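
For what it's worth, the preferred I/O size a file system advertises (what 
`stat -c %o` reports; 2 MiB in our case) can also be queried from within a 
program via fstat(2) and st_blksize. A minimal standalone sketch, not tar code:

```
/* Minimal sketch: print the preferred I/O block size (st_blksize) that
 * the file system advertises for a given file.  Not tar code. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return EXIT_FAILURE;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        return EXIT_FAILURE;
    }

    /* This is the same value that `stat -c %o` prints. */
    printf("preferred I/O block size: %ld bytes\n", (long)st.st_blksize);

    close(fd);
    return EXIT_SUCCESS;
}
```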


> bsdtar has a similar optimization.

I can verify this for the input buffer size:

```
$ bsdtar --version
bsdtar 3.2.2 - libarchive 3.2.2 zlib/1.2.8 liblzma/5.0.4 bz2lib/1.0.6

$ strace -T -ttt -ff -o bsdtar-3.2.2-create.strace bsdtar -cf data-bsdtar.tar data

$ strace-analyzer io bsdtar-3.2.2-create.strace.14101 | grep data | column -t
read   84M  in  388.927  ms  (~  216M  /  s)  with  42    ops  (~  2M   /  op,  ~  2M   request  size)  data/blob
write  84M  in  4.854    s   (~  17M   /  s)  with  8602  ops  (~  10K  /  op,  ~  10K  request  size)  data-bsdtar.tar
```

This is not the latest version, though; the write buffer size may have changed 
in later versions.
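
For illustration, here is a rough sketch of the strategy Tim describes in the 
mail quoted below (made-up helper names, not libarchive source): check the 
output descriptor with fstat(2) when the archive is opened, write exact blocks 
only for character devices, and on close pad the archive up to a multiple of 
the requested block size:

```
/* Sketch of the device check and end-of-archive padding Tim describes;
 * illustrative only, not libarchive source. */
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Tape drives and other character devices need exact block-sized writes. */
static bool must_write_exact_blocks(int fd)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return true;                /* be conservative on error */
    return S_ISCHR(st.st_mode);
}

/* On close, pad so the total size is a multiple of the requested block
 * size -- the same bytes as if every block had been written as asked. */
static int pad_to_block_multiple(int fd, off_t bytes_written, size_t block_size)
{
    size_t remainder = (size_t)(bytes_written % (off_t)block_size);
    if (remainder == 0)
        return 0;

    static const char zeros[512];
    size_t todo = block_size - remainder;
    while (todo > 0) {
        size_t chunk = todo < sizeof zeros ? todo : sizeof zeros;
        ssize_t n = write(fd, zeros, chunk);
        if (n < 0)
            return -1;
        todo -= (size_t)n;
    }
    return 0;
}
```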

Best Regards

On 07/19/2018 06:20 AM, Tim Kientzle wrote:
> bsdtar has a similar optimization.
>
> It decouples reads and writes, allowing it to use a more optimal size for
> each side.
>
> When it opens an archive for writing, it checks the target device type.  If
> it’s a character device (such as a tape drive), it writes the requested
> blocks exactly.  When the target device is a block device, however, it
> instead buffers and writes much larger blocks, padding the file at the end
> as necessary to ensure the final size is a multiple of the requested block
> size.  This produces the exact same end result as if it had written blocks
> as requested but much more efficiently.
>
> Tim
>
> On Jul 18, 2018, at 9:58 AM, Andreas Dilger <address@hidden> wrote:
>
>> On Jul 18, 2018, at 9:03 AM, Ralph Corderoy <address@hidden> wrote:
>>
>>> Hi Christian,
>>>
>>>> $ stat -c %o data/blob
>>>> 2097152
>>>> ...
>>>> **tar** does not explicitly use the block size of the file system
>>>> where the files are located, but, for a reason I don't know (feel free
>>>> to educate me), 10 KiB:
>>>
>>> Historic, that being 20 blocks where a block is 512 B.  See `Blocking
>>> Factor'.  https://www.gnu.org/software/tar/manual/tar.html#SEC160
>>>
>>> It can be changed.
>>>
>>>     $ strace -e write -s 10 tar cbf 4096 foo.tar foo
>>>     write(3, "foo\0\0\0\0\0\0\0"..., 2097152) = 2097152
>>>     +++ exited with 0 +++
>>>     $
>>>
>>>> I would like to propose to use the native file system block size in
>>>> favor of the currently used 10 KiB.
>>>
>>> I can't see the default changing.  POSIX's pax(1) states for ustar
>>> format that the default for character devices is 10 KiB, and allows for
>>> multiples of 512 up to and including 32,256.  So you're suggesting the
>>> default is to produce an incompatible tar file.
>>
>> The IO size from the storage does not need to match the recordsize
>> of the tar file.  It may be that writing to an actual tape character
>> device needs to use 10KB writes, but for a regular file on a block
>> device (which is 99% of tar usage) it can still write 10KB records,
>> but just write a few hundred of them at a time.
>>
>> What network filesystem are you using?  Typically, such small IOPS
>> should be hidden from the filesystem with readahead and writeback
>> cache, though of course there is still more overhead from having
>> lots of system calls.
>>
>> Cheers, Andreas


--
Christian Krause

Scientific Computing Administration and Support

------------------------------------------------------------

Email: address@hidden

Office: BioCity Leipzig 5e, Room 3.201.3

Phone: +49 341 97 33144

------------------------------------------------------------

German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig

Deutscher Platz 5e

04103 Leipzig

Germany

------------------------------------------------------------

iDiv is a research centre of the DFG – Deutsche Forschungsgemeinschaft




