[Duplicity-talk] comments on tar and new file format


From: Travis H.
Subject: [Duplicity-talk] comments on tar and new file format
Date: Thu, 31 Aug 2006 22:42:37 -0500

``Not seek()able inside container: Because tar does not support
encryption/compression on the inside of archives, tarballs nowadays
are usually post-processed with gzip or similar. However, once
compressed, the tar archive becomes opaque, and it is impossible to
seek around inside. Thus if you want to extract the last file in a
huge .tar.gz or .tar.gpg archive, it is necessary to read through all
the prior data in the archive and discard it!''

Also, if you go from compressing one kind of file to another without clearing
out the dictionary, then it will not compress as well.

What you really want is to intelligently group files, perhaps based on extension
or on the output of file(1), and then compress each group as though it were one
unit.  However, that would conflict with your desire to seek around to
individual files.  Perhaps there is a way to store the compressor's state in
front of each file so that you can pick up decompression there.  That would be a
lot of work, and involve a non-standard compression tool, but it would be very
efficient.
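
To make the grouping idea concrete, here is a minimal sketch in Python (the
function name and the use of zlib are my own assumptions, not anything the
format does today) that buckets files by extension and compresses each bucket
as one stream, so similar data shares a dictionary:

import os
import zlib
from collections import defaultdict

def group_and_compress(paths):
    # Bucket files by extension (a stand-in for smarter grouping via file(1)).
    groups = defaultdict(list)
    for p in paths:
        ext = os.path.splitext(p)[1].lower() or "<none>"
        groups[ext].append(p)

    # Compress each bucket as a single stream so its members share a dictionary.
    compressed = {}
    for ext, members in groups.items():
        co = zlib.compressobj(9)
        blob = b""
        for p in members:
            with open(p, "rb") as f:
                blob += co.compress(f.read())
        blob += co.flush()
        compressed[ext] = blob
    return compressed

A real archiver would also need an index of where each member starts inside its
group, which is exactly where this collides with random access.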

``Doesn't support many filesystem features: Tar also doesn't have
support for many features of modern filesystems including access
control lists (ACLs), extended attributes (EAs), resource forks or
other metadata (as on MacOS for instance), or encoding types for
filenames (e.g. UTF-8 vs ascii).''

Yes, this is one of my pet peeves and also applies to network file systems in
addition to archive files.  That is why I use dump/restore on my BSD systems.

Tar files also record a lot of information that is not useful when distributing
software (a very common use of tar).  Does the remote system really need to know
which uid/user owned the file?  Does it need the mtimes and atimes of each file?
Also, if you extract a tarball and re-tar it, it is nearly impossible to get the
exact same tar file back, so you can't download a tarball and a PGP signature,
extract the tarball, and then delete the tarball, if you ever want to verify the
signature again.  If it contained just content and no metadata, that would be
possible.  Also, being paranoid as I am, I wonder about the security
implications of stored metadata; there have been many privacy breaches due to
data secreted in document formats that is virtually invisible to the end user.
It violates the "principle of least surprise".

There is a tool called pax which speaks tar, cpio, and a third format whose name
I forget.  However, it is probably insufficient for your purposes.

Of course I probably don't need to tell any cryptographers that you want to do
compression before encryption.  But fixed headers make for known-plaintext
attacks, and depending on the compression algorithm several of the compressed
bits will be known (for example, the first datum may be prefixed with a 0, and
the second one similarly, unless it happens to be identical to the first, in
which case it will start with a 1, and so on, with probabilistic knowledge
decreasing as we get farther into the file).  This is a consequence of the
compression algorithm bootstrapping its internal state/dictionary, with the
compressed/decompressed size ratio necessarily being more than one while it
learns about the file.  This is one reason to carry state over to similar files.
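
Just as a sketch of how state might be carried (my own approximation, not
anything in the proposed format): zlib lets you seed both the compressor and the
decompressor with a preset dictionary taken from a previously seen, similar
file.

import zlib

# Bytes taken from a previously seen, similar file (an assumed sample).
sample_from_similar_file = b"typical header and boilerplate shared by these files"

def compress_with_dict(data, zdict):
    co = zlib.compressobj(9, zdict=zdict)
    return co.compress(data) + co.flush()

def decompress_with_dict(blob, zdict):
    # The decompressor must be seeded with the same dictionary bytes.
    do = zlib.decompressobj(zdict=zdict)
    return do.decompress(blob) + do.flush()

payload = b"typical header and boilerplate shared by these files, plus new content"
blob = compress_with_dict(payload, sample_from_similar_file)
assert decompress_with_dict(blob, sample_from_similar_file) == payload

It isn't a full snapshot of the compressor's state in front of each file, but it
captures much of the benefit for files that share structure.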

``Perhaps they will be in XML, or perhaps something simpler. Either
way, it should be easy to both parse and extend these sections.''

Please not XML.  It's much harder to write proper XML (tag balance, etc.) than
most Unix configuration files.  Having textual config files and terrific text
manipulation programs is part of what made Unix successful.  XML is popular, but
it's also a solution in search of a problem, and the parsers are not exactly
lightweight.  Consider perhaps a simple format like the one used in LISP/Scheme:
there is no arbitrary distinction between attributes and child tags, no "needs a
close tag/doesn't need a close tag" arbitrariness, and even vi can show balanced
parentheses.

For the binary data, I can imagine a simple syntax:

(bin <length in bytes> <binarydata>)

Note that the length doesn't have to be a fixed-length field, and that "(bin" is
predictable and reasonably uncommon (a 1 in 2^32 chance of appearing at random).
However, parens in the binary data would break balancing algorithms in the
editor.  A simple option that wouldn't hurt overall size too much would involve
escaping them.

Of course vi won't like nulls too much, but it would be compact and
not terribly difficult to edit (emacs will handle it just fine).

You can of course support other encodings trivially:
(base64 ....)
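
To show how little machinery such a record needs, here's a rough sketch in
Python (my own illustration; it assumes a decimal ASCII length, a single space,
the raw bytes, and a closing paren):

import io

def write_bin_record(out, data):
    # (bin <length> <raw bytes>)
    out.write(b"(bin %d " % len(data))
    out.write(data)
    out.write(b")")

def read_bin_record(inp):
    assert inp.read(5) == b"(bin "
    length = b""
    c = inp.read(1)
    while c != b" ":
        length += c
        c = inp.read(1)
    # The reader takes exactly <length> bytes, so parens inside the data are
    # harmless to the parser; only an editor's paren balancing suffers.
    data = inp.read(int(length))
    assert inp.read(1) == b")"
    return data

buf = io.BytesIO()
write_bin_record(buf, b"\x00\x01 (raw bytes) \xff")
buf.seek(0)
assert read_bin_record(buf) == b"\x00\x01 (raw bytes) \xff"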

Regarding redundancy in the file in case it gets corrupted or damaged: I can
appreciate that you want a very robust format, but wouldn't this be better
handled by a wrapper that processes the whole file with an error-correcting code
or something?  Although TCP isn't perfect (the checksum can be fooled by
changing 0xFFFF to 0x0000, for example, since both are zero in one's
complement), it usually works to prevent corruption (as does the checksum at the
IP level, I think), and file transfer protocols often implement end-to-end
integrity checks.  I think it's safe to assume that transmission errors will be
detected by another layer.  If you don't create some general-purpose
error-correction mechanism, this will generate a need and spawn a dozen homebrew
tools for dealing with different kinds of corruption.  Nobody will be interested
in maintaining them for long after they fix their own problem, and it will be a
mess.  The only data corruption error I can see occurring often without
detection is truncation due to a full disk.

It would be trivial to wrap the whole thing in a crc32, so that typical errors
can be detected automatically.  On the other hand, it may be useful to have a
configurable error-correcting/detecting code for each file, so that you needn't
process the whole archive to extract one file and test its integrity.
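
A per-file check is only a few lines; here is a minimal sketch in Python (my own
illustration).  Note that a CRC only detects damage; correcting it would take a
real error-correcting code such as Reed-Solomon layered on top:

import zlib

def crc_of_file(path, bufsize=1 << 16):
    # Streamed so huge members don't need to fit in memory.
    crc = 0
    with open(path, "rb") as f:
        chunk = f.read(bufsize)
        while chunk:
            crc = zlib.crc32(chunk, crc)
            chunk = f.read(bufsize)
    return crc & 0xFFFFFFFF

# Record crc_of_file(p) in the per-file header at archive time, recompute on
# extraction, and treat a mismatch as corruption of just that member.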

Another thing I'm interested in is storing cryptographically secure hashes and
signing the hashes, so that by transitivity one can test the integrity of any
file without having to store a signature on each one.  It should allow for
drop-in replacements, because every hash considered secure a few years ago is
now vulnerable to collisions.  Yes, MD5 is broken, SHA-1 is broken.  I like
SHA-512 and Whirlpool, but perhaps the end user wants to store both, so that a
break in one will not weaken the overall combination.  Overall, hashes have to
get bigger or die, due to the combined effects of the birthday attack and
Moore's law.
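
A sketch of what that might look like (my own illustration; Python's hashlib
ships SHA-512, but Whirlpool needs a third-party module, so SHA-256 stands in as
the second algorithm here):

import hashlib

def multi_digest(path, algorithms=("sha512", "sha256"), bufsize=1 << 16):
    # One pass over the file updates every configured hash.
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        chunk = f.read(bufsize)
        while chunk:
            for h in hashers.values():
                h.update(chunk)
            chunk = f.read(bufsize)
    return {name: h.hexdigest() for name, h in hashers.items()}

# Collect the per-file digests into one manifest and sign only the manifest
# (e.g. with GPG); each file's integrity then follows by transitivity.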

Also, per-file encryption keys would be neat, so you could partially disclose
the contents.  For whole-file encryption, just use GPG.  Remember to keep
the format flexible so that different ciphers can store different-sized IVs
and such.
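
A sketch of per-file keys with a per-file IV, using AES-GCM from the third-party
"cryptography" package (my own assumption; the whole-file case is what GPG
already covers):

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_member(plaintext):
    key = AESGCM.generate_key(bit_length=256)   # a fresh key for this file only
    nonce = os.urandom(12)                      # stored next to the ciphertext
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return key, nonce, ciphertext

def decrypt_member(key, nonce, ciphertext):
    return AESGCM(key).decrypt(nonce, ciphertext, None)

# Disclosing one member means handing over only that member's key; different
# ciphers would store different-sized nonces/IVs, so keep that field variable.
key, nonce, ct = encrypt_member(b"contents of one archive member")
assert decrypt_member(key, nonce, ct) == b"contents of one archive member"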

The diagrams on the website are nice, but I'm still very confused
about outer/inner blocks, block tables, inside indexes/block... can
someone elaborate?

I would encourage you also to look at file(1) and magic(5) and pick a format
that is amenable to that kind of analysis.  Also, take a look at pgpacket.pl, or
its modern equivalent, with respect to making a binary format intelligible
enough to debug.  In a sense it's like a protocol analyzer, but for files.

``Why include two copies of the metadata?? It seemed to be the only
way to get both stream encoding/decoding and random access.''

I'm not sure I understand; does he anticipate that some tools will get the file
in reverse order?  If not, why can't they just check the copy at the beginning
of the file, then seek?  Why do we need to sacrifice simplicity in writing?

Regarding sparse file support: YES, please do support it.  One annoying prank
users sometimes play is creating a HUGE sparse file; then, when the sysadmin
goes to back it up, the backup runs out of space.

Perhaps it would make sense to make the format mimic a file system,
so that there's a simple mapping from one to the other, such that
sparse files are handled simply.
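
Detecting the sparse case up front is cheap; a minimal sketch (my own
illustration, POSIX-specific):

import os

def is_probably_sparse(path):
    st = os.stat(path)
    # st_blocks counts 512-byte units actually allocated on POSIX systems;
    # fewer allocated bytes than the nominal size means the file has holes.
    return st.st_blocks * 512 < st.st_size

# On Linux (Python 3.3+) the holes themselves can be enumerated with
# os.lseek(fd, offset, os.SEEK_DATA) and os.SEEK_HOLE, so the archive can
# record only the data extents instead of gigabytes of zeros.
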
--
"If you're not part of the solution, you're part of the precipitate."
Unix "guru" for rent or hire -><- http://www.lightconsulting.com/~travis/
GPG fingerprint: 9D3F 395A DAC5 5CCC 9066  151D 0A6B 4098 0C55 1484



