[Duplicity-talk] Re: [rdiff-backup-users] Pretty pictures and new version of proposal

From: Ben Escoto
Subject: [Duplicity-talk] Re: [rdiff-backup-users] Pretty pictures and new version of proposal
Date: Tue, 30 Sep 2003 14:05:42 -0700

On Mon, 29 Sep 2003 14:17:31 -0500
John Goerzen <address@hidden> wrote:
> That is exactly one thing I was thinking :-) I really don't see what
> it buys anybody.  If the index contains an offset to the start of
> the metadata in the regular stream, is that not enough?  Any
> extraction problem could seek to that offset, read the metadata and
> continue reading straight on into the file's data.

The idea is that most searches through the space will be done using
metadata, not the actual contents.  So you are right, if I want the
contents of some particular file, I can just go there anyway.  But
suppose I am looking for all files which have changed in the last
month.  This could be done just by reading a small part of the
archive, instead of moving through the entire thing.
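As a sketch of that search pattern (record fields are hypothetical, not from the proposal): a query like "files changed in the last month" touches only the compact metadata records, and the stored offset is followed only if the contents are actually needed.

```python
import time

# Hypothetical metadata records as a separate metadata stream might
# expose them; "offset" points at the file's entry in the data stream.
records = [
    {"path": "a.txt", "mtime": time.time() - 86400, "offset": 0},
    {"path": "b.txt", "mtime": time.time() - 90 * 86400, "offset": 4096},
]

def changed_in_last_month(metadata_records):
    # Answer the query from metadata alone; no file data is read.
    cutoff = time.time() - 30 * 86400
    return [r["path"] for r in metadata_records if r["mtime"] >= cutoff]

print(changed_in_last_month(records))  # only the recently modified file
```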

> Also, I don't know what storing the contents of a directory does for
> you, since simply scanning the index could give that information
> anyway.

From Will Dyson's and others' comments, when a file system looks up a
file, it expects to get its inode number from the directory contents.
Scanning the index instead is possible, but I think it would make many
operations very slow.
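A minimal sketch of why that helps, assuming a directory's stored contents map entry names to inode numbers (the layout here is hypothetical): a lookup is one read of the directory payload rather than a pass over the whole index.

```python
# Hypothetical directory payload: names mapped to inode numbers, as a
# file system expects when resolving a path component.
directory = {".": 2, "..": 2, "report.txt": 1041, "src": 1042}

def lookup(dir_contents, name):
    # One dictionary read; no scan of the archive-wide index.
    return dir_contents.get(name)

print(lookup(directory, "report.txt"))  # 1041
```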

> And finally, I think that the argument about the compressibility of the
> metadata is a non-starter since the format doesn't propose compressing the
> metadata (only the actual file data) and that's not something that's going
> to be good for random seeks and performance anyway.

Actually the proposal does call for the metadata to be compressed.

In general, as the page says, I expect that having two copies of the
metadata will only expand the archive by about 0.1%.  The increased
ease of searching and listing seems well worth it.  The only
doubt I have is that in the future average file size may drop by an
order of magnitude or more.  Then the tradeoff wouldn't be so obvious.
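The 0.1% figure is easy to sanity-check with assumed sizes (the numbers below are illustrative, not from the proposal):

```python
# Assumed sizes: ~100 KB average file, ~128 bytes of compressed
# metadata duplicated per file.
avg_file_size = 100 * 1024
metadata_copy = 128

overhead = metadata_copy / avg_file_size
print(f"{overhead:.2%}")  # roughly 0.1%

# If average file size drops by an order of magnitude, the same
# duplication costs about ten times as much, relatively.
print(f"{metadata_copy / (avg_file_size // 10):.2%}")
```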

>  * You talk about requiring a root directory header.  Sometimes people just
>    want to store a file or three, and there is no real directory to list
>    as a root.

rdiff-backup and duplicity already work with singly-rooted file sets,
and this is probably necessary anyway to mount as a file system.
Three random files could be stored under the root (whose properties
may not be important).

>  * Regarding error correction -- every file should absolutely have some
>    sort of modern checksum (MD5, SHA, etc) associated with it.  Also,
>    file header blocks should start with a recognizable byte sequence,
>    so an extraction problem can make a reasonable attempt to recover an
>    archive starting at any arbitrary position within it (for instance,
>    if the dog ate the first 10 meters of tape)
>  * The information in the archive header should be instead (or better,
>    also) stored at the beginning of the index.  Otherwise, random
>    access will be worse.

Thanks, excellent advice.  You wouldn't want your archive to be toast
if the first 100 bytes get messed up.
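A rough sketch of the recovery idea, with a made-up magic sequence (a real format would define its own): scan forward for the per-file header marker, then use the file's checksum to confirm that what follows is intact.

```python
import hashlib

# Hypothetical per-file header magic; not defined by the proposal.
MAGIC = b"\x89ARH\r\n"

def recover_headers(damaged: bytes):
    # Scan from an arbitrary position for recognizable file headers,
    # e.g. after the start of the archive has been destroyed.
    positions = []
    i = damaged.find(MAGIC)
    while i != -1:
        positions.append(i)
        i = damaged.find(MAGIC, i + 1)
    return positions

def verify(data: bytes, stored_digest: bytes) -> bool:
    # A modern checksum (SHA-256 here) detects corrupted file data.
    return hashlib.sha256(data).digest() == stored_digest

payload = b"hello"
archive = b"\x00garbage\x00" + MAGIC + payload
print(recover_headers(archive))  # header found despite leading garbage
print(verify(payload, hashlib.sha256(payload).digest()))  # True
```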

>  * Some information in the archive header should be instead stored in
>    the file header.  This would allow, for instance, some files to be
>    compressed with gzip, others with bzip2, and still others with cat :-)
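A sketch of how a per-file compression field might be consumed on extraction (the field name and values are assumptions, not from the proposal):

```python
import bz2
import gzip

# Hypothetical per-file header field naming the compressor, so each
# file can independently pick gzip, bzip2, or none ("cat").
DECOMPRESSORS = {
    "gzip": gzip.decompress,
    "bzip2": bz2.decompress,
    "none": lambda data: data,
}

def read_file(header: dict, payload: bytes) -> bytes:
    # Dispatch on the header rather than on an archive-wide setting.
    return DECOMPRESSORS[header["compression"]](payload)

original = b"some file contents"
print(read_file({"compression": "bzip2"}, bz2.compress(original)))
print(read_file({"compression": "none"}, original))
```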


Ben Escoto

Attachment: pgpO3dMVaLo4o.pgp
Description: PGP signature
