Re: [rdiff-backup-users] How much metadata to store


From: Ben Escoto
Subject: Re: [rdiff-backup-users] How much metadata to store
Date: Mon, 02 Dec 2002 23:38:08 -0800

>>>>> "DG" == dean gaudet <address@hidden>
>>>>> wrote the following on Mon, 2 Dec 2002 15:08:39 -0800 (PST)

  DG> i noticed improved performance by enabling noatime,nodiratime in
  DG> the mount options for the mirror fs... but this was ages ago
  DG> with 0.6.x or 0.7.x i forget which.  these options eliminate
  DG> disk writes to update the atimes on files/directories which are
  DG> accessed -- and directories are considered accessed by
  DG> opendir().

  DG> i suspect that the real benefit is in not having to traverse the
  DG> mirror filesystem to get the filelist...

Yes, makes sense.  Plus it would probably be easier to write if we
assume all the metadata is stored in a separate file; that way we
wouldn't have to keep switching back and forth.  I haven't thought
much about corruption issues though - what happens when the computer
crashes while rdiff-backup is writing the metadata file, or when the
mirror gets out of sync with the metadata.

    I don't think there will be any insurmountable problems, but
there may be tricky cases.  In fact, this could raise the complexity
level a few notches.  Right now when updating the destination
directory rdiff-backup tries to change the mirror and the increments
"simultaneously" by writing everything first and then moving both
files into position one after another.  If on the off chance something
occurs in the meantime, I think rdiff-backup tries to back out the
process, and failing that something reasonable still happens.  With
the metadata file there would be 4 things that should happen
"simultaneously": writing to the mirror, making an increment file,
writing to the current metadata store, and writing to the metadata
increment.
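
    For illustration, the "write everything first, then move it into
position" step for the metadata file might look roughly like the
minimal Python sketch below (the function name and record format are
assumptions for the example, not rdiff-backup's actual code):

    import os
    import tempfile

    def write_metadata_atomically(records, dest_path):
        # Write every record to a temporary file in the same
        # directory, then rename it into position, so a crash
        # mid-write leaves the previous metadata file untouched.
        dir_name = os.path.dirname(dest_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=dir_name)
        try:
            with os.fdopen(fd, "w") as outfile:
                for record in records:
                    outfile.write(record + "\n")
                outfile.flush()
                os.fsync(outfile.fileno())   # make sure data hits disk
            os.rename(tmp_path, dest_path)   # atomic on POSIX filesystems
        except BaseException:
            os.unlink(tmp_path)
            raise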

  DG> and even better would be if you could avoid recalculating all
  DG> the signatures and retransmitting them.  it seems like you could
  DG> keep a copy of the mirror metadata on the mirror and the
  DG> primary, and use a signature comparison of the two at the
  DG> beginning of the backup to speed up the file selection.  this
  DG> would help a mirror scale to hundreds of primaries (i suspect
  DG> that the code today won't scale because the mirror has to parse
  DG> all of its files for every primary it has a mirror of).

When doing profiling I've never noticed signature calculation time to
be significant.  Of course it could be, for instance if there is one
huge file which changes all the time.  But I'm not sure if speeding up
signature calculation would actually help anyone.  (Of course tell me
if you've noticed something - that noatime trick is good to know.)
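
    To sketch the kind of check dean is describing: keep a cached
listing of (size, mtime) per path and consult it before computing any
signatures.  The cache format below is hypothetical, just to show the
shape of the idea:

    import os

    def changed_paths(paths, cached):
        # Yield only the paths whose size or mtime differ from the
        # cached metadata; everything else can skip signature work.
        for path in paths:
            st = os.lstat(path)
            if cached.get(path) != (st.st_size, st.st_mtime):
                yield path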

  DG> it'd be pretty cool to do a filesystem extension which allows
  DG> you to store an md5/sha1 of the file as an extended attribute
  DG> which is removed whenever the file is modified :)

Good idea.  Or instead of removing it, this increasingly improbable
filesystem could just have meta-metadata: the hash could be dated, and
we could assume the hash was up-to-date if the hash date (measured in
nanoseconds of course (ignoring the fact that my computer clock seems
to lose 5000000000 nanoseconds every day)) matched the file
modification date.  Also maybe some rsync signature data could be
stored as metadata.
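
    As a rough sketch of that "dated hash" idea (the attribute name is
made up, and os.getxattr/os.setxattr are Linux-specific):

    import hashlib
    import os

    XATTR = "user.backup.sha1"   # hypothetical attribute name

    def cached_sha1(path):
        # Trust the stored hash only if the mtime it was computed
        # against (in nanoseconds) still matches the current mtime.
        st = os.stat(path)
        try:
            stamp, digest = os.getxattr(path, XATTR).decode().split(":", 1)
            if int(stamp) == st.st_mtime_ns:
                return digest
        except (OSError, ValueError):
            pass                 # no attribute yet, or stale format
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        os.setxattr(path, XATTR,
                    ("%d:%s" % (st.st_mtime_ns, digest)).encode())
        return digest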

  DG> it sure is convenient to have all the files available in the
  DG> mirror and to push the compression/packing problems onto the
  DG> filesystem.  (*)

Can we come up with some rule for when we would want to avoid the
filesystem and when we want to use it?  Right now it seems we have two
extremes:

Old rdiff-backup <----------------------------------------> duplicity

and are discussing moving rdiff-backup further to the right.
Duplicity bypasses the filesystem entirely and can be used against,
for instance, an ftp server.  The original rdiff-backup assumed that
the destination filesystem would be used to store all data and
metadata.

    Now the rule I had in mind was that we can bypass the file system
if it doesn't provide necessary services (like certain metadata
functionality).  But now we are discussing bypassing it to speed
things up.  There's nothing wrong with that necessarily, but it might
be nice to have some firmer conceptual ground to make sense of these
choices.

  DG> (*) i'd even extend this to encryption.  but i'm not sure there
  DG> are any really secure encrypted filesystems on free unix
  DG> yet... on linux, using the encrypted loopback mount is not
  DG> secure for large filesystem because such a filesystem has a vast
  DG> amount of predictable data (consider that a typical linux
  DG> install has about ~1GB of exe/lib/etc. data which is easy to
  DG> predict), which allows "known-plaintext" attacks against the
  DG> cipher.

I think encryption is fundamentally different.  Even if you had an
encrypted file system, if you don't trust the remote host, the data
would have to be sent to you encrypted, and then you would decrypt
it.  Plus to avoid giving away, for instance, the size and number of
various files, the system would have to send you blocks of encrypted
data.  So there would be no way to apply the rsync algorithm unless
the signatures were pre-computed.

    About known-plaintext, is this a big issue?  Whenever they do one
of those RSA or whatever challenges they tell everyone the message is
"The password is xxxxxxxxxx", and the key still ends up getting
brute-forced.  Anyway I think keys are generally expected to be
resistant to these kinds of attacks.


-- 
Ben Escoto
