rdiff-backup-users

Re: [rdiff-backup-users] Q. on max-file-size behavior


From: Maarten Bezemer
Subject: Re: [rdiff-backup-users] Q. on max-file-size behavior
Date: Sun, 14 Mar 2010 22:27:33 +0100 (CET)


On Sun, 14 Mar 2010, Whit Blauvelt wrote:

> On Sun, Mar 14, 2010 at 03:31:13PM +0100, Maarten Bezemer wrote:
>> I don't think this is even a corner case. If you want to exclude
>> large files, then a file that is larger than the limit you specify
>> (something you explicitly and deliberately do!) should not be in the
>> backup. Also, it should not _remain_ in the 'current' backup tree,
>> because it would no longer match the original in the source tree.
>> Since rdiff-backup keeps history of the backups, there is no other
>> way than to treat it as 'deleted from the source'. That's the only
>> way to keep the history intact AND have a proper 'current' backup
>> tree.

> Here's how the corner case occurs:
> [snip]

I do understand when your 'problem case' happens. It would occur not only when you lower the maximum file size in later runs, but also when you have files steadily growing past the size limit.

If you tell rdiff-backup "I do not want files larger than X in my backup", then clearly all rdiff-backup can do is... not include them in the backup. As far as the backup application is concerned, there is no difference between "I don't want them" and "They don't exist". You don't want them? Fine, you don't get them. But you also don't get an older version, since that would make no sense either.
Quoting from the manpage:
" When backing up, if a file is excluded, rdiff-backup acts as if that
  file does not exist in the source directory."


> As far as intact history goes, that's a side issue here, isn't it?

No, it's not. That's the whole point.
If rdiff-backup didn't keep history, it could just remove the large file and be done with it. However, rdiff-backup was designed to be able to restore to previous points in time, for example to just before your manager accidentally removed the almost-finished $200,000 tender document that was due tomorrow.

So, files that are no longer in the source tree (or files that you have excluded, either by name or by a size limit; no difference there) are not just deleted: rdiff-backup creates a so-called snapshot and moves it to a proper place in the rdiff-backup-data directory. If you do need that file again, rdiff-backup only needs to restore the snapshot. That snapshot can later be deleted when you decide to remove parts of the history kept by rdiff-backup (--remove-older-than).
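Concretely, the lifecycle looks something like this. Paths and the 100 MB limit are hypothetical; the long options are rdiff-backup's standard ones, but check your version's manpage for exact semantics:

```shell
# Back up, excluding files larger than 100 MB (size is in bytes):
rdiff-backup --max-file-size 104857600 /home/whit /backup/whit

# If a file later grows past the limit (or the limit is lowered),
# the next run treats it as deleted from the source; its last
# backed-up version is kept as a snapshot under
#   /backup/whit/rdiff-backup-data/increments/
# (gzipped, if you haven't disabled compression).

# Restore the whole tree as it was three days ago:
rdiff-backup --restore-as-of 3D /backup/whit /tmp/whit-3-days-ago

# Later, prune history (including such snapshots) older than 4 weeks:
rdiff-backup --remove-older-than 4W /backup/whit
```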

The normal 'current' backup tree always contains the exact same files as the source tree. rdiff-backup never gzips files in the current tree. Only the snapshots and diffs in the rdiff-backup-data directory can, at the user's choice, be gzipped.


But...

I think your problem is not with the gzipping. I think you want to use rdiff-backup in a way it was never designed to be used. So, instead of commenting on several other "misunderstandings" in your email, I'll focus on what I think triggered this discussion:

> That might not just avoid treating a file as if deleted on the original when
> it hasn't been, but support actions like running rdiff-backup at regular
> intervals during working hours just against smaller files, while running a
> daily backup of even the large stuff every night, without having to
> establish two redundant backup spaces to accommodate this.

That's just a Bad Idea (tm). The whole idea of "restore to a specific point in time" implies that you then get back the tree as it was at the time you specified. Not a tree with small files from that date/time, and with large files from an earlier date.

You do have a few options to get what you want.
For example, you could do a two-stage backup, using rsync to regularly sync the source tree to a shadow tree, and exclude-but-not-delete large files. And then use rdiff-backup to backup the shadow tree right after each rsync run. Overnight, run a full rsync and again a normal rdiff-backup, and it will update the larger files as well. This indeed uses a lot of extra disk space and thus sort of defeats the purpose.
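A sketch of that two-stage setup. Paths and the size limit are hypothetical; rsync's --max-size and --delete are standard options:

```shell
# Hourly, during working hours: sync only the small files to a shadow
# tree. --max-size makes rsync skip files over the limit, and --delete
# only removes files that are absent on the source, so large files
# already present in the shadow tree are kept at their last-synced
# version ("exclude-but-not-delete").
rsync -a --delete --max-size=100m /home/whit/ /shadow/whit/
rdiff-backup /shadow/whit /backup/whit

# Nightly: full sync, large files included, then the same backup run.
rsync -a --delete /home/whit/ /shadow/whit/
rdiff-backup /shadow/whit /backup/whit
```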

So, why not just use both --max-file-size and --min-file-size on two separate backup trees? That would exclude the large files from the smallfiles-tree, and the small files from the largefiles-tree, so no redundancy. And you can use different backup schedules for both trees.
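Something like the following, with hypothetical paths and a hypothetical 100 MB boundary. The +1 assumes --max-file-size excludes files strictly larger than the limit while --min-file-size excludes files strictly smaller; verify against your version's manpage so that a file exactly at the boundary lands in exactly one tree:

```shell
LIMIT=104857600   # 100 MB, in bytes

# Frequent runs, small files only:
rdiff-backup --max-file-size "$LIMIT" /home/whit /backup/whit-small

# Nightly runs, large files only:
rdiff-backup --min-file-size "$((LIMIT + 1))" /home/whit /backup/whit-large
```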

To make things easier, I think I'd just create two backup trees, based on file paths. Huge files with sizes like you mentioned usually show up in well-defined places in a file system, and not just between a normal user's mozilla preferences file and a list of recently opened documents. So you could even use a --max-file-size for the normal backup tree, and warn the users that they CAN use larger files there, but they will NOT be backed up, so no complaining if they get deleted, corrupted, or lost.
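Before relying on such a warning, you can audit which files would fall over a given limit with plain find(1) (GNU find assumed for -printf; path and limit hypothetical). Note that -size with a k suffix rounds file sizes up to whole KiB:

```shell
LIMIT=104857600   # 100 MB, in bytes

# List files a --max-file-size of $LIMIT would exclude, biggest first:
find /home -type f -size +"$((LIMIT / 1024))"k -printf '%s\t%p\n' | sort -nr
```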


> Good points. But let me rephrase the claims more clearly. (Language can be
> too broad a brush for technical discussions.) If the user's goal is to
> compromise [snip]

If you want to compromise, you don't get what you want, and you also get things you don't want. That's not only a matter of language; it's just something you don't want when designing a backup system. If you want speed (assuming, for the sake of argument, that gzipping is your only problem), just get larger disks. An extra 1 TB of disk space costs way less than changing rdiff-backup into something it was never designed to be.

Plus, gzipping might indeed take eons to complete on a 16GB file, but your suggestion wouldn't do anything to improve the speed of:
- the part where librsync creates a local copy of the current version of
  the file in the source tree
- the part where a diff is created to be able to go from the current
  version to the previous version
- the part where that possibly large diff is stored into the
  rdiff-backup-data directory.
(Where the first two might very well take even more time than gzipping the file.)
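Those librsync steps can be watched in isolation with the rdiff(1) tool that ships with librsync (filenames hypothetical; note that rdiff-backup stores its diffs in the reverse direction, from the current version back to the previous one):

```shell
# Signature of one version of the file:
rdiff signature old.iso old.sig

# Delta from that version to another -- this is essentially the
# expensive diff-generation step listed above:
rdiff delta old.sig new.iso old-to-new.delta

# Applying the delta to the old file reproduces the new one:
rdiff patch old.iso old-to-new.delta rebuilt.iso
```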

Actually, your suggestion would only help for large files being deleted (or excluded) from the source tree. For your suggestion to be really useful, you would need to have a source tree that has this happening on a regular basis. And in that case, the time spent in gzipping will be so much less of a problem than the amount of disk space that will be used by all the increments. (Or you would need to keep such a short history that you shouldn't be using rdiff-backup at all, making this discussion moot anyway.)


Maarten



