
Re: [rdiff-backup-users] Regression errors


From: Bob Mead
Subject: Re: [rdiff-backup-users] Regression errors
Date: Mon, 20 Apr 2009 15:42:30 -0700
User-agent: Thunderbird 2.0.0.18 (X11/20081125)

Hi Maarten:
Thanks for your message - much to think about!

Maarten Bezemer wrote:
Hi,

Maybe a little late, but here goes.


On Tue, 31 Mar 2009, Bob Mead wrote:

The "BUG locked processor" error was a long time ago and, according to an article Andrew directed me to, it was due to a problem with Ubuntu 8.04. At the time, I ran memtest for some hours on the server that produced the error, and it never failed or found any errors. I have not seen that particular error since then, and as a result I am no longer using Ubuntu 8.04.

Depending on the amount of memory in the machine, 'some hours' may or may not have been enough to find certain errors. I've seen machines throw up only 1 error in a 12-hour run of memtest (and again only 1 error in two repeated 24-hour runs), so that error was consistent but not triggered easily. So, if you have the opportunity to do more extensive tests (e.g. over the weekend), please do, just to be sure.


Third, I re-read some of your emails about your situation and what you've been trying to do. Having missing metadata files also might indicate hardware problems. Or maybe it's something related to kernel versions and data corruption on your file systems. Either way, it's pretty bad.

I do not have missing metadata files that I know of. I mis-typed "current-metadata" files for "current-mirror" files in my most recent post. At Andrew's suggestion, I had adjusted the current-mirror file to indicate a prior time to 'fool' rdiff into believing that it had not already run. When I did this (by renaming the file with an earlier date), rdiff did run, but complained about not finding the metadata files and said that it would use the filesystem instead. The backup has not run properly since then.

I don't know exactly what happens when you fool rdiff-backup like that. If it uses the 'current-mirror' marker as "the timestamp indicated in the current-mirror marker is taken as 'now', and all files found in the tree should match this 'now'", then you could very well break things if a subsequent (possibly unfinished) rdiff-backup run changed the files. In that case, mirror_metadata wouldn't match the real file contents. Also, applying reverse-diffs to another version of a file than the one they were built for could screw things up badly.

When I look at the source, it is not clear to me what is the case. Maybe someone with more extensive experience with the sources can comment on this?
I have since moved this data set to a new repo. It has been working fine in its new home. As my scripts remove all increments older than two months, I will wait another few weeks and then delete the original repo and its now [probably] hopelessly broken data set.

Problem #1:
Origin/source server: Linux 2.6.7-gentoo-r5 #2 SMP Wed Nov 30 12:40:39 PST 2005 i686 Intel(R) Pentium(R) 4 CPU 3.06GHz GenuineIntel GNU/Linux.

This is a bit ancient. However, I didn't find any reports on known bugs in this version causing memory or filesystem corruption.

Destination/backup server: Linux 2.6.15-51-amd64-server #1 SMP Tue Feb 12 17:08:38 UTC 2008 x86_64 GNU/Linux

Problem #2:
Origin/source server: Linux 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009 i686 GNU/Linux.

Destination/backup server: Linux 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009 i686 GNU/Linux

These are fairly recent kernels. As far as my information goes, there was a known bug in 2.6.27 prior to 2.6.27.10 related to file locking. I'm not sure if this was fixed in your 2.6.27-11 build (2.6.27-11 not being the same as 2.6.27.11). If you're using a current Ubuntu release and have the latest kernel available for that release, you should be OK.
The last two are both fresh Ubuntu 8.10 installs [using the default kernel supplied]. The older kernel [2.6.15-51] is undoubtedly the default kernel supplied with that distro. So it sounds like there are no known kernel issues at this time.



You wrote earlier that upgrading or doing just anything with the server running rdiff-backup 1.0.4/1.0.5 is out of the question because of lack of resources. An alternative might be to first use rsync to synchronise your data to another server, and then use rdiff-backup from there. That gives you the opportunity to "play around" with different rdiff-backup versions without risking a "total breakdown" of the primary server.

Again, lack of resources prevents me from doing this on a network-wide basis. I don't have any spare servers to rsync to, and the time it would take to do that and then try to rdiff that result somewhere else is beyond the carrying capacity of our network and/or the available times/bandwidth. I am actually working on a buildout of additional servers for placement at each remote site, which will act as local backups, and I will be doing exactly that (rsync to the new local machine and then rdiff from there to the backup server); however, that project may take some months to complete.

Well, it seems that (at this time at least) you have a 'somewhat' broken backup system. Some would say a broken backup system is worse than no system at all (since having one makes people believe the data is safe). So, if that's fine with your boss, then you're out of luck. Otherwise, this might be a perfect reason to have some additional resources assigned to your work. It's just a matter of how valuable the data is, and what the consequences are when it is lost. Would you be fired, or would the blame be on your boss? ;-)

I'll get back to this below..
It's more of 'how do we provide the best solution we can with the resources we have at hand'. Yes, the data is important; no one will be fired if it gets lost; and most importantly, there are no additional resources. So I have to make do with what I've got. I agree that a broken backup system is less than ideal - hence it's been handed to me as job #1 to make it work.

I do have all the new site-backup-servers deployed now, and I rsync each of the site-servers to their respective site-backup-servers daily. Since all of the new site-backup-servers are Ubuntu 8.10 installs and the 'new' backup server is also an 8.10 install, I am hoping to move all the rdiff backups to use the new servers (all of which run v1.1.16, included in the 8.10 repository). It is my hope that running one version of rdiff will simplify things to some degree.


Based on my traceback results, have the regressions actually failed? All I see are messages about 'regressing destination now'. There never seems to be any message about what happens after that.

There's always the --check-destination-dir switch you can run locally on the backup server, to see if the backup tree needs to be cleaned up. The regression is done automatically at the start of a normal backup run when rdiff-backup finds an unclean backup tree, but running rdiff-backup --check-destination-dir does just the cleanup. You might want to try that.
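A minimal invocation, run locally on the backup server (the path below is a placeholder for your repository), would look something like:

# rdiff-backup --check-destination-dir /path/to/rdiff-backup-tree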
I tried this, with varying results. Some of my most recent problem backups (2 to be exact) returned a 'crc-check error'. One other returned with 'OK'. And still another returned with:

Traceback (most recent call last):
  File "/usr/bin/rdiff-backup", line 23, in ?
    rdiff_backup.Main.Main(sys.argv[1:])
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 285, in Main
    take_action(rps)
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 257, in take_action
    elif action == "check-destination-dir": CheckDest(rps[0])
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 854, in CheckDest
    need_check = checkdest_need_check(dest_rp)
  File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 890, in checkdest_need_check
    assert len(curmir_incs) == 2, "Found too many current_mirror incs!"
AssertionError: Found too many current_mirror incs!

I tried renaming [separately] both the oldest and newest current_mirror files in the rdiff-backup-data directory, which then threw up errors about not finding an appropriate metadata file to regress to. Any ideas on how to remedy this, short of starting over with a new repo? When I googled for an answer, I found a thread about using 'rsync with --delete' to remove extra current_mirror files - is there a way to do this with rdiff?
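For reference, the markers that assertion complains about are the current_mirror files inside the rdiff-backup-data directory, and they can be listed directly (the path is a placeholder). As far as I know, a cleanly finished repository has one such marker and an interrupted run leaves two; more than that suggests several runs were interrupted or something was renamed:

# ls -l /path/to/rdiff-backup-tree/rdiff-backup-data/current_mirror.*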

If you're running a recent version of rdiff-backup, you could also try the --verify switch to see if the files in the backup repo match the checksums recorded in the metadata file.
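If your version has it, the invocation is simply (the path is a placeholder):

# rdiff-backup --verify /path/to/rdiff-backup-tree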
Does v1.1.16 support the --verify option?


On the other hand, you once mentioned that one of the servers had a clock that was way off. Only recently I saw something on this mailing list about calculations that used clocks on both sides when they should have used the clock on only one side. Maybe you ran into a similar issue that screwed up your repo?

If you insist on trying to fix this "the software way", I have a suggestion for you. The second problem in your email talks about a 23-hour run of rdiff-backup. Given the size of the backup, I'd say that this was an initial run and there aren't some hundreds of increments in play here?

From my original post (below): "This backup data (241GB) set took several tries to get to run properly, however it did complete successfully on 3/23 (after running for 23 hours to complete)". Perhaps this is not as clear as I thought. Yes, this is the initial run and no there are not any increments. Your wording here leads me to believe that you think this is an erroneous question, perhaps one that ought not to be answered, at least here, or by you. I am not 'insisting' on anything. I asked the list for help on two particular problems I am having - nothing more. If it turns out that it is not the case that either problem I am having has anything at all to do with software, I am more than happy to look elsewhere to solve the problems. I wish I had the experience to see the 'CRC check failed' and immediately go to 'hardware issue'. Unfortunately, I don't. So I ask questions. I apologize if my asking has upset you.

I'm not upset, although my wording could have been a bit unfriendly.
There are a number of things you can try here. Given the fact that it is a large amount of data, we can use it to at least detect some hardware problems.
For example, try this:
# cd /path/to/dataset/location
# find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_run1
# find . -type f -print0 | sort -z | xargs -0r md5sum > /tmp/md5sum_run2
# md5sum /tmp/md5sum_run*
And check that both /tmp/md5sum_run* files have the same checksum. They should have, if there's no rdiff-backup process running.
If the checksums don't match, try:
# diff -u /tmp/md5sum_run1 /tmp/md5sum_run2 | less
And look for the differences. Maybe just one line, maybe a lot of lines.
Do these tests both on the source and on the backup machines.
I will add this to run as a script after the rsync commands in the nightly synchronization process on the source-backup servers (see the sketch below). Depending on the output of that, I can then try the diff step to see what changes. Running this on the source would present a greater challenge, as the data set is composed of /home, /var, /etc, and /root with some exclusions.
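A minimal sketch of such a nightly script, assuming the rsynced copy lives under a single directory; the dataset and output paths are placeholders:

#!/bin/sh
# Hypothetical post-rsync integrity check: checksum the whole tree twice
# and keep a diff around if the two runs disagree.
DATASET=/path/to/rsynced-data        # placeholder
OUTDIR=/var/log/checksum-runs        # placeholder
STAMP=$(date +%Y%m%d)

mkdir -p "$OUTDIR"
cd "$DATASET" || exit 1
find . -type f -print0 | sort -z | xargs -0r md5sum > "$OUTDIR/md5sum_run1.$STAMP"
find . -type f -print0 | sort -z | xargs -0r md5sum > "$OUTDIR/md5sum_run2.$STAMP"

# Identical runs mean the hardware read the same bits twice;
# a mismatch is worth a closer look.
if ! cmp -s "$OUTDIR/md5sum_run1.$STAMP" "$OUTDIR/md5sum_run2.$STAMP"; then
    diff -u "$OUTDIR/md5sum_run1.$STAMP" "$OUTDIR/md5sum_run2.$STAMP" > "$OUTDIR/md5sum_diff.$STAMP"
fi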

I've seen cases where some combinations of chipsets, processors and memory chips go weird. For example, a mainboard based on a Via KT400a chipset, an FSB266 processor and DDR400 RAM modules. Memtest didn't find any problems, and file checksums were usually right, but about 1 out of 20 times they didn't match. Your 200+ GB dataset is likely to show these problems in two runs, but you are of course free to do more tests, creating /tmp/md5sum_run3, etc. I found that clocking the RAM at 133MHz instead of 200 (i.e., matching RAM speed to FSB speed) made the system stable.

Depending on how fast and how often the contents of your dataset change, you could also compare the source and backup /tmp/md5sum_run1 files. When the data changes often, this might be a bit pointless, but see below.


If so, could you try rsync with the --checksum argument to synchronise the backup to the source, and see whether files get updated that, based on their modification time stamps, should not have changed. If you see such files then you're probably just out of luck and need some hardware replaced - either in your computers or in the networking equipment.
Since this is the initial run, there are only files that have changed (all of them) in the repo. I guess I'm not clear on what you're wanting to see here. If I rsync the repo as is, to the source I'm going to see what? Since there is only one backup, and it is the initial run, how will rsyncing that run back to the source files tell me about changed files?

I wasn't entirely clear on this. Normally, rsync bases its decision to sync file contents only on file modification timestamps and sizes. So, files that are corrupted but have the same size and timestamps will not get 'repaired'. When you add the --checksum argument, all files will get checksummed to see if they still match. If you have files in your repo that are not supposed to change often, but are updated when you run rsync with the --checksum argument, this can point to problems. Either with the way they were transferred initially, or with the hardware.
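One hedged way to do that check without changing anything is a dry run with itemized output (the host and paths are placeholders; adjust the direction to match your own setup):

# rsync -aHin --checksum /path/to/source-data/ backup-server:/path/to/rsynced-data/

With -n nothing is actually transferred, and -i lists each file rsync would update together with a change summary, so entries whose timestamps and sizes already match stand out.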

If you don't see any unchanged files being updated, then we're left with the question why rdiff-backup sees a failed CRC checksum. If you didn't mess with metadata files on the given repository, we're looking at some data corruption issue.
I haven't messed with any metadata files. The source data is rsynced daily from the server that it is replacing (new-server runs rsync -aH at 11pm daily to synchronize with old-server). Then that rsynced data set is rdiff'd to the backup server (new-server pushes rdiff-backup at 4pm daily). I purposely have the rdiff sessions start before the rsync sessions, to allow rsync to run overnight before the next day's rdiff. Perhaps the data is being corrupted by the rsync process?

Now the situation is getting more clear to me. What I understand is that you have:
1) source-server:/path/to/data
2) backup-server:/path/to/rsynced-data
3) backup-server:/path/to/rdiff-backup-tree

And you use rsync to sync 1) to 2) and then rdiff-backup to sync 2) to 3). Meaning that at the backup-server you have the dataset twice, once in /path/to/rsynced-data and once in /path/to/rdiff-backup-tree, and these locations are not shared.
You are close. In actual fact, I have 1) as you have described, and 2) [which I'll call the 'source-backup-server' as per your naming convention], and I do use rsync to sync these two. Then I have 3) as you describe, although it's a different physical machine [and in a different location] from 1) or 2) [I have 10 separate sites, each with both a 'source' and a 'source-backup' server]. I currently use rdiff to back up from the 'source' servers to the backup server. I am hoping to migrate to using rdiff to back up [to 3)] the data synced to 2) [the source-backup servers], but I haven't been able to implement that yet.

In that case, you could schedule a find|sort|xargs md5sum run at the source-server and at the backup-server right after the rsync run finishes. Given the time, I'd expect the data usually doesn't change during the night. Then, compare these md5sum files and see if they differ: they shouldn't.
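A rough sketch of that comparison, assuming the per-machine checksum files are gathered onto one host (hostnames and paths are placeholders):

# scp source-server:/tmp/md5sum_run1 /tmp/md5sum_source
# scp backup-server:/tmp/md5sum_run1 /tmp/md5sum_backup
# diff -u /tmp/md5sum_source /tmp/md5sum_backup | less

Because the earlier find commands record paths relative to the dataset root, the two lists are directly comparable as long as the tree layout is the same on both machines.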



As an aside, even if you don't want to rebuild your servers, there still are some ways to compile a new version of rdiff-backup. I had to do this once for some clients that didn't want to upgrade from 1.2.2 to 1.2.5 just yet. It turned out to be relatively easy to install python2.4 + librsync + rdiff-backup in my own home directory, and have multiple versions in active use by not using the standard python site-packages location but setting some environment variables.
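A rough sketch of such a home-directory build, assuming a source tarball and a prefix under $HOME (the version number and paths are only examples):

# tar xzf rdiff-backup-1.2.5.tar.gz && cd rdiff-backup-1.2.5
# python setup.py install --prefix=$HOME/rdiff-local
# export PYTHONPATH=$HOME/rdiff-local/lib/python2.4/site-packages
# $HOME/rdiff-local/bin/rdiff-backup --version

If librsync isn't installed system-wide, it would need the same treatment first (configure --prefix=$HOME/rdiff-local) and the rdiff-backup build pointed at it, e.g. via compiler/linker environment variables.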
I am having enough trouble getting the versions I have to work successfully. None of the errors I am seeing have ever been described as "fixed, upgrade and you will not see these any more". I have seen only one problem that Andrew described as giving a better message in newer versions.

If we don't get any further with the suggestions above, would you consider trying a new version of rdiff-backup if I provide you with a recipe to build it, separate from the normal rdiff-backup package? I'd be willing to help you with that, just to see what we can find. But first, try the suggestions above; maybe we can resolve the issue without it.

Do you think that the older versions of rdiff that I use currently (v1.0.4 and 1.0.5) are in any way causing the errors I am seeing? No one has previously indicated that it is the software version(s) I am using that are the cause of the error(s).

Thanks for your help and suggestions.
   ~bob
