Re: [rdiff-backup-users] Regression errors


From: Bob Mead
Subject: Re: [rdiff-backup-users] Regression errors
Date: Tue, 31 Mar 2009 14:26:39 -0700
User-agent: Thunderbird 2.0.0.18 (X11/20081125)

Hello Maarten:
Thanks for your response. I will try to answer your points one at a time, hopefully clarifying what the problems are and steps I have taken to resolve them.

Maarten Bezemer wrote:
Hi Bob,

First, let me say that your situation is not quite like mine. I use rdiff-backup started from the backup server, so I'm doing "pull style" instead of your "push style". Also, I run rdiff-backup as normal user on the backup side and as root on the source side (since I obviously cannot read each user's files as non-root). Rdiff-backup keeps records of metadata separately, so I don't need root at the backup server.

Second, I looked back in my mail archives and found this:
When I call rdiff using this cmd, it locks up the destination server
(console showed 'BUG locked Processor 1 for 11s' messages).
I unfortunately cannot find any document describing this error message. However, it is possibly a kernel message. If so, it might indicate that some of your hardware is dying. Failed CRC checksums mostly come from broken hardware (RAM, CPU, hard drives, hard drive cables, or power supply, to name just a few). There is nothing you can do about that with software.

The 'BUG locked Processor' error was a long time ago and, according to an article Andrew directed me to, it was due to a problem with Ubuntu 8.04. At the time, I ran memtest for some hours on the server that produced the error, and it never failed or found any errors. I have not seen that particular error since, and as a result I am no longer using Ubuntu 8.04.

Third, I re-read some of your emails about your situation and what you've been trying to do. Having missing metadata files also might indicate hardware problems. Or maybe it's something related to kernel versions and data corruption on your file systems. Either way, it's pretty bad.

I do not have missing metadata files that I know of. I mis-typed "current-metadata" files for "current-mirror" files in my most recent post. At Andrew's suggestion, I had adjusted the current_mirror file to indicate a prior time, to 'fool' rdiff-backup into believing that it had not already run. When I did this (by renaming the file with an earlier date), rdiff-backup did run, but complained about not finding the metadata files and said that it would use the filesystem instead. The backup has not run properly since then.
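For what it's worth, the session time rdiff-backup associates with a current_mirror marker is encoded in the marker's file name itself, which is why renaming the file shifts the apparent session time. A minimal Python sketch (the file name here is just an example, not one from my repository):

```python
# The session time lives in the marker file name, e.g.
#   current_mirror.2009-03-23T16:00:00-07:00.data
# so renaming the marker is what changes the apparent session time.
import re
from datetime import datetime

def session_time(marker_name):
    """Parse the session timestamp out of a current_mirror marker name."""
    m = re.match(r"current_mirror\.(.+)\.data$", marker_name)
    if not m:
        raise ValueError("not a current_mirror marker: %s" % marker_name)
    # Ignore the numeric UTC offset at the end and parse the local time part.
    return datetime.strptime(m.group(1)[:19], "%Y-%m-%dT%H:%M:%S")

print(session_time("current_mirror.2009-03-23T16:00:00-07:00.data"))
```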

Before going any further, please make sure you're using reliable hardware on all servers. Check for leaking capacitors on your mainboards or inside the power supply. Next, get a memory testing program (memtest86+ or memmxtest) and run it overnight. (I say overnight because it needs to run for at least a few hours and the server needs to be brought down, which usually isn't possible during day-time hours.) Hard disk diagnostic tools might also come in handy.
If everything turns out to be OK, we can start suspecting software bugs.
Oh, by the way, could you give us the kernel versions of the machines you're using? (Copy/paste the output of "uname -a".) Some kernel versions are known to cause data corruption in certain file system types.
Problem #1:
Origin/source server: Linux 2.6.7-gentoo-r5 #2 SMP Wed Nov 30 12:40:39 PST 2005 i686 Intel(R) Pentium(R) 4 CPU 3.06GHz GenuineIntel GNU/Linux.

Destination/backup server: Linux 2.6.15-51-amd64-server #1 SMP Tue Feb 12 17:08:38 UTC 2008 x86_64 GNU/Linux

Problem #2:
Origin/source server: Linux 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009 i686 GNU/Linux.

Destination/backup server: Linux 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009 i686 GNU/Linux


   You wrote earlier that upgrading or doing just anything with the
   server running rdiff-backup 1.0.4/1.0.5 is out of the question
   because of lack of resources. An alternative might be to first use
   rsync to synchronise your data to another server, and then use
   rdiff-backup from there. That gives you the opportunity to "play
   around" with different rdiff-backup versions without risking a
   "total breakdown" of the primary server.


Again, lack of resources prevents me from doing this on a network-wide basis. I don't have any spare servers to rsync to, and the time it would take to do that and then rdiff that result somewhere else is beyond the carrying capacity of our network and the available time and bandwidth. I am actually working on a buildout of additional servers for placement at each remote site, which will act as local backups, and I will be doing exactly that (rsync to the new local machine, then rdiff from there to the backup server); however, that project may take some months to complete.
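The plan for those new local servers would look roughly like this; a hypothetical sketch only (host names, paths, and the helper are placeholders, not anything from our actual network):

```python
# Hypothetical two-stage scheme: rsync the data to a local staging host
# first, then push increments to the backup server with rdiff-backup.
# SRC, STAGE, and REPO are placeholders, not real paths.
import subprocess

SRC = "/srv/data/"
STAGE = "stagehost:/srv/staging/data/"
REPO = "root@backupserver::/home/backups/data"

def two_stage_backup(run=subprocess.check_call):
    # Stage 1: plain mirror; cheap to re-run and easy to verify.
    run(["rsync", "-aH", "--delete", SRC, STAGE])
    # Stage 2: rdiff-backup keeps the increment history, pushed from staging.
    run(["rdiff-backup", "--print-statistics", "/srv/staging/data", REPO])

# Passing a different `run` lets the commands be inspected without executing:
two_stage_backup(run=lambda cmd: print(" ".join(cmd)))
```

That way, a failed or experimental rdiff-backup run only ever touches the staging copy, never the primary server.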


The things you wrote made me a bit nervous. Like this: "The work around seemed to be the renaming of the current meta-data file to a time prior to the next run of rdiff." Doing such things is very likely to screw up any repository... especially regressions to previous states WILL break when the metadata files are messed up. I've been using rdiff-backup for years now, and not a single time did a regress fail on me.
Based on my traceback results, have the regressions actually failed? All I see are messages about 'regressing destination now'. There never seems to be any message about what happens after that.
And yes, I've had rdiff-backup regress my repo quite often, since ADSL links haven't always been as stable as they are today. Also, I never had to do special things to metadata timestamps or whatever.
See above - I meant to write 'current-mirror files', not 'current-metadata files'; my mistake.

On the other hand, you once mentioned that one of the servers had a clock that was way off. Only recently I saw something on this mailing list about using the clocks of both sides for calculations that should have used the clock of only one side. Maybe you ran into a similar issue that screwed up your repo?

   If you insist on trying to fix this "the software way", I have a
   suggestion for you. The second problem in your email talks about a
   23-hour run of rdiff-backup. Given the size of the backup, I'd say
   that this was an initial run and there aren't some hundreds of
   increments in play here?

From my original post (below): "This backup data (241GB) set took several tries to get to run properly, however it did complete successfully on 3/23 (after running for 23 hours to complete)". Perhaps this is not as clear as I thought. Yes, this is the initial run and no there are not any increments. Your wording here leads me to believe that you think this is an erroneous question, perhaps one that ought not to be answered, at least here, or by you. I am not 'insisting' on anything. I asked the list for help on two particular problems I am having - nothing more. If it turns out that it is not the case that either problem I am having has anything at all to do with software, I am more than happy to look elsewhere to solve the problems. I wish I had the experience to see the 'CRC check failed' and immediately go to 'hardware issue'. Unfortunately, I don't. So I ask questions. I apologize if my asking has upset you.
If so, could you try rsync with the --checksum argument to synchronise the backup to the source, and see if files are being updated that should not have changed based on their modification time stamps? If you see such files, then you're probably just out of luck and need some hardware replaced, either in your computers or in the networking equipment.
Since this is the initial run, every file in the repo counts as changed (all of them are new). I guess I'm not clear on what you want to see here. If I rsync the repo as-is back to the source, what am I going to see? Since there is only one backup, and it is the initial run, how will rsyncing that run back to the source files tell me about changed files?
If you don't see any unchanged files being updated, then we're left with the question of why rdiff-backup sees a failed CRC checksum. If you didn't mess with metadata files on the given repository, we're looking at some data corruption issue.
I haven't messed with any metadata files. The source data is rsynced daily from the server that it is replacing (new-server runs rsync -aH at 11pm daily to synchronize with old-server). Then that rsynced data set is rdiff'd to the backup server (new-server pushes rdiff-backup at 4pm daily). I purposely have the rdiff sessions start before the rsync sessions, to allow rsync to run overnight before the next day's rdiff. Perhaps the data is being corrupted by the rsync process?
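If I understand the --checksum suggestion, it boils down to comparing file contents by hash rather than by size and mtime. A minimal Python sketch of the same idea (the helper names are made up; the paths would be ours):

```python
# Sketch of what rsync --checksum effectively checks: whether file
# contents differ between two trees, regardless of size/mtime.
import hashlib
import os

def file_digest(path, chunk=1 << 16):
    """MD5 of a file's contents, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def differing_files(src_root, dst_root):
    """Yield relative paths whose content differs (or is missing) in dst_root."""
    for dirpath, _, names in os.walk(src_root):
        for name in names:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.exists(dst) or file_digest(src) != file_digest(dst):
                yield rel
```

Files that turn up here despite identical modification times would point at silent corruption somewhere between the two trees.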


As an aside, even if you don't want to rebuild your servers, there are still ways to compile a new version of rdiff-backup. I had to do this once for some clients that didn't want to upgrade from 1.2.2 to 1.2.5 just yet. It turned out to be relatively easy to install python2.4 + librsync + rdiff-backup in my own home directory, and to keep multiple versions in active use by not using the standard python site-packages location but setting some environment variables instead.
I am having enough trouble getting the versions I have to work successfully. None of the errors I am seeing have ever been described as "fixed, upgrade and you will not see these any more". I have seen only one problem that Andrew described as giving a better message in newer versions.


I hope I gave you enough pointers to work with for now. Please report back to the list if you have any news.

Regards,
 Maarten


On Thu, 26 Mar 2009, Bob Mead wrote:

Hello all:
I have a series of rdiff-backups that run every day to back up 10 remote sites and a total of 14 different servers. It seems that each day, at least one of the backups fails. I have been working at getting these to run flawlessly for 6 months, but it seems beyond my grasp. In the last week, I thought I was hot on the trail of a 'perfect run', but now I'm not so sure. For the past few days I have been having trouble with the same two servers' backups. These are push-type backups (as are all my backup jobs), with the remote servers running backup scripts that rdiff to, in this case, two different destination/backup servers.

In the first case: an older Gentoo Linux system (running v1.0.5; dest. has v1.0.4), with the following commands:

rdiff-backup --force --print-statistics --include /etc --include /home --include /var --include /root --exclude / / root@<servername>::/home/backups/dor
rdiff-backup --force --remove-older-than 2M root@<servername>::/home/backups/dor

(I added the --force option to test whether that would clear up the regression problem; it didn't.)

returned this as output:
Previous backup seems to have failed, regressing destination now.
Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in <module>
  rdiff_backup.Main.Main(sys.argv[1:])
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 285, in Main
  take_action(rps)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 255, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 299, in Backup
  backup_final_init(rpout)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 396, in backup_final_init
  checkdest_if_necessary(rpout)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 911, in checkdest_if_necessary
  dest_rp.conn.regress.Regress(dest_rp)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/connection.py", line 445, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/connection.py", line 367, in reval
  if isinstance(result, Exception): raise result
IOError: [Errno None] None: None
Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in ?
  rdiff_backup.Main.Main(sys.argv[1:])
File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 285, in Main
  take_action(rps)
File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 253, in take_action
  connection.PipeConnection(sys.stdin, sys.stdout).Server()
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 352, in Server
  self.get_response(-1)
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 314, in get_response
  try: req_num, object = self._get()
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 230, in _get
  raise ConnectionReadError("Truncated header string (problem "
rdiff_backup.connection.ConnectionReadError: Truncated header string (problem probably originated remotely)

At some point recently (3/20), this backup worked. Then it started to fail, giving 'regressing destination' errors each time it has run since then. This is the same backup I posted about recently, where I had to 'pull the wool' over rdiff's eyes because of a server date malfunction. The work-around seemed to be the renaming of the current meta-data file to a time prior to the next run of rdiff. That seemed to work, in that it didn't complain about too many current mirror files, but it did make rdiff unable to 'see' the metadata file and therefore use the filesystem instead. Perhaps these problems are related? If so, any ideas on how to get it working again would be greatly appreciated. There should be two months of increments stored in the repository, so I don't want to lose those by starting over.
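As I understand it, rdiff-backup decides that a previous session failed, and that a regress is needed, when it finds more than one current_mirror marker in the rdiff-backup-data directory. A simplified sketch of that check (the directory listing below is invented):

```python
# rdiff-backup regresses the destination when more than one
# current_mirror marker is present, i.e. a session never finished.
import fnmatch

def needs_regress(data_dir_listing):
    """Simplified check: more than one current_mirror marker means regress."""
    markers = fnmatch.filter(data_dir_listing, "current_mirror.*.data")
    return len(markers) > 1

listing = [
    "current_mirror.2009-03-20T16:00:00-07:00.data",
    "current_mirror.2009-03-21T16:00:00-07:00.data",  # interrupted session
    "mirror_metadata.2009-03-20T16:00:00-07:00.snapshot.gz",
]
print(needs_regress(listing))  # True: two markers, so a regress is triggered
```

That is why messing with the marker's name can leave the repository looking permanently half-finished.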

The second failed backup is a brand-new install of Ubuntu 8.10 running rdiff-backup v1.1.16, pushing backups to another fresh 8.10 install also running rdiff-backup v1.1.16, using the following commands:

rdiff-backup --force --print-statistics --exclude-special-files --include /etc --include /home --include /var/www --exclude /var --include /root --exclude / / root@<servername2>::/home/backups/images2
rdiff-backup --force --remove-older-than 2M root@<servername2>::/home/backups/images2

(Again, I added the --force option to see if it would avoid the regression...)

returned this output:
Previous backup seems to have failed, regressing destination now.
Exception 'CRC check failed' raised of class '<type 'exceptions.IOError'>': File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 302, in error_check_Main
  try: Main(arglist)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 322, in Main
  take_action(rps)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 278, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 341, in Backup
  backup.Mirror_and_increment(rpin, rpout, incdir)
File "/var/lib/python-support/python2.5/rdiff_backup/backup.py", line 51, in Mirror_and_increment
  DestS.patch_and_increment(dest_rpath, source_diffiter, inc_rpath)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 447, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 369, in reval
  if isinstance(result, Exception): raise result

Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in <module>
  rdiff_backup.Main.error_check_Main(sys.argv[1:])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 302, in error_check_Main
  try: Main(arglist)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 322, in Main
  take_action(rps)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 278, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 341, in Backup
  backup.Mirror_and_increment(rpin, rpout, incdir)
File "/var/lib/python-support/python2.5/rdiff_backup/backup.py", line 51, in Mirror_and_increment
  DestS.patch_and_increment(dest_rpath, source_diffiter, inc_rpath)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 447, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 369, in reval
  if isinstance(result, Exception): raise result
IOError: CRC check failed
Fatal Error: Lost connection to the remote system

Seems like the last line is the big issue. Is there any further descriptor to be had for the lost-connection error? (I've tried running rdiff with both -v5 and -v7 levels, but neither seemed to give me any more info on the lost-connection error, and this error does re-occur on each successive run.) This backup data set (241GB) took several tries to get to run properly; however, it did complete successfully on 3/23 (after running for 23 hours). Since then, it has thrown up the 'previous backup seems to have failed, regressing destination' errors each time. I have the network almost to myself this week, so there's not a lot of extra traffic impeding packet flow and no obvious reason for a lost-connection error (i.e., the link has not seemed to go down, at least not that cacti or nagios noticed).

Thanks in advance for any help on either of these.
  ~bob




