Re: [rdiff-backup-users] Regression errors


From: Bob Mead
Subject: Re: [rdiff-backup-users] Regression errors
Date: Tue, 31 Mar 2009 14:26:39 -0700
User-agent: Thunderbird 2.0.0.18 (X11/20081125)

Hello Maarten:
Thanks for your response. I will try to answer your points one at a time, hopefully clarifying what the problems are and steps I have taken to resolve them.

Maarten Bezemer wrote:
Hi Bob,

First, let me say that your situation is not quite like mine. I use rdiff-backup started from the backup server, so I'm doing "pull style" instead of your "push style". Also, I run rdiff-backup as normal user on the backup side and as root on the source side (since I obviously cannot read each user's files as non-root). Rdiff-backup keeps records of metadata separately, so I don't need root at the backup server.

Second, I looked back in my mail archives and found this:
When I call rdiff using this cmd, it locks up the destination server
(console showed 'BUG locked Processor 1 for 11s' messages).
I unfortunately cannot find any document describing this error message. However, it is possibly a kernel message. If so, it might indicate that some of your hardware is dying. Failed CRC checksums mostly come from broken hardware (RAM, CPU, hard drives, hard drive cables, or power supply, to name just a few). There is nothing you can do about that with software.

The 'BUG locked Processor' error was a long time ago and, according to an article Andrew directed me to, it was due to a problem with Ubuntu 8.04. At the time, I ran memtest for some hours on the server that produced the error, and it never failed or found any errors. I have not seen that particular error since, and as a result I am no longer using Ubuntu 8.04.

Third, I re-read some of your emails about your situation and what you've been trying to do. Having missing metadata files also might indicate hardware problems. Or maybe it's something related to kernel versions and data corruption on your file systems. Either way, it's pretty bad.

I do not have missing metadata files that I know of. I mis-typed "current-metadata" files for "current-mirror" files in my most recent post. At Andrew's suggestion, I had adjusted the current_mirror file to indicate a prior time, to 'fool' rdiff-backup into believing that it had not already run. When I did this (by renaming the file with an earlier date), rdiff-backup did run, but complained about not finding the metadata files and said that it would use the filesystem instead. The backup has not run properly since then.
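For what it's worth, the session time rdiff-backup associates with a current_mirror marker is encoded in the marker's file name itself, which is why renaming the file shifts the apparent session time. A minimal Python sketch (the file name here is just an example, not one from my repository):

```python
# The session time lives in the marker file name, e.g.
#   current_mirror.2009-03-23T16:00:00-07:00.data
# so renaming the marker is what changes the apparent session time.
import re
from datetime import datetime

def session_time(marker_name):
    """Parse the session timestamp out of a current_mirror marker name."""
    m = re.match(r"current_mirror\.(.+)\.data$", marker_name)
    if not m:
        raise ValueError("not a current_mirror marker: %s" % marker_name)
    # Ignore the numeric UTC offset at the end and parse the local time part.
    return datetime.strptime(m.group(1)[:19], "%Y-%m-%dT%H:%M:%S")

print(session_time("current_mirror.2009-03-23T16:00:00-07:00.data"))
```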

Before going any further, please make sure you're using reliable hardware on all servers. Check for leaking capacitors on your mainboards or inside the power supply. Next, get a memory testing program (memtest86+ or memmxtest) and run it overnight. (I say overnight because it needs to run for at least a few hours and the server needs to be brought down, which usually isn't possible during day-time hours.) Hard disk diagnostic tools might also come in handy.
If everything turns out to be OK, we can start suspecting software bugs.
Oh, by the way, could you give us the kernel versions of the machines you're using? (Copy/paste the output of "uname -a".) Some kernel versions are known to cause data corruption in certain file system types.
Problem #1:
Origin/source server: Linux 2.6.7-gentoo-r5 #2 SMP Wed Nov 30 12:40:39 PST 2005 i686 Intel(R) Pentium(R) 4 CPU 3.06GHz GenuineIntel GNU/Linux.

Destination/backup server: Linux 2.6.15-51-amd64-server #1 SMP Tue Feb 12 17:08:38 UTC 2008 x86_64 GNU/Linux

Problem #2:
Origin/source server: Linux 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009 i686 GNU/Linux.

Destination/backup server: Linux 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009 i686 GNU/Linux


   You wrote earlier that upgrading or doing just anything with the
   server running rdiff-backup 1.0.4/1.0.5 is out of the question
   because of lack of resources. An alternative might be to first use
   rsync to synchronise your data to another server, and then use
   rdiff-backup from there. That gives you the opportunity to "play
   around" with different rdiff-backup versions without risking a
   "total breakdown" of the primary server.


Again, lack of resources prevents me from doing this on a network-wide basis. I don't have any spare servers to rsync to, and the time it would take to do that and then rdiff that result somewhere else is beyond the carrying capacity of our network and the available time and bandwidth. I am actually working on a buildout of additional servers for placement at each remote site, which will act as local backups, and I will be doing exactly that (rsync to the new local machine, then rdiff from there to the backup server); however, that project may take some months to complete.
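The plan for those new local servers would look roughly like this; a hypothetical sketch only (host names, paths, and the helper are placeholders, not anything from our actual network):

```python
# Hypothetical two-stage scheme: rsync the data to a local staging host
# first, then push increments to the backup server with rdiff-backup.
# SRC, STAGE, and REPO are placeholders, not real paths.
import subprocess

SRC = "/srv/data/"
STAGE = "stagehost:/srv/staging/data/"
REPO = "root@backupserver::/home/backups/data"

def two_stage_backup(run=subprocess.check_call):
    # Stage 1: plain mirror; cheap to re-run and easy to verify.
    run(["rsync", "-aH", "--delete", SRC, STAGE])
    # Stage 2: rdiff-backup keeps the increment history, pushed from staging.
    run(["rdiff-backup", "--print-statistics", "/srv/staging/data", REPO])

# Passing a different `run` lets the commands be inspected without executing:
two_stage_backup(run=lambda cmd: print(" ".join(cmd)))
```

That way, a failed or experimental rdiff-backup run only ever touches the staging copy, never the primary server.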


The things you wrote made me a bit nervous. Like this: "The work around seemed to be the renaming of the current meta-data file to a time prior to the next run of rdiff." Doing such things is very likely to screw up any repository... especially regressions to previous states WILL break when the metadata files are messed up. I've been using rdiff-backup for years now, and not a single time did a regress fail on me.
Based on my traceback results, have the regressions actually failed? All I see are messages about 'regressing destination now'. There never seems to be any message about what happens after that.
And yes, I've had rdiff-backup regress my repo quite often, since ADSL links haven't always been as stable as they are today. Also, I never had to do special things to metadata timestamps or whatever.
See above - I meant to write 'current-mirror files', not 'current-metadata files'; my mistake.

On the other hand, you once mentioned that one of the servers had a clock that was way off. Only recently I saw something on this mailing list about using the clocks of both sides for calculations that should have used the clock of only one side. Maybe you ran into a similar issue that screwed up your repo?

   If you insist on trying to fix this "the software way", I have a
   suggestion for you. The second problem in your email talks about a
   23-hour run of rdiff-backup. Given the size of the backup, I'd say
   that this was an initial run and there aren't some hundreds of
   increments in play here?

From my original post (below): "This backup data (241GB) set took several tries to get to run properly, however it did complete successfully on 3/23 (after running for 23 hours to complete)". Perhaps this is not as clear as I thought. Yes, this is the initial run and no there are not any increments. Your wording here leads me to believe that you think this is an erroneous question, perhaps one that ought not to be answered, at least here, or by you. I am not 'insisting' on anything. I asked the list for help on two particular problems I am having - nothing more. If it turns out that it is not the case that either problem I am having has anything at all to do with software, I am more than happy to look elsewhere to solve the problems. I wish I had the experience to see the 'CRC check failed' and immediately go to 'hardware issue'. Unfortunately, I don't. So I ask questions. I apologize if my asking has upset you.
If so, could you try rsync with the --checksum argument to synchronise the backup to the source, and see if files are being updated that should not have changed based on their modification time stamps? If you see such files, then you're probably just out of luck and need some hardware replaced, either in your computers or in the networking equipment.
Since this is the initial run, every file in the repo counts as changed (all of them are new). I guess I'm not clear on what you want to see here. If I rsync the repo as-is back to the source, what am I going to see? Since there is only one backup, and it is the initial run, how will rsyncing that run back to the source files tell me about changed files?
If you don't see any unchanged files being updated, then we're left with the question of why rdiff-backup sees a failed CRC checksum. If you didn't mess with metadata files on the given repository, we're looking at some data corruption issue.
I haven't messed with any metadata files. The source data is rsynced daily from the server that it is replacing (new-server runs rsync -aH at 11pm daily to synchronize with old-server). Then that rsynced data set is rdiff'd to the backup server (new-server pushes rdiff-backup at 4pm daily). I purposely have the rdiff sessions start before the rsync sessions, to allow rsync to run overnight before the next day's rdiff. Perhaps the data is being corrupted by the rsync process?
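If I understand the --checksum suggestion, it boils down to comparing file contents by hash rather than by size and mtime. A minimal Python sketch of the same idea (the helper names are made up; the paths would be ours):

```python
# Sketch of what rsync --checksum effectively checks: whether file
# contents differ between two trees, regardless of size/mtime.
import hashlib
import os

def file_digest(path, chunk=1 << 16):
    """MD5 of a file's contents, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def differing_files(src_root, dst_root):
    """Yield relative paths whose content differs (or is missing) in dst_root."""
    for dirpath, _, names in os.walk(src_root):
        for name in names:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.exists(dst) or file_digest(src) != file_digest(dst):
                yield rel
```

Files that turn up here despite identical modification times would point at silent corruption somewhere between the two trees.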


As an aside, even if you don't want to rebuild your servers, there are still ways to compile a new version of rdiff-backup. I had to do this once for some clients that didn't want to upgrade from 1.2.2 to 1.2.5 just yet. It turned out to be relatively easy to install python2.4 + librsync + rdiff-backup in my own home directory, and to keep multiple versions in active use by not using the standard python site-packages location but setting some environment variables instead.
I am having enough trouble getting the versions I have to work successfully. None of the errors I am seeing have ever been described as "fixed, upgrade and you will not see these any more". I have seen only one problem that Andrew described as giving a better message in newer versions.


I hope I gave you enough pointers to work with for now. Please report back to the list if you have any news.

Regards,
 Maarten


On Thu, 26 Mar 2009, Bob Mead wrote:

Hello all:
I have a series of rdiff-backups that run every day to back up 10 remote sites and a total of 14 different servers. It seems that each day, at least one of the backups fails. I have been working at getting these to run flawlessly for 6 months, but it seems beyond my grasp. In the last week, I thought I was hot on the trail of a 'perfect run', but now I'm not so sure. For the past few days I have been having trouble with the same two servers' backups. These are push-type backups (as are all my backup jobs), with the remote servers running backup scripts that rdiff to, in this case, two different destination/backup servers.

In the first case: an older Gentoo Linux system (running v1.0.5; dest. has v1.0.4), with the following commands:

rdiff-backup --force --print-statistics --include /etc --include /home --include /var --include /root --exclude / / root@<servername>::/home/backups/dor
rdiff-backup --force --remove-older-than 2M root@<servername>::/home/backups/dor

(I added the --force option to test whether that would clear up the regression problem; it didn't.)

returned this as output:
Previous backup seems to have failed, regressing destination now.
Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in <module>
  rdiff_backup.Main.Main(sys.argv[1:])
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 285, in Main
  take_action(rps)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 255, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 299, in Backup
  backup_final_init(rpout)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 396, in backup_final_init
  checkdest_if_necessary(rpout)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/Main.py", line 911, in checkdest_if_necessary
  dest_rp.conn.regress.Regress(dest_rp)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/connection.py", line 445, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/usr/local/lib/python2.5/site-packages/rdiff_backup/connection.py", line 367, in reval
  if isinstance(result, Exception): raise result
IOError: [Errno None] None: None
Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in ?
  rdiff_backup.Main.Main(sys.argv[1:])
File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 285, in Main
  take_action(rps)
File "/usr/lib/python2.4/site-packages/rdiff_backup/Main.py", line 253, in take_action
  connection.PipeConnection(sys.stdin, sys.stdout).Server()
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 352, in Server
  self.get_response(-1)
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 314, in get_response
  try: req_num, object = self._get()
File "/usr/lib/python2.4/site-packages/rdiff_backup/connection.py", line 230, in _get
  raise ConnectionReadError("Truncated header string (problem "
rdiff_backup.connection.ConnectionReadError: Truncated header string (problem probably originated remotely)

At some point recently (3/20), this backup worked. Then it started to fail, giving 'regressing destination' errors each time it has run since then. This is the same backup I posted about recently, where I had to 'pull the wool' over rdiff's eyes because of a server date malfunction. The work-around seemed to be the renaming of the current meta-data file to a time prior to the next run of rdiff. That seemed to work, in that it didn't complain about too many current mirror files, but it did make rdiff unable to 'see' the metadata file and therefore use the filesystem instead. Perhaps these problems are related? If so, any ideas on how to get it working again would be greatly appreciated. There should be two months of increments stored in the repository, so I don't want to lose those by starting over.
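As I understand it, rdiff-backup decides that a previous session failed, and that a regress is needed, when it finds more than one current_mirror marker in the rdiff-backup-data directory. A simplified sketch of that check (the directory listing below is invented):

```python
# rdiff-backup regresses the destination when more than one
# current_mirror marker is present, i.e. a session never finished.
import fnmatch

def needs_regress(data_dir_listing):
    """Simplified check: more than one current_mirror marker means regress."""
    markers = fnmatch.filter(data_dir_listing, "current_mirror.*.data")
    return len(markers) > 1

listing = [
    "current_mirror.2009-03-20T16:00:00-07:00.data",
    "current_mirror.2009-03-21T16:00:00-07:00.data",  # interrupted session
    "mirror_metadata.2009-03-20T16:00:00-07:00.snapshot.gz",
]
print(needs_regress(listing))  # True: two markers, so a regress is triggered
```

That is why messing with the marker's name can leave the repository looking permanently half-finished.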

The second failed backup is a brand-new install of Ubuntu 8.10 running rdiff-backup v1.1.16, pushing backups to another fresh 8.10 install also running rdiff-backup v1.1.16, using the following commands:

rdiff-backup --force --print-statistics --exclude-special-files --include /etc --include /home --include /var/www --exclude /var --include /root --exclude / / root@<servername2>::/home/backups/images2
rdiff-backup --force --remove-older-than 2M root@<servername2>::/home/backups/images2

(Again, I added the --force option to see if it would avoid the regression...)

returned this output:
Previous backup seems to have failed, regressing destination now.
Exception 'CRC check failed' raised of class '<type 'exceptions.IOError'>': File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 302, in error_check_Main
  try: Main(arglist)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 322, in Main
  take_action(rps)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 278, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 341, in Backup
  backup.Mirror_and_increment(rpin, rpout, incdir)
File "/var/lib/python-support/python2.5/rdiff_backup/backup.py", line 51, in Mirror_and_increment
  DestS.patch_and_increment(dest_rpath, source_diffiter, inc_rpath)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 447, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 369, in reval
  if isinstance(result, Exception): raise result

Traceback (most recent call last):
File "/usr/bin/rdiff-backup", line 23, in <module>
  rdiff_backup.Main.error_check_Main(sys.argv[1:])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 302, in error_check_Main
  try: Main(arglist)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 322, in Main
  take_action(rps)
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 278, in take_action
  elif action == "backup": Backup(rps[0], rps[1])
File "/var/lib/python-support/python2.5/rdiff_backup/Main.py", line 341, in Backup
  backup.Mirror_and_increment(rpin, rpout, incdir)
File "/var/lib/python-support/python2.5/rdiff_backup/backup.py", line 51, in Mirror_and_increment
  DestS.patch_and_increment(dest_rpath, source_diffiter, inc_rpath)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 447, in __call__
  return apply(self.connection.reval, (self.name,) + args)
File "/var/lib/python-support/python2.5/rdiff_backup/connection.py", line 369, in reval
  if isinstance(result, Exception): raise result
IOError: CRC check failed
Fatal Error: Lost connection to the remote system

Seems like the last line is the big issue. Is there any further descriptor to be had for the lost-connection error? (I've tried running rdiff with both -v5 and -v7 levels, but neither seemed to give me any more info on the lost-connection error, and this error does re-occur on each successive run.) This backup data set (241GB) took several tries to get to run properly; however, it did complete successfully on 3/23 (after running for 23 hours). Since then, it has thrown up the 'previous backup seems to have failed, regressing destination' errors each time. I have the network almost to myself this week, so there's not a lot of extra traffic impeding packet flow and no obvious reason for a lost-connection error (i.e., the link has not seemed to go down, at least not that cacti or nagios noticed).

Thanks in advance for any help on either of these.
  ~bob




