From: | Timothe Litt |
Subject: | Re: tar is creating corrupt archives when soft links are present |
Date: | Thu, 1 Dec 2022 16:14:42 -0500 |
User-agent: | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.5.0 |
I've seen this on Fedora Core 4; the report is on FC 6. (Yes, they're old. But tar is new, built from source downloaded from ftp.gnu.org.)
The disk volume is a newly created (VirtualBox) vdi; 2 partitions, ext3, with the root mounted on hda2. (boot is on hda1).
The file structure was initialized on a newer Linux machine and the archive extracted. It's been a long few days; I don't remember whether it was fc34 or debian... both were involved in putting things back together.
The original reproducer is cut down from about 130G (a 34G compressed archive). There are "only" 107 files in /bin.
Here is the information from your suggestions.
The hard link problem reproduces with this (note the two soft links turning into a soft and a hard(!) link, according to tar):
# ( cd / && ls -li bin/awk bin/bash && tar -cf - bin/awk bin/bash | tar -tvf - )
22683669 lrwxrwxrwx 1 root root 4 Nov 28 08:45 bin/awk -> gawk
22683657 lrwxrwxrwx 1 root root 21 Nov 28 08:45 bin/bash -> ../usr/local/bin/bash
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/awk -> gawk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/bash link to bin/awk
Clearly, bin/bash (a) is not a hard link on disk, and (b) does not link to bin/awk.
The attached "hardlink_strace.txt" comes from a simplified command to reduce volume, but it should show the same syscalls:
( cd / && strace 2>hardlink_strace.txt tar -cf - bin/awk bin/bash >/dev/null )
A full ls -li is in full_ls.txt
In extract_from_tar_archive_showing_extent.txt are the first ~1900 lines of tar -tvf from an archive that merged all the soft links to "vi" when extracted to disk. Note that the listing (a) shows the links as hard links (they were all soft on the original disk), and (b) shows the links as pointing to "bin/ex", when in fact they were extracted as "vi".
To me, this all points to soft links being processed as if they were hard - mostly.
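For anyone trying to reproduce this on another machine, a scratch-directory version of the listing check might look like the sketch below. The directory layout is made up to mimic the two links (bin/awk and bin/bash), and nothing here touches the real /bin. On a tar that behaves correctly I would expect both members to be listed with type "l" and none with type "h".

```shell
#!/bin/sh
# Sketch only: the scratch layout below is hypothetical and merely
# mimics the two soft links from the report.
work=$(mktemp -d)
mkdir -p "$work/bin"
ln -s gawk "$work/bin/awk"
ln -s ../usr/local/bin/bash "$work/bin/bash"

# Archive the two links and list the archive, as in the report.
listing=$( cd "$work" && tar -cf - bin/awk bin/bash | tar -tvf - )
printf '%s\n' "$listing"

# The first character of each listing line is the member type:
# "l" for a soft link, "h" for a hard link.
soft_count=$(printf '%s\n' "$listing" | grep -c '^l' || true)
hard_count=$(printf '%s\n' "$listing" | grep -c '^h' || true)
echo "soft links listed: $soft_count, hard links listed: $hard_count"

rm -rf "$work"
```

The links are deliberately dangling in the scratch directory; tar archives a soft link without following it, so the targets do not need to exist.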
Going further with the toy example, we see that while tar reports the links as hard, they are extracted as soft, but with the wrong target for the second link.
foo]# ( cd / && tar -cf - bin/awk bin/bash | tar -C /root/foo -xvf - )
bin/awk
bin/bash
foo]# ls -li bin     !! This is bin extracted from the archive
total 0
17418579 lrwxrwxrwx 2 root root 4 Dec 1 15:23 awk -> gawk
17418579 lrwxrwxrwx 2 root root 4 Dec 1 15:23 bash -> gawk
foo]# ls -li /bin/awk /bin/bash     || This is the bin that was archived
22683669 lrwxrwxrwx 1 root root 4 Nov 28 08:45 /bin/awk -> gawk
22683657 lrwxrwxrwx 1 root root 21 Nov 28 08:45 /bin/bash -> ../usr/local/bin/bash
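In the listing of the extracted bin above, awk and bash show the same inode number (17418579) and a link count of 2. A check along the following lines, again in a hypothetical scratch directory, compares both the link targets and the inode numbers after a round trip through tar. On a tar that works correctly I would expect the targets to survive unchanged and the two inode numbers to differ.

```shell
#!/bin/sh
# Sketch only: src/ and dest/ are hypothetical scratch directories.
work=$(mktemp -d)
mkdir -p "$work/src/bin" "$work/dest"
ln -s gawk "$work/src/bin/awk"
ln -s ../usr/local/bin/bash "$work/src/bin/bash"

# Round trip: archive from src, extract into dest.
( cd "$work/src" && tar -cf - bin/awk bin/bash ) | tar -C "$work/dest" -xf -

# readlink prints the target stored in the soft link itself.
awk_target=$(readlink "$work/dest/bin/awk")
bash_target=$(readlink "$work/dest/bin/bash")

# ls -id prints the inode of the link itself, not of its target.
awk_inode=$(ls -id "$work/dest/bin/awk" | awk '{print $1}')
bash_inode=$(ls -id "$work/dest/bin/bash" | awk '{print $1}')

echo "awk  -> $awk_target (inode $awk_inode)"
echo "bash -> $bash_target (inode $bash_inode)"

rm -rf "$work"
```

If the two inode numbers come out equal, the extracted names are two directory entries for one and the same soft link, which would also explain why they share a target.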
To close the shell wildcard lead: if we now use (shell) wildcards, which pick up a couple of extra files, note that the bash link (to ../usr/local...) is still extracted as a soft link to gawk.
Here's the modified test case:
foo]# ( cd / && tar -cf - bin/aw* bin/bas* | tar -C /root/foo -xvf - )
bin/awk
bin/basename
bin/bash
bin/bash.old
foo]# ls -li bin
total 732
17418579 lrwxrwxrwx 2 root root 4 Dec 1 15:32 awk -> gawk
17418580 -rwxr-xr-x 1 root root 18484 Oct 31 2007 basename
17418579 lrwxrwxrwx 2 root root 4 Dec 1 15:32 bash -> gawk
17418581 -rwxr-xr-x 1 root root 722684 Jul 12 2006 bash.old
An strace of the above, in strace_wild.txt, was obtained as shown below (the inode #s are different).
foo]# ( cd / && ls -li bin/aw* bin/bas* && strace 2>/root/strace_wild.txt tar -cf - bin/aw* bin/bas* >/dev/null )
22683669 lrwxrwxrwx 1 root root 4 Nov 28 08:45 bin/awk -> gawk
22683748 -rwxr-xr-x 1 root root 18484 Oct 31 2007 bin/basename
22683657 lrwxrwxrwx 1 root root 21 Nov 28 08:45 bin/bash -> ../usr/local/bin/bash
22683691 -rwxr-xr-x 1 root root 722684 Jul 12 2006 bin/bash.old
foo]# ls -li bin/
total 732
17418579 lrwxrwxrwx 2 root root 4 Dec 1 15:32 awk -> gawk
17418580 -rwxr-xr-x 1 root root 18484 Oct 31 2007 basename
17418579 lrwxrwxrwx 2 root root 4 Dec 1 15:32 bash -> gawk
17418581 -rwxr-xr-x 1 root root 722684 Jul 12 2006 bash.old
Also, while I didn't keep the build directory for tar, I did keep the configure cache file, which may be helpful.
Not sure if I can recover what's left of the original disk; will try if necessary. But I think this work has cut the problem down.
a) tar is confused about soft links.
b) It is reporting soft links as hard in -t output, but extracting them as soft.
c) The extract uses the wrong target in the soft link: the target of the first soft link that it sees.
# uname -a
Linux 2.6.22.14-100 #1 SMP Wed Apr 8 18:07:54 EDT 2015 i686 i686 i386 GNU/Linux
Finally, an unrelated (except that it hit this incident and prevented an easy restore) issue: tar skips some large files with
tar: root/sd/sd.tar.gz: Cannot stat: Value too large for defined data type
-rw-r--r-- 1 root root 32251081571 May 6 2007 /root/sd/sd.tar.gz
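For what it's worth, "Value too large for defined data type" is the standard error text for EOVERFLOW, which stat() returns when a field such as the file size does not fit in the type the calling program was built with. The arithmetic below only shows that the size of sd.tar.gz is beyond what a signed 32-bit file offset can hold. Whether that is the actual cause for this particular tar build is an assumption, not something verified here.

```shell
#!/bin/sh
# Size in bytes, taken from the ls output above.
size=32251081571

# Largest value a signed 32-bit offset can hold: 2^31 - 1.
max32=$(( (1 << 31) - 1 ))

if [ "$size" -gt "$max32" ]; then
    fits=no
else
    fits=yes
fi
echo "size=$size max32=$max32 fits_in_signed_32_bits=$fits"
```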
Let me know if I can provide further information. I appreciate the attention. Thanks!
Timothe Litt
ACM Distinguished Engineer
--------------------------
This communication may not represent the ACM or my employer's views, if any, on the matters discussed.
Thanks for reporting the problem. I'm not seeing the problem with GNU tar 1.34 as shipped with Ubuntu 22.10 x86-64. On this platform, the command:
cd /
tar -cf - bin/* | tar -tvf - >/tmp/tar.txt
outputs the attached file tar.txt, which looks OK, as it seems to match the output of the command 'cd /; ls -li bin/* >/tmp/ls-i.txt' which is attached. This is on an ext4 file system. (All the attachments are compressed with gzip.)
What would help to debug here is a smaller reproducer. Can you reproduce it with a smaller command like this?
tar -cf - bin/awk bin/bash
In other words, make it as small as you can.
Also, even if you can't make it small, it'd be helpful to see the strace output so that we can see the information that tar is basing its decisions on. For example, I ran this command:
strace -v --trace %%stat -o /tmp/tar-tr.txt tar -cf /dev/null bin/*
and got the attached file tar-tr.txt to see what the stat-like syscalls are yielding; can you do something similar?
Also, can you send the output of 'ls -il bin/*'? The inode numbers would be helpful for debugging, I expect.
link_info.tar.gz
Description: GNU Zip compressed data
OpenPGP_signature
Description: OpenPGP digital signature