On 2020-11-11 00:39, I. Hope Nothing wrote:
> I have a large (183 GB) .tar file that has become corrupted.
> [...]
> There's a lot of binary data I want to keep on here. I am willing and
> keen to learn how to forensically retrieve my data, and I would
> greatly appreciate any help pointing me in the right direction. Thank
> you for reading this far already!!
Here are some simple hints for attempting a manual rescue.
Thank you for your answer. I have been thinking about what you wrote, and about other things that came to mind even before I posted. I am well aware of the magnitude and complexity of this task.
In the best-case scenario, I am hoping that whatever solution I come up with can be made into a generalized "damaged .tar file fixer-upper".
1. If possible, obtain a less corrupted copy of the tar file.
For example, if it was corrupted when extracting it from a tape
over ssh or rlogin, try extracting it again using a binary-safe
protocol. Similarly if it was corrupted after decompressing with
gzip, bzip2 or any other such tool, try decompressing again.
This is not possible, unfortunately :-(
2. Try to obtain a dos2unix implementation that doesn't try to be
"smart". Basically, you need to do a binary search-and-replace from
\r\n to \n while leaving alone any other bytes with the value 13.
This will still lose any \r\n sequence that was in the original
data, but there will probably be less corruption than in the file
that was erroneously subjected to the opposite search-and-replace.
Sure. Assuming that what has happened is what I think probably did happen...
I think that line endings got mangled during a botched FTP transfer. More worryingly, for some reason there is also a "Password:" prompt as Line 1, which concerns me because I wonder if there is more to the damage than simply mangled line endings, e.g., perhaps STDOUT or STDERR got redirected somewhere it shouldn't have.
Please correct me if I'm wrong, but in the simplest assumed case, the one potentially irreversibly mangled case is where there was a 13 ('\r') that was NOT at the end of a text line but was part of binary data and was therefore converted to 13 10 ('\r' '\n'). Here I need to implement my own forensic logic, possibly based on probabilistic methods, so that this gets converted back to plain 10 ('\n'), where the correctness of this conversion is judged by:
- the .tar file listing its contents without error;
- the archive extracting without error; and
- the file containing that conversion-and-reverse-conversion apparently functioning properly after extraction (there are many ways this can be tested).
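To make step 2 concrete, here is a minimal Python sketch of a "dumb" binary-safe replacement, assuming the damage really is limited to \n having been turned into \r\n. The function name `crlf_to_lf` and the chunk size are my own choices, not taken from any existing tool. It streams the input in chunks so that a 183 GB archive never has to fit in memory, and it carries a trailing \r over to the next chunk so that a \r\n pair split across a chunk boundary is still caught.

```python
def crlf_to_lf(src, dst, chunk_size=1 << 20):
    """Copy src to dst, replacing each b'\\r\\n' with b'\\n'.

    Any b'\\r' that is not directly followed by b'\\n' is copied unchanged.
    src and dst are file objects opened in binary mode.
    """
    pending_cr = False  # True if the previous chunk ended with a held-back b'\r'
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if pending_cr:
            # Re-attach the held-back byte so a split b'\r\n' pair is seen whole.
            chunk = b"\r" + chunk
            pending_cr = False
        if chunk.endswith(b"\r"):
            # Hold the last byte back until we know what follows it.
            chunk = chunk[:-1]
            pending_cr = True
        dst.write(chunk.replace(b"\r\n", b"\n"))
    if pending_cr:
        # The input ended with a lone b'\r'; keep it.
        dst.write(b"\r")
```

It would be called with the damaged archive opened as "rb" and a new output file opened as "wb", so the damaged original stays untouched for further attempts.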
3. Look up the tar file format specifications, it is actually a
relatively simple file format and you will need to understand it
to do the manual data rescue. In particular, you will need to
understand the PAX and GNU extensions to the format.
This is one of the first things that came to my mind. So far I know of the following sources of information:
- The GNU Tar documentation and source code
- Schily's star documentation and source code
**If you know of any other source code or specifications I should be aware of, please let me know.**
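As a first exercise with the format, here is a small Python sketch that checks whether a 512-byte block looks like a tar header. The field offsets (name at 0, size at 124, checksum at 148, type flag at 156, magic at 257) follow the ustar header layout; the function names and the returned dictionary are my own invention. It assumes the checksum is the sum of all header bytes with the checksum field itself counted as eight spaces, and it handles only the plain octal size field plus the positive GNU base-256 form.

```python
BLOCK = 512  # tar headers and data are laid out in 512-byte blocks


def parse_octal(field):
    """Decode a NUL- or space-terminated octal field; an empty field counts as 0."""
    text = field.split(b"\0", 1)[0].strip(b" ")
    return int(text.decode("ascii"), 8) if text else 0


def header_checksum(block):
    """Sum all header bytes, treating the checksum field (148..155) as spaces."""
    return sum(block[:148]) + 8 * 0x20 + sum(block[156:BLOCK])


def parse_header(block):
    """Return basic header fields, or None if the block fails the checksum test."""
    if len(block) < BLOCK:
        return None
    try:
        stored = parse_octal(block[148:156])
        size_field = block[124:136]
        if size_field[0] == 0x80:
            # GNU base-256 encoding, used when the size does not fit in octal.
            size = int.from_bytes(size_field[1:], "big")
        else:
            size = parse_octal(size_field)
    except ValueError:
        return None  # non-octal garbage in a numeric field
    if stored != header_checksum(block):
        return None
    return {
        "name": block[0:100].split(b"\0", 1)[0],
        "size": size,
        "typeflag": block[156:157],
        "magic": block[257:263],
    }
```

A header that was itself hit by the line-ending damage will fail this check, which is useful information in its own right: it marks the spots where the header has to be repaired before its size field can be trusted.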
4. Using a binary file viewer, look for the tar header that marks
the start of a much wanted file. Then look for the tar header
of the next file in the archive. The bytes between the two
headers are supposed to be your file contents and the header
before the contents should give the number of bytes in the
uncorrupted file. If you did step 2 above, the actual data
will probably be slightly too short due to too many removed \r
characters, or due to the terminal protocol also removing some
other bytes.
What binary file viewers do you recommend?
Up until now the only binary file viewers I've used were `od` and `hexl-mode` in Emacs, and casually at that. If you have better suggestions, I'd appreciate it!!
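Step 4 can also be partly automated. The sketch below, with hypothetical names `find_header_candidates` and `report_gaps`, looks for the "ustar" magic string that POSIX and GNU headers carry at offset 257, then compares the distance to the next header with the size declared in the header, rounded up to a whole 512-byte block. It assumes every member has a "ustar"-style header (old V7 headers have no magic and would be missed) and that the string does not occur inside file contents; any such false hit would need to be filtered out by hand. It uses only `find` and slicing, so it should work on an `mmap` of the archive as well as on `bytes`.

```python
def find_header_candidates(data):
    """Yield offsets of possible tar headers, located via the 'ustar' magic at +257."""
    pos = data.find(b"ustar", 257)
    while pos != -1:
        yield pos - 257
        pos = data.find(b"ustar", pos + 1)


def report_gaps(data):
    """Return (header_offset, declared_size, deviation) for each member but the last.

    deviation is the number of bytes actually found between this header and the
    next one, minus the declared size padded to 512 bytes. Zero means the
    member has the expected length; negative means bytes are missing.
    """
    offsets = list(find_header_candidates(data))
    report = []
    for start, following in zip(offsets, offsets[1:]):
        field = bytes(data[start + 124:start + 136]).split(b"\0", 1)[0].strip(b" ")
        try:
            declared = int(field.decode("ascii"), 8) if field else 0
        except ValueError:
            continue  # size field is damaged or not plain octal; inspect by hand
        padded = -(-declared // 512) * 512  # round up to a whole block
        actual = following - (start + 512)
        report.append((start, declared, actual - padded))
    return report
```

Members with a non-zero deviation are the ones that need the manual attention described in steps 4 and 5; the rest can probably be extracted as they are.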
5. Use knowledge of your actual file format to figure out where
an \r was probably lost and use the correct file length from
the tar header as a cross check of your efforts.
See my reasoning above about sanity-checking the results of step 2.
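For small files, the cross-check in step 5 could be brute-forced. The following sketch is only an illustration under strong assumptions: that the only damage left after step 2 is \r\n pairs that were shortened to \n, and that the number of missing bytes is small enough to enumerate. The function name `restore_candidates` and the `max_candidates` limit are hypothetical. Given a member's bytes and the length it should have, it yields every variant in which the right number of \n bytes get a \r put back in front of them; each candidate would then have to be judged by format-specific knowledge (an internal checksum, a parser, or simply opening the file).

```python
from itertools import combinations


def restore_candidates(data, expected_len, max_candidates=100000):
    """Yield versions of data with b'\\r' re-inserted before some b'\\n' bytes.

    Exactly (expected_len - len(data)) bytes are re-inserted per candidate, so
    every candidate has the expected length. Nothing is yielded if data is
    already longer than expected.
    """
    missing = expected_len - len(data)
    if missing < 0:
        return
    newline_positions = [i for i, byte in enumerate(data) if byte == 0x0A]
    for count, chosen in enumerate(combinations(newline_positions, missing)):
        if count >= max_candidates:
            return  # the search space is too large to enumerate blindly
        out = bytearray()
        prev = 0
        for i in chosen:
            out += data[prev:i] + b"\r"  # copy up to the b'\n', then add b'\r'
            prev = i
        out += data[prev:]
        yield bytes(out)
```

The number of combinations grows very quickly with the number of \n bytes and missing bytes, so for large binary files this would need to be replaced by something smarter, such as the probabilistic approach mentioned above.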
6. Repeat steps 4 and 5 for each file.
Yes.
Good luck, you will need it.
Thanks again Jakob. As I mentioned before, I'm hoping that something positive can come out of this forensic work.
Kind regards,
I. Hope