
From: Bob Proulx
Subject: Re: cut -b on huge files
Date: Thu, 9 Oct 2008 11:08:12 -0600
User-agent: Mutt/1.5.13 (2006-08-11)

Klein, Roger wrote:
> I have problems when too many things get mixed in one mail, so
> please let me sort things out a little. I see three threads now:
> 
> 1) the problem I had and was trying to solve: when copying a sparse
> file onto Windows, the resulting file occupies the full apparent size
> on the disk; I solved that by simply gzipping the files in Linux
> before copying them (thanks for pointing out that cut will truncate
> the file!). Unfortunately gunzip will not recreate the sparse file
> but a flattened file occupying the full space.

That is correct.  Very few tools are written to take advantage of
sparse files and optimize away the disk space.  Being able to do this
is something of a trick.
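
For what it is worth, GNU cp can re-create the holes after the fact.
This is only a sketch, with placeholder file names, but something like
the following should recover most of the space in the gunzipped file:

  cp --sparse=always flattened resparsed
  ls -ls flattened resparsed

With --sparse=always cp scans the input for runs of zero bytes and
turns them back into holes in the output file.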

I have to wonder why the original file was sparse in the first place.
Was that intentional, or just an accident?

> 2) the reason I wrote to 'address@hidden' was that, no matter
> whether the source is a sparse file or not, the command
> cut -b 1-58101760 file1 > file2
> should never create a file2 with the sizes
> Size: 309987280       Blocks: 606048
> This seems to me like a bug in cut.

58101760 is not equal to 309987280, and I would never expect that
using cut with 58101760 as an argument would produce a file of size
309987280.  58101760 is much smaller than 309987280, right?  And as
Pádraig pointed out in his message, the byte count resets with each
newline in the file, so using cut with a byte range like 1-58101760
is unlikely to do what you want in the presence of newline characters.
'cut' just isn't the tool for the job here.  If you want to flatten
the file you should use other tools:

  cp --sparse=never file1 file2
  cat file1 > file2
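
To see the per-line behavior Pádraig described, here is a small
demonstration (the file name is just an example):

  printf 'abcdef\nghijkl\n' > demo
  cut -b 1-3 demo

This prints "abc" and then "ghi": the byte range is applied to every
line individually, not to the file as a whole.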

And I don't think you should be truncating the file at all.

> Could you inform the authors of that tool?

The maintainers of the 'cut' program read the bug-coreutils mailing
list.  By sending messages and discussing the issues there you have
contacted them.

> 3) your explanation and suggestions raise a few questions:
> 
> > For example you can use dd to create a sparse file:
> > 
> >   dd bs=1 seek=1G if=/dev/null of=big
> > 
> > That will have an apparent size of 1G but will actually consume almost
> > no actual disk space.
> 
> Well, I did a stat on that file: it shows 0 occupied blocks. Shouldn't
> it occupy at least one block to hold the single zero byte that was
> read and written?

No bytes were actually written to that file; reading from /dev/null
yields nothing, so dd only seeks.  The file really is entirely sparse,
with nothing but implied zero data in it.
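
You can verify that yourself.  Assuming GNU stat and du, something
like this should show a 1G apparent size with zero allocated blocks:

  dd bs=1 seek=1G if=/dev/null of=big
  stat -c 'apparent size: %s bytes, allocated: %b blocks' big
  du -h big
  du -h --apparent-size big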

> If I do the same with a one byte file that I read from:
> echo -n A > oneA
> dd bs=1 seek=1M if=oneA of=bigA
> stat bigA
>   File: `bigA'
>   Size: 1048577         Blocks: 16         IO Block: 4096   regular file
> It shows the expected apparent length of 1M + 1 byte. But why does it
> occupy 16 blocks for the one byte?

This will consume more than one byte, or even one disk block, of
actual disk space.  It will take several blocks, and each type of
filesystem may use a different number of blocks here.  On mine it
shows 8 blocks consumed.
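
One detail that helps here: the Blocks figure that stat prints counts
512-byte units, regardless of the filesystem's block size.  Assuming
GNU stat, the %B format sequence confirms the unit:

  stat -c 'allocated: %b blocks of %B bytes each' bigA

So your 16 blocks are 8 KiB, i.e. two 4 KiB filesystem blocks: one
holds the single data byte, and the other is most likely filesystem
metadata (an indirect block) needed to map data at that offset.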

Different storage schemes in different filesystems will require
different amounts of actual disk space.  If you want to try appending
data to the file as an additional experiment:

  dd bs=1 seek=1G if=/dev/null of=big
  ls -log big
  du big
  echo >> big
  ls -log big
  du big

Using the >> append operator adds a newline to the end of the file.
The filesystem now has to store real data after the gap, which
typically means allocating not just a data block but also the metadata
(e.g. indirect blocks) needed to map data at such a large offset.
This shows 12 blocks consumed on my system.
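
If you are curious which parts of the file actually have blocks
allocated behind them, filefrag from e2fsprogs can map them (assuming
an ext2/ext3 filesystem and that filefrag is installed):

  filefrag -v big

Only the region around the appended newline should show up; the hole
before it has no blocks at all.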

> > Try comparing the two files.
> >   cmp boot_image.clone2fs boot_image.clone2fs_correct
> > If they don't compare then I believe that you have corrupted the file.
> 
> They don't compare so the file is corrupt, indeed.

I think trying to "correct" the size of these files is throwing you
off.  If you had never looked at the sizes everything would be okay,
right?  The files would copy fine and everything would operate
normally.  So I think this is just a good lesson about sparse files,
and I wouldn't worry about it further.  Of course I am sure there was
something going on that got you looking down this path.  That would be
a different problem for a different day. :-)

Bob



