bug-coreutils

Re: sort: memory exhausted with 50GB file


From: Leo Butler
Subject: Re: sort: memory exhausted with 50GB file
Date: Sat, 26 Jan 2008 15:05:30 +0000 (GMT)

< Paul Eggert <address@hidden> wrote:
< ...
< > Hmm, it sounds like your input data has some very long lines, then.
< > That would explain at least part of your problem, then.  'sort' needs
< > to keep at least two lines in main memory to compare them: if single
< > input lines are many gigabytes long, then 'sort' must consume many
< > gigabytes of memory, regardless of what parameter you specify with '-S'.
<
< You can run this to find the maximum line length:
<
<   wc --max-line-length your-data



Ok, first, let me thank Jim, Bob and Paul.
Here is the problem in a nutshell:

wc is counting with long ints, and the first line of this 50GB file is a string 
of \0 whose length appears to be negative when counted with long ints. (Details 
below).

I believe that this must be an error in the header file where 'uintmax_t' is
defined. 

I do not know if one can consider this behaviour as a bug in sort, but
it seems to me that sort might issue a warning if it encounters 'n>0' 
consecutive null characters in a file. 

---
I have squeezed out the null characters with tr and am attempting
to sort the transformed file. This has shrunk the file from 50GB to 7GB, so I 
anticipate no problems. I will report back.
---
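
For reference, here is a minimal sketch of that kind of clean-up on a tiny
sample file. The exact tr invocation is not given above, so `tr -d '\000'`
(delete every NUL) is an assumption, and `sample.bin` and `cleaned.txt` are
hypothetical file names used only for illustration.

```shell
# Build a tiny sample: a run of five NUL bytes, then two ordinary lines.
printf '\000\000\000\000\000b\na\n' > sample.bin

# Count the NUL bytes: -c complements the set and -d deletes, so only
# NULs survive; wc -c then counts them. $(( )) strips any padding that
# some wc implementations add.
nul_count=$(( $(tr -cd '\000' < sample.bin | wc -c) ))
echo "NUL bytes: $nul_count"

# Delete the NULs (assumed invocation) and sort the cleaned copy.
tr -d '\000' < sample.bin > cleaned.txt
sort cleaned.txt
```

Counting with `tr -cd ... | wc -c` sidesteps a hand-written counter, so the
result does not depend on the integer type chosen in a custom routine.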

Leo Butler.


Details:
-------
In my original post I mentioned that I had counted the maximum line length:

$ /usr/bin/wc -L /data/espace/k_400_a.out
107

Here is the censored output of a routine that counts the occurrences of all
ASCII characters:

$ ./census /data/espace/k_400_a.out
Ascii char      Count
----------      -----
\0 Null character               -1363090872
(snip)

The longest line was identified at about line 65x10^6, with 108 chars
including \n.

Ouch! Look at that count of \0. The routine was counting with long ints, so I 
recompiled it with unsigned longs, and got

Ascii char      Count
----------      -----
\0 Null character               2931876424
(snip)
Longest line    2931876444 chars at line 1

The counts of \0 are congruent mod 2^32 (that is, ULONG_MAX + 1 when long is
32 bits wide). Apparently, the first line contained roughly 42GB worth of null
characters. I have no bleeding idea how this crept in.
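
The two counts differ by exactly 2^32, which can be checked with shell
arithmetic. This is only a sketch: it assumes a shell whose arithmetic is
64 bits wide (as in bash on most current systems) and that long was 32 bits
on the machine that ran census, which the numbers suggest but which is not
stated above. The variable names are made up for the example.

```shell
signed_count=-1363090872      # census output with long counters
unsigned_count=2931876424     # census output with unsigned long counters
two_pow_32=4294967296         # 2^32, i.e. ULONG_MAX + 1 for a 32-bit long

# The two readings differ by exactly 2^32 ...
echo $(( unsigned_count - signed_count ))    # prints 4294967296

# ... so reading the unsigned value as a signed 32-bit number gives the
# negative count reported first.
echo $(( unsigned_count - two_pow_32 ))      # prints -1363090872
```

The unsigned count has itself wrapped several times for a run of about 42GB,
so neither figure is the true number of null characters; a 64-bit counter
such as uintmax_t would be needed to hold it.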

LB.



