Re: Multi-threading in sort (or coreutils)


From: Pádraig Brady
Subject: Re: Multi-threading in sort (or coreutils)
Date: Fri, 13 Jun 2008 17:15:32 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Bo Borgerson wrote:
> address@hidden wrote:
>> Hello
>> A few minutes ago I used sort -u to sort a big file (236 MB). I have a
>> 2-core CPU (Core 2 Duo), but I found that sort only uses one CPU (works
>> in one thread). I think it would be a good idea to add an option (or
>> make it the default) to sort in threads, to increase performance on
>> systems that can execute more than one thread in parallel.
>>    Klimentov Konstantin.
> 
> Hi,
> 
> If you're using a shell that supports process substitution, you could
> try splitting your file in half and feeding the bulk sorts of each
> half as inputs to a merge:
> 
> So if you were doing:
> 
> $ sort bigfile
> 
> You could do:
> 
> $ sort -m <(sort bigfile.firsthalf) <(sort bigfile.secondhalf)

A few notes about that:

1. `LANG=C sort ...` may be appropriate for your data, and should
be much quicker than using the collating routines for your locale.
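
For example, you could compare the two directly (times will depend on
your data and locale, and LANG=C only takes effect if no LC_ALL or
LC_COLLATE setting overrides it):

$ time sort -u bigfile > /dev/null
$ time LANG=C sort -u bigfile > /dev/null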

2. How to split a text file into chunks correctly is not
obvious to me. Here is a little snippet that might work:

file="$1"; chunks=2
size=$(find "$file" -printf %s)
line_fuzz=80 #to avoid single line for last chunk
chunk_max=$((($size+$line_fuzz)/$chunks))
split --line-bytes=$chunk_max "$file"
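
With two chunks that should give files xaa and xab (split's default
output names), which you could then feed to the merge as above:

$ sort -m <(sort xaa) <(sort xab)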

3. On traditional hard disks, the file chunks should be on separate
spindles. I.e. if both sort processes are reading off the same
spindle, they will be fighting over the disk heads and just slow
things down a lot.

4. If processing from a solid state disk, or doing multiple runs
from cache etc., it will probably be quicker to process the portions
of the file directly, without having to split them up first.
Here is a snippet to process (approximately) each half of
a text file directly:

size=$(find "$1" -printf %s)
half=$(($size/2))
# bytes from the midpoint through the end of the line that straddles it
next=$(dd if="$1" bs=1 skip=$half 2>/dev/null | sed q | wc -c)
chunk2_s=$(($half+$next))  # byte count of chunk 1; chunk 2 starts just after
#note head will read in blocks of 8192
sort -m <(head -c $chunk2_s "$1" | sort) \
        <(tail -c +$(($chunk2_s+1)) "$1" | sort)
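
As a quick sanity check that the two chunks cover the file exactly,
you can compare the merged result against a plain sort (this reads
the whole file twice, so it's only useful for verification):

cmp <(sort "$1") \
    <(sort -m <(head -c $chunk2_s "$1" | sort) \
              <(tail -c +$(($chunk2_s+1)) "$1" | sort))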

5. I think it would be nice for dd to support reading portions of
a file efficiently. As far as I can see, it can currently only do so
by reading 1 byte at a time. Perhaps skip_bytes=x and count_bytes=x
would be useful additions?
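
In the meantime one workaround, assuming a seekable input (file,
offset and len below are just placeholder values), is to use one dd
purely to seek, and then copy the rest in large blocks:

offset=12345; len=1000000
{ dd bs="$offset" skip=1 count=0  # seek: skip one $offset-byte block, copy nothing
  head -c "$len"                  # then copy $len bytes in block-sized reads
} < file 2>/dev/null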

cheers,
Pádraig.



