bug-coreutils

bug#16004: Multicore Core-utils


From: Pádraig Brady
Subject: bug#16004: Multicore Core-utils
Date: Fri, 29 Nov 2013 23:05:54 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 11/29/2013 10:18 PM, CDR wrote:
> Dear friends
> 
> In case this email is read by Richard M. Stallman and David MacKenzie.
> I need a multi-core version of "comm" and "join". The current version
> only uses one core and it takes hours to process two files, with 4
> columns and 510 million lines. I need to process those files every
> night.
> 
> I wonder if any  plan exists to jump to multicore. If not, is there a
> volunteer that can do the job, for a reasonable fee? I am one-man
> company but I guess we all need a parallel-processing-capable
> core-utils.

Note that comm and join need sorted input, and sort(1)
is already multicore aware.  Since sorting implicitly needs
to read all of the input before generating any output,
it makes sense for sort(1) to handle the parallelism itself.
Also the sorting operation itself is relatively expensive
compared to the corresponding I/O involved, which
further justifies the multicore knowledge within sort(1).
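
As a minimal sketch of that division of labour, assuming GNU sort and
using small made-up sample files in place of the real data, the sort
step can be given an explicit thread count while comm(1) does a single
sequential pass over the sorted results.  The file names and the choice
of "$(nproc)" threads are illustrative assumptions only:

```shell
# Hypothetical sample inputs, created here so the sketch is self-contained.
workdir=$(mktemp -d)
printf '%s\n' c a b > "$workdir/a.txt"
printf '%s\n' b d a > "$workdir/b.txt"

# GNU sort chooses a thread count by default; --parallel sets it explicitly.
sort --parallel="$(nproc)" -o "$workdir/a.sorted" "$workdir/a.txt"
sort --parallel="$(nproc)" -o "$workdir/b.sorted" "$workdir/b.txt"

# comm only needs one sequential pass over the two sorted files.
# -12 suppresses columns 1 and 2, leaving the lines common to both.
common=$(comm -12 "$workdir/a.sorted" "$workdir/b.sorted")
printf '%s\n' "$common"
```

With real data the interesting knobs are likely to be --parallel and
the buffer size (-S), which would need tuning for the machine at hand.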

So if you're dealing with an already sorted file,
performance then often depends on the I/O for that file,
which could be the bottleneck.  For example if your data file
that "takes hours to process" was on a mechanical hard disk,
then processing with a single thread/process is probably best,
as otherwise multiple ones would just keep seeking the disk head
and slow things down.  The increasing prevalence of SSDs
changes the game here though, so that separate accesses
to the same file could very well be a win.

BTW you haven't said whether you're I/O or CPU bound.
I presume you're CPU bound given you're mentioning multicore,
which is a little surprising given the relatively inexpensive
operations done within comm(1) and join(1).
It's worth mentioning locales here, because if you don't
need the relatively expensive locale matching rules,
you can disable those before a run by setting:
  export LC_ALL=C
If that did change things to be I/O bound again then
you might consider putting each file on separate devices,
to gain from parallel I/O operations.
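
A small sketch of the locale suggestion, using made-up sample files.
The one thing to be careful about is that sort(1) and join(1) run
under the same locale, since join expects its input ordered by the
collation it is using itself:

```shell
# Use plain byte-wise comparison for everything that follows.
export LC_ALL=C

# Hypothetical sample inputs: a key in field 1, a value in field 2.
workdir=$(mktemp -d)
printf '%s\n' '2 b' '1 a' '3 c' > "$workdir/left.txt"
printf '%s\n' '3 z' '1 x'       > "$workdir/right.txt"

# Sort both files on the join field under the same locale as join.
sort -k1,1 -o "$workdir/left.sorted"  "$workdir/left.txt"
sort -k1,1 -o "$workdir/right.sorted" "$workdir/right.txt"

# Join on the first field of each file (the default).
joined=$(join "$workdir/left.sorted" "$workdir/right.sorted")
printf '%s\n' "$joined"
```

Files that were sorted earlier under a different locale may need
re-sorting before this is safe to rely on.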

So if you're still CPU bound, a more general technique to consider
is splitting up the file to be processed by separate _processes_.
Now this is more suited to tools that don't depend on
the relative order of particular lines, which unfortunately
comm(1) and join(1) do, but perhaps there is some way you
could split your data into more files when generating it,
which could then be fed to separate join(1) processes.
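
One possible shape for that, as a rough sketch only: partition both
inputs on the join key so that equal keys always land in the same
numbered partition, then run one join(1) per partition in the
background.  The numeric keys, the "key modulo N" partitioning rule,
the file names and the partition count are all assumptions made for
illustration, not something taken from your data:

```shell
export LC_ALL=C
workdir=$(mktemp -d)
parts=2   # hypothetical number of partitions / parallel join processes

# Hypothetical sample inputs with a numeric key in field 1.
printf '%s\n' '1 a' '2 b' '3 c' '4 d' > "$workdir/left.txt"
printf '%s\n' '2 x' '3 y' '5 z'       > "$workdir/right.txt"

for side in left right; do
  # Pre-create every partition so an empty one still exists as a file.
  i=0
  while [ "$i" -lt "$parts" ]; do
    : > "$workdir/$side.$i"
    i=$((i + 1))
  done
  # Equal keys give the same remainder, so they share a partition.
  awk -v n="$parts" -v prefix="$workdir/$side" \
    '{ print > (prefix "." ($1 % n)) }' "$workdir/$side.txt"
done

# One background subshell per partition: sort both sides, then join.
i=0
while [ "$i" -lt "$parts" ]; do
  (
    sort -k1,1 -o "$workdir/left.$i"  "$workdir/left.$i"
    sort -k1,1 -o "$workdir/right.$i" "$workdir/right.$i"
    join "$workdir/left.$i" "$workdir/right.$i" > "$workdir/out.$i"
  ) &
  i=$((i + 1))
done
wait

# Each partition's output is sorted, but the concatenation is not.
result=$(cat "$workdir"/out.* | sort -k1,1)
printf '%s\n' "$result"
```

Whether this wins anything depends on the partitioning being cheap
relative to the join, which is why doing the split when the data is
generated is the more attractive option.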

thanks,
Pádraig.






