> Are you in the high memory consumption scenario which Nigel describes?
The issue is running parsort on large-scale machines. Running on all cores is often undesirable for memory-intensive workloads; the memory channels eventually become the bottleneck.
The mcesort variant has reached the incubator stage (the code is 100% complete). It supports -j (short option) and --parallel, taking the forms listed below; a sketch of the argument handling follows the list. Naturally, a percentage such as -j1% resolves to no fewer than 1 CPU core.
-jN            integer value
-jN%           percentage value; e.g. -j1% .. -j100%
-jmax, -jauto  same as 100%, i.e. all N available logical cores
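For illustration, here is a minimal shell sketch of how such an option might map to a worker count. The helper name jobs_from_spec and the use of nproc are my own assumptions, not mcesort's actual code.

# Hypothetical helper (not mcesort's code): map a -j argument
# to a worker count, clamping percentages to at least 1 core.
jobs_from_spec() {
    total=$(nproc)
    case "$1" in
        max|auto) echo "$total" ;;
        *%)       pct=${1%\%}
                  n=$(( total * pct / 100 ))
                  [ "$n" -lt 1 ] && n=1
                  echo "$n" ;;
        *)        echo "$1" ;;
    esac
}
# e.g. on a 64-core machine: jobs_from_spec 1%  -> 1
#                            jobs_from_spec 50% -> 32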
The test file is a mockup of randomly generated key-value pairs, 323+ million rows in total; a sketch for generating a comparable file follows the listing below.
$ ls -lh /dev/shm/huge
-rw-r--r-- 1 mario mario 2.8G Feb 18 00:48 /dev/shm/huge
$ wc -l /dev/shm/huge
323398400 /dev/shm/huge
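For anyone wanting to reproduce a comparable input, something along these lines would do. The exact key-value layout of the original mockup is an assumption; only the row count matches.

# A sketch for generating a similar file of random key-value pairs.
# The field format here is assumed, not taken from the original.
awk 'BEGIN { srand()
    for (i = 0; i < 323398400; i++)
        printf "k%09d %d\n", int(rand() * 1e9), int(rand() * 1e6)
}' > /dev/shm/huge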
Using parsort, one cannot specify the number of cores when processing a file argument, so it spawns 64 workers on this machine. The Perl MCE variant performs similarly at 64 workers; however, I get better throughput running 38 workers than 64. A possible affinity-based workaround is sketched after the timings below.
$ time parsort /dev/shm/huge | cksum
3409526408 2910585600
real 0m18.147s
user 0m13.920s
sys 0m3.660s
$ time mcesort -j64 /dev/shm/huge | cksum
3409526408 2910585600
real 0m18.081s
user 2m52.082s
sys 0m10.860s
$ time mcesort -j38 /dev/shm/huge | cksum
3409526408 2910585600
real 0m16.788s
user 2m21.384s
sys 0m8.263s
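Since parsort offers no -j control for file arguments, one possible workaround is to pin the process to fewer cores before it sizes its pool. This is untested speculation on my part; it assumes parsort derives its worker count from nproc, which honors CPU affinity.

# Untested: restrict affinity to 38 cores so nproc reports 38.
taskset -c 0-37 parsort /dev/shm/huge | cksum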
Regarding standard input, I can run parsort using a wrapper script (given at the top of this email thread); a hypothetical stand-in is sketched below, followed by the timings. Notice that parsort likewise gets better throughput running 38 workers.
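One way such a wrapper could work (this is a sketch of my own, not the script from the thread) is to spool stdin to a file and hand that to parsort:

#!/bin/sh
# Hypothetical wrapper (not the one from the thread): spool stdin
# to a temporary file so parsort can operate on a regular file.
tmp=$(mktemp /dev/shm/parsort.XXXXXX) || exit 1
trap 'rm -f "$tmp"' EXIT
cat > "$tmp"
parsort "$@" "$tmp"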
$ time parsort -j64 </dev/shm/huge | cksum
3409526408 2910585600
real 0m19.553s
user 0m14.030s
sys 0m3.520s
$ time mcesort -j64 </dev/shm/huge | cksum
3409526408 2910585600
real 0m18.312s
user 2m42.042s
sys 0m11.546s
$ time parsort -j38 </dev/shm/huge | cksum
3409526408 2910585600
real 0m17.609s
user 0m11.856s
sys 0m3.451s
$ time mcesort -j38 </dev/shm/huge | cksum
3409526408 2910585600
real 0m16.819s
user 2m21.108s
sys 0m9.523s
I find it interesting that the user time reported for parsort is not the tally of all its workers' CPU time.
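One plausible explanation, and only a guess on my part: the shell's time keyword tallies CPU only for children the timed process actually reaps, so work that is re-parented away (for example via a double fork) is never charged to it. A minimal demo of the accounting rule:

# CPU of a reaped child shows up in `user`:
time sh -c 'seq 5000000 | sort > /dev/null'
# Re-parented work does not: `setsid -f` double-forks, the timed
# shell returns immediately, and sort's CPU is charged elsewhere.
time sh -c 'setsid -f sh -c "seq 5000000 | sort > /dev/null"'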
This was a challenge, but I can see the finish line; I hope to wrap it up by next week.