
Re: Cannot specify the number of threads for parsort


From: Mario Roy
Subject: Re: Cannot specify the number of threads for parsort
Date: Thu, 9 Mar 2023 04:24:43 -0600

Hi folks. I would like to share some tips for parsort, should the author wish to implement them. Please feel free to ignore.

1. Some folks specify the -S SIZE option. That may slow down parsort. To minimize the slowdown, give the -S option only to the first stage, not to the 2nd stage (i.e. sort -m). Currently, mcesort does not accept the -S option, but I tried it during development, which is how I learned not to pass -S to sort -m; doing so further degrades performance. (See the first sketch after this list.)
2. On large-scale systems, limiting the number of processor threads available to the bash process running the merge (sort -m) improves performance, because the memory channels otherwise become the bottleneck. On Linux, mcesort limits the bash command handling the merge to a maximum of 32 processor threads via taskset (also shown in the first sketch after this list).
3. Running parsort with thousands of files as arguments crashed my system twice. I had no idea at the time that it spawned a process per file argument. We now know to cat the list of files and have parsort read STDIN instead. mcesort never spins up more workers than requested, whether given STDIN, hundreds or thousands of files as arguments, or a single file.
4. The --files0-from option is broken in parsort. Give parsort --files0-from a try: it is unable to open the files (relative or full paths). (See the second sketch after this list.)
5. Something else I tried is mcesort --tally="tallycmd [options]". It is beneficial for reducing duplicate keys, though the count field must then be tallied. The effect is less work for the subsequent sort -m.
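
A minimal sketch for tips 1 and 2 (the file names, buffer size, and CPU list are illustrative, not mcesort's actual values): -S goes to the first-stage sorts only, and the merge stage is bounded with taskset.

  # -S on the first-stage sorts only; the merge stage gets no -S
  sort -S 512M -k1 chunk.0 > part.0 &
  sort -S 512M -k1 chunk.1 > part.1 &
  wait
  # bound the merge to CPUs 0-31 to stay within memory-channel limits
  taskset -c 0-31 sort -m -k1 part.0 part.1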
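
And for tip 4, the intended usage mirrors coreutils sort's --files0-from (the find path is illustrative); it is this form that fails in parsort:

  find /dev/shm -name 'big*' -print0 | parsort --files0-from=- -k1 | cksum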

Extract from https://perlmonks.org/?node_id=11150872

H) 552 big files (6 * 92), Unix sort command
   GNU parallel parsort, mcesort, and tally-count

   cat many files | parsort -k1 | tally-count | parsort -k2nr | cksum
   cat many files | mcesort -k1 | tally-count | mcesort -k2nr | cksum

   parsort:   2m25.776s  GNU Parallel
   mcesort:   2m 3.140s  MCE variant

I) 552 big files (6 * 92), Unix sort command
   mcesort with --tally="tallycmd [options]"

   cat many files | mcesort -k1 --tally="tally-count" | mcesort -k2nr | cksum
 
   mcesort:   1m15.028s  MCE variant
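
Note: tally-count is assumed here to collapse runs of identical keys in sorted input into "key count" pairs, which matches the -k2nr sort that follows it. A minimal stand-in filter under that assumption (not the actual implementation from the PerlMonks thread):

  awk '$1 != k { if (NR > 1) print k, c; k = $1; c = 0 } { c++ }
       END { if (NR) print k, c }'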

$tally_cmd defaults to 'cat'.

# Wrap each input file in a bash process substitution, editing in place.
local $_; $_ = "<(cat<$_)" for @$merge_list;

# Pairwise-merge the list until a single command string remains.
while (@$merge_list > 1) {
    my @temp;
    while (@$merge_list) {
        my @files = splice @$merge_list, 0, 2;
        push @temp, (@files >= 2)
            ? "<(sort -m @$sort_args @files|$tally_cmd)"
            : "@files";   # odd one out passes through unchanged
    }
    @$merge_list = @temp;
}

$merge_cmd = shift @$merge_list;
$merge_cmd =~ s/ -m  / -m /g;  # collapse the double space left by an empty @$sort_args
$merge_cmd =~ s/\A<\(//;       # trim leading <(
$merge_cmd =~ s/\)\z//;        # trim trailing )
$merge_cmd =~ s/\|cat\z//;     # trim trailing |cat
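
For illustration, with four hypothetical input files f1..f4, no extra sort arguments, and the default $tally_cmd of 'cat', the loop and trims above produce a single command along these lines (the inner |cat stages remain; only the trailing one is trimmed):

  sort -m <(sort -m <(cat<f1) <(cat<f2)|cat) <(sort -m <(cat<f3) <(cat<f4)|cat)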



Blessings,
   - Mario

P.S. I'm finished with the project. Again, I have no intention of making mcesort popular or anything like that; it remains hidden in a gist. My humble rant: I'm not in favor of parsort creating, by default, thousands of files in $TMPDIR. Other users on the box may not like having to scroll through pages and pages to look for a file. mcesort creates a single folder inside $TMPDIR. Unfortunately, Perl lacks a trap-EXIT capability similar to bash's, so I have bash handle tmpdir removal via trap EXIT. That works reliably, including on Ctrl-C. It may not matter, but I wish for parsort to be mindful of multi-user environments.
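
A minimal sketch of that bash-side cleanup (the directory template is illustrative): one private work directory, removed on any exit path, including Ctrl-C.

  tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/mcesort.XXXXXXXXXX") || exit 1
  trap 'rm -rf "$tmpdir"' EXIT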



On Sat, Feb 25, 2023 at 2:47 AM Mario Roy <marioeroy@gmail.com> wrote:

mcesort, a parsort variant using mini-Perl-MCE, with working -j, --parallel, and more
https://gist.github.com/marioroy/d30a3408474612dc1d289acdc6fbf19a

My intention was never to compete with parsort or anything like that. I have fulfilled my own wish: working -j, --parallel, plus -A (sets LC_ALL=C). It runs on Linux and FreeBSD (supporting FreeBSD sort's various options), including on machines with many processor cores.

Development occurred on Linux, plus testing on FreeBSD, Darwin (macOS), and Cygwin.

I invited Grace and together we created mcesort, a parsort variant.




On Wed, Feb 22, 2023 at 11:16 PM Mario Roy <marioeroy@gmail.com> wrote:

I ran the following commands to capture the number of sort processes in another terminal window. Seeing > 1,100 processes caught me off guard, and I experienced the system briefly locking up while running parsort with many files. Well, that is definitely a second wish-list item -- why so many processes? I witnessed individual processes consuming 11% CPU.

parsort

while true; do ps -ef | grep sort | grep -v parsort | wc -l; sleep 1; done

mcesort

while true; do ps -ef | grep sort | grep -v mcesort | wc -l; sleep 1; done



On Wed, Feb 22, 2023 at 10:58 PM Mario Roy <marioeroy@gmail.com> wrote:
Aloha,

Congratulations on supporting parsort --parallel. I was wondering why the number of processes was so high (with many files) until I read the 20230222 release notes. Now I understand.

First and foremost, mcesort is simply a parsort variant using a mini-MCE parallel engine integrated into the script. I reduced the MCE code to the essentials (fewer than 1,500 lines). The main application is 400 lines. Currently, mcesort is under 1,900 lines total.

1) mcesort supports -A (sets LC_ALL=C) and -j, --parallel with N, N%, or max (e.g. -j12, -j50%, -jmax).

2) Currently, mcesort does not allow -S, --buffer-size. From testing, specifying -S or --buffer-size leads to more memory consumption and degrades performance. Is -S, --buffer-size helpful from the parsort/mcesort perspective?

3) mcesort runs -z, --zero-terminated in parallel, unlike parsort, which consumes one core for it.

4) mcesort accepts --check, -c, -C, --debug, and --merge [--batch-size]; for these it simply passes through and runs sort serially (the exec never returns) when checking or merging sorted input, or debugging incorrect key usage.

    exec('sort', @ARGV) if $pass_through;

Respectfully, I captured results using parsort 20230222 and mcesort (to be released soon). 

#################################################################
~ List of Files (total: 6 * 92 = 552 files), 17 GB in size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

parsort
~~~~~~~
  $ time LC_ALL=C parsort --parallel=64 -k1 \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* | cksum
  867518687 17463513600

    1,109 processes created (brief system lockup)
    physical memory consumption peak 7.79 GB

    real   1m59.565s
    user   1m27.735s
    sys    0m22.013s

mcesort
~~~~~~~
  $ time LC_ALL=C mcesort --parallel=64 -k1 \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  /dev/shm/big* /dev/shm/big* /dev/shm/big* | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
      1 sort and 1 merge per worker
    physical memory consumption peak 2.92 GB

    real   1m57.209s
    user   21m55.152s
    sys    1m15.790s


#################################################################
~ Single File 17 GB
~   cat /dev/shm/big* >> /dev/shm/huge (6 times)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

parsort
~~~~~~~
  $ time LC_ALL=C parsort --parallel=64 -k1 /dev/shm/huge | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
    physical memory consumption peak 2.90 GB

    real   2m11.056s
    user   1m39.646s
    sys    0m22.040s

mcesort
~~~~~~~
  $ time LC_ALL=C mcesort --parallel=64 -k1 /dev/shm/huge | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
    physical memory consumption peak 2.83 GB

    real   1m53.255s
    user   23m52.807s
    sys    0m58.450s

#################################################################
Standard Input
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

parsort
~~~~~~~
  $ time cat \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  | LC_ALL=C parsort --parallel=64 -k1 | cksum
  867518687 17463513600

    193 processes created (no system lockup, fluid)
    physical memory consumption peak 3.05 GB

    real   2m18.442s
    user   1m39.051s
    sys    0m27.548s


mcesort
~~~~~~~
  $ time cat \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
    /dev/shm/big* /dev/shm/big* /dev/shm/big* \
  | LC_ALL=C mcesort --parallel=64 -k1 | cksum
  867518687 17463513600

    128 processes created (no system lockup, fluid)
    physical memory consumption peak 2.75 GB

    real   1m57.487s
    user   22m16.476s
    sys    1m15.481s




On Sat, Feb 18, 2023 at 3:42 AM Mario Roy <marioeroy@gmail.com> wrote:
Are you in the high memory consumption scenario which Nigel describes?

The issue is running parsort on large-scale machines. Running on all cores is often not desirable for memory-intensive applications; the memory channels eventually become the bottleneck.

The mcesort variant has reached the incubator stage (code 100% complete). It supports -j (short option) and --parallel. Note that a percentage value never maps to fewer than 1 CPU core; -j1% still runs at least one worker (see the sketch after the list below).

-jN     integer value
-jN%    percentage value; e.g. -j1% .. -j100%
-jmax, -jauto   same as 100%, i.e. all N available logical cores
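
For illustration, a hypothetical shell rendering of that mapping (mcesort itself is Perl; nproc stands in for the detected logical-core count):

  cores=$(nproc)
  case $val in
      max|auto) n=$cores ;;
      *%)       n=$(( cores * ${val%\%} / 100 ))
                (( n < 1 )) && n=1 ;;   # -j1% still yields one worker
      *)        n=$val ;;
  esac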

The test file is a mockup of randomly generated key-value pairs. There are 323+ million rows.

$ ls -lh /dev/shm/huge
-rw-r--r-- 1 mario mario 2.8G Feb 18 00:48 /dev/shm/huge

$ wc -l /dev/shm/huge
323398400 /dev/shm/huge
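
A hypothetical generator of similar shape (the real key format is not shown in this thread; 9 bytes per row matches the 2910585600-byte cksum count below):

  awk 'BEGIN { srand(); for (i = 0; i < 323398400; i++)
               printf "%04x %03d\n", int(rand() * 65536), int(rand() * 1000) }' > /dev/shm/huge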


With parsort, one cannot specify the number of cores used to process a file, so it spawns 64 workers on this machine. The Perl MCE variant performs similarly. I get better throughput running 38 workers versus 64.

$ time parsort /dev/shm/huge | cksum
3409526408 2910585600

real 0m18.147s
user 0m13.920s
sys 0m3.660s

$ time mcesort -j64 /dev/shm/huge | cksum
3409526408 2910585600

real 0m18.081s
user 2m52.082s
sys 0m10.860s

$ time mcesort -j38 /dev/shm/huge | cksum
3409526408 2910585600

real 0m16.788s
user 2m21.384s
sys 0m8.263s


Regarding standard input, I can run parsort using a wrapper script (given at the top of this email thread). Notice how parsort has better throughput running 38 workers.

$ time parsort -j64 </dev/shm/huge | cksum
3409526408 2910585600

real 0m19.553s
user 0m14.030s
sys 0m3.520s


$ time mcesort -j64 </dev/shm/huge | cksum
3409526408 2910585600

real 0m18.312s
user 2m42.042s
sys 0m11.546s

$ time parsort -j38 </dev/shm/huge | cksum
3409526408 2910585600

real 0m17.609s
user 0m11.856s
sys 0m3.451s


$ time mcesort -j38 </dev/shm/huge | cksum
3409526408 2910585600

real 0m16.819s
user 2m21.108s
sys 0m9.523s



I find it interesting that parsort's reported user time does not reflect the total across all workers (the tally of all workers' time).

This was a challenge, but I can see the finish line; I hope to be done by next week.



On Fri, Feb 17, 2023 at 2:49 PM Rob Sargent <robjsargent@gmail.com> wrote:
On 2/17/23 13:41, Mario Roy wrote:
It looks like we may not get what we kindly asked for. So, I started making "mcesort" using Perl MCE's chunking engine.

On Thu, Feb 16, 2023 at 5:08 AM Nigel Stewart <nigels@nigels.com> wrote:
Can you elaborate on what I am missing from the picture?

Ole,

Perhaps your workloads are more CPU- and I/O-intensive, and latency is less of a priority.
If the workload is memory-intensive, that can be a more important constraint than
the number of available cores. If the workload is interactive (latency-sensitive), it's
undesirable to have too many jobs in flight competing for CPU and I/O, delaying each other.

- Nigel
 
Are you in the high memory consumption scenario which Nigel describes?

If you're going to develop it anyway, you could try submitting a patch to GNU Parallel. 
