Aloha,
$ parallel --version
GNU parallel 20230122
This is a wish-list item: allow the number of threads to be set via an ENVIRONMENT variable that is honored consistently by both parallel and parsort. In particular, I want to set the number of threads for parsort regardless of whether the input files are given as command-line arguments or arrive on STDIN.
In the meantime, I created a wrapper script and placed it in a directory (/usr/local/bin) that comes before the one where parallel resides (/usr/bin) in PATH.
#!/usr/bin/env bash
# Wrapper script for parallel.
# Whoa!!! GNU Parallel assumes you want to consume all CPU cores.
# Unfortunately, one cannot specify the number of threads for parsort.
CMD="/usr/bin/parallel"
if [[ -z "$PARALLEL_NUM_THREADS" ]]; then
    # Variable unset or empty: pass everything through unchanged.
    exec "$CMD" "$@"
elif [[ "$#" -eq 1 && "$1" == "--number-of-threads" ]]; then
    # Report the requested thread count instead of the detected one.
    echo "$PARALLEL_NUM_THREADS"; exit 0
elif [[ "$1" == "-j" ]]; then
    # Drop the caller's leading "-j N" and substitute our own value.
    shift; shift; exec "$CMD" -j "$PARALLEL_NUM_THREADS" "$@"
else
    exec "$CMD" -j "$PARALLEL_NUM_THREADS" "$@"
fi
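Since the wrapper only works if it is found before /usr/bin/parallel, a quick check of the lookup order may help. This is a minimal sketch; the helper name resolved_dir is made up for illustration and is not part of GNU Parallel.

```shell
# resolved_dir NAME: print the directory from which NAME is resolved via PATH.
# Hypothetical helper, for illustration only.
resolved_dir() {
    cmd_path=$(command -v "$1") || return 1
    dirname "$cmd_path"
}
```

With the wrapper installed as described, running resolved_dir parallel should print /usr/local/bin rather than /usr/bin.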
Use case:
export PARALLEL_NUM_THREADS=6
LC_ALL=C parsort -k1 big{1,2,3}.txt | tally-count | LC_ALL=C parsort -k2nr >out.txt
cat big{1,2,3}.txt | LC_ALL=C parsort -k1 | tally-count | LC_ALL=C parsort -k2nr >out.txt
The big files contain two-column key-value pairs delimited by a tab. The output of the first parsort contains duplicate key names.
The tally-count command sums the count fields of adjacent duplicate key names, so its output contains unique key names.
The result is then sorted by sum in descending order, then by key name in ascending order.
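For readers without the tally-count tool, the step described above can be approximated as follows. This is only an assumed equivalent based on the description here, not the actual tally-count implementation; the function name tally_count is hypothetical, and the input is expected to be tab-delimited and already sorted by key.

```shell
# Hypothetical stand-in for tally-count: sums the count field (column 2)
# of adjacent lines that share the same key (column 1).
tally_count() {
    awk -F'\t' -v OFS='\t' '
        # Key changed: emit the finished group and reset the running sum.
        NR > 1 && $1 != key { print key, sum; sum = 0 }
        { key = $1; sum += $2 }
        # Emit the last group, if there was any input at all.
        END { if (NR > 0) print key, sum }
    '
}
```

It would slot into the pipeline in place of tally-count, for example: LC_ALL=C parsort -k1 big{1,2,3}.txt | tally_count | LC_ALL=C parsort -k2nr >out.txt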
Blessings and grace,
- Mario