Hello,
I'm processing many terabytes of data (in 200-500 GB chunks), currently doing something along these lines:
time (echo test2 && echo test1 && echo test2 && echo test1 && echo test1 && echo test3) \
  | sort | uniq -c \
  | tr ' ' '\t' | sed -e 's/^[ \t]*//' \
  | sort -k 2 \
  | datamash -g 2 sum 1 --filler 0 \
  | awk -F'\t' '{print $2"\t"$1}' > output.txt
Right now, data is arriving at the highlighted datamash command faster than datamash can process it. Is there any way to parallelize the datamash command, or to parallelize that stage of the pipe for datamash? My system is bottlenecking at that step, and I'm trying to figure out whether there's anything I can do to speed it up.
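One idea I've been toying with: since the group-by only needs all rows for a given key to end up in the same stream, I could shard rows by a hash of the key, aggregate each shard concurrently, and concatenate the results. Here's a rough sketch of that shape; the shard count, temp-file layout, and the awk hash are all arbitrary choices of mine, and the per-shard awk sum is just a stand-in so the sketch runs anywhere (each one could equally be `sort -k 2 | datamash -g 2 sum 1`, or the whole loop driven by GNU parallel):

```shell
#!/bin/sh
# Sketch: shard "count<TAB>value" rows by a hash of the value (field 2),
# aggregate each shard concurrently, then concatenate the results.
# NSHARDS and the tmp layout are illustrative, not anything datamash needs.
set -e
NSHARDS=4
tmp=$(mktemp -d)

# Sample data in the shape `uniq -c | tr ' ' '\t' | sed 's/^[ \t]*//'` emits.
printf '2\ttest1\n1\ttest2\n3\ttest1\n1\ttest3\n2\ttest2\n' > "$tmp/input"

# 1) Partition: rows sharing a key always land in the same shard file.
awk -F'\t' -v n="$NSHARDS" -v dir="$tmp" '
  BEGIN { for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i }
  { h = 0
    for (i = 1; i <= length($2); i++)
      h = (h * 31 + ord[substr($2, i, 1)]) % n
    print > (dir "/shard." h)
  }' "$tmp/input"

# 2) Aggregate each shard in its own background job.
#    (Swap this awk for `sort -k 2 | datamash -g 2 sum 1` per shard.)
for f in "$tmp"/shard.*; do
  awk -F'\t' '{ s[$2] += $1 }
              END { for (k in s) print k "\t" s[k] }' "$f" > "$f.out" &
done
wait

# 3) Keys never cross shards, so concatenation is the final answer.
cat "$tmp"/shard.*.out | sort > output.txt
rm -r "$tmp"
```

The key property is in step 3: because no key spans two shards, the merged output needs no further reduction, only a cosmetic sort.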
Any suggestions or ideas on how I can do this with datamash (or anything else for that matter) would be greatly appreciated.
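For what it's worth, one alternative I've considered (assuming the number of distinct keys fits in memory, which may not hold for my data) is collapsing the whole `sort -k 2 | datamash -g 2 sum 1 | awk ...` tail into a single streaming awk pass — an associative array doesn't need its input grouped, so the second sort disappears entirely and memory scales with distinct keys rather than rows:

```shell
# Sketch: replace `sort -k 2 | datamash -g 2 sum 1 | awk '{print $2"\t"$1}'`
# with one streaming pass. Input shape is "count<TAB>value", as emitted
# upstream by `uniq -c | tr ' ' '\t' | sed 's/^[ \t]*//'`.
# The output filename is illustrative. Key order in the output is
# whatever awk's for-in yields, so sort afterwards if order matters.
printf '2\ttest1\n1\ttest2\n3\ttest1\n' |
awk -F'\t' '{ sum[$2] += $1 }
            END { for (k in sum) print k "\t" sum[k] }' > summed.txt
```

If the second sort does stay, I understand GNU coreutils sort also accepts `--parallel=N` and a `-S` buffer size, which might relieve some of the pressure around that stage.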
Thank you,
-Jake