Hello,
I'm processing many terabytes of data (in 200-500 GB chunks), currently doing something along these lines:
time (echo test2 && echo test1 && echo test2 && echo test1 && echo test1 && echo test3) \
  | sort | uniq -c \
  | tr ' ' '\t' | sed -e 's/^[ \t]*//' \
  | sort -k 2 \
  | datamash -g 2 sum 1 --filler 0 \
  | awk -F'\t' '{print $2"\t"$1}' > output.txt
Right now, data is arriving at the highlighted datamash command faster than datamash can process it. Is there any way to parallelize the datamash command, or to parallelize that stage of the pipe for datamash? My system is bottlenecking at that step, and I'm trying to figure out whether there's anything I can do to speed it up.
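One idea I've been toying with: since the group-by only needs all rows for a given key to end up in the same stream, I could shard rows by a hash of the key, aggregate each shard concurrently, and concatenate the results. Here's a rough sketch of that shape; the shard count, temp-file layout, and the awk hash are all arbitrary choices of mine, and the per-shard awk sum is just a stand-in so the sketch runs anywhere (each one could equally be `sort -k 2 | datamash -g 2 sum 1`, or the whole loop driven by GNU parallel):

```shell
#!/bin/sh
# Sketch: shard "count<TAB>value" rows by a hash of the value (field 2),
# aggregate each shard concurrently, then concatenate the results.
# NSHARDS and the tmp layout are illustrative, not anything datamash needs.
set -e
NSHARDS=4
tmp=$(mktemp -d)

# Sample data in the shape `uniq -c | tr ' ' '\t' | sed 's/^[ \t]*//'` emits.
printf '2\ttest1\n1\ttest2\n3\ttest1\n1\ttest3\n2\ttest2\n' > "$tmp/input"

# 1) Partition: rows sharing a key always land in the same shard file.
awk -F'\t' -v n="$NSHARDS" -v dir="$tmp" '
  BEGIN { for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i }
  { h = 0
    for (i = 1; i <= length($2); i++)
      h = (h * 31 + ord[substr($2, i, 1)]) % n
    print > (dir "/shard." h)
  }' "$tmp/input"

# 2) Aggregate each shard in its own background job.
#    (Swap this awk for `sort -k 2 | datamash -g 2 sum 1` per shard.)
for f in "$tmp"/shard.*; do
  awk -F'\t' '{ s[$2] += $1 }
              END { for (k in s) print k "\t" s[k] }' "$f" > "$f.out" &
done
wait

# 3) Keys never cross shards, so concatenation is the final answer.
cat "$tmp"/shard.*.out | sort > output.txt
rm -r "$tmp"
```

The key property is in step 3: because no key spans two shards, the merged output needs no further reduction, only a cosmetic sort.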
Any suggestions or ideas on how I can do this with datamash (or anything else for that matter) would be greatly appreciated.
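For what it's worth, one alternative I've considered (assuming the number of distinct keys fits in memory, which may not hold for my data) is collapsing the whole `sort -k 2 | datamash -g 2 sum 1 | awk ...` tail into a single streaming awk pass — an associative array doesn't need its input grouped, so the second sort disappears entirely and memory scales with distinct keys rather than rows:

```shell
# Sketch: replace `sort -k 2 | datamash -g 2 sum 1 | awk '{print $2"\t"$1}'`
# with one streaming pass. Input shape is "count<TAB>value", as emitted
# upstream by `uniq -c | tr ' ' '\t' | sed 's/^[ \t]*//'`.
# The output filename is illustrative. Key order in the output is
# whatever awk's for-in yields, so sort afterwards if order matters.
printf '2\ttest1\n1\ttest2\n3\ttest1\n' |
awk -F'\t' '{ sum[$2] += $1 }
            END { for (k in sum) print k "\t" sum[k] }' > summed.txt
```

If the second sort does stay, I understand GNU coreutils sort also accepts `--parallel=N` and a `-S` buffer size, which might relieve some of the pressure around that stage.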
Thank you,
-Jake