datamash performance question

From: Jake VanEck
Subject: datamash performance question
Date: Fri, 25 Jun 2021 13:38:53 -0400


I'm processing many terabytes of data (in 200-500 GB chunks), currently with a pipeline along the lines of this:

time (echo test2 && echo test1 && echo test2 && echo test1 && echo test1 && echo test3) | sort | uniq -c | tr ' ' \\t | sed -e 's/^[ \t]*//' | sort -k 2 | datamash -g2 sum 1 --filler 0 | awk -F"\t" '{print $2"\t"$1}' > output.txt
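To make the grouping step concrete, here is a minimal sketch of what the datamash stage (-g2 sum 1) computes, written with awk so it runs without datamash. The function name sum_by_key is hypothetical, and the input is assumed to be tab-separated "count<TAB>key" lines like those reaching datamash above. This is only an illustration of the operation; whether it would be any faster on real data is untested.

```shell
# Hypothetical helper: sum column 1 grouped by column 2 (tab-separated input),
# printing "sum<TAB>key", the same layout the final awk step produces.
sum_by_key() {
  awk -F'\t' '{ total[$2] += $1 } END { for (k in total) print total[k] "\t" k }' |
    sort -k2,2   # awk's array order is unspecified, so sort by key afterwards
}

# Sample "count<TAB>key" lines; test1 appears twice, so its counts are summed.
printf '3\ttest1\n2\ttest2\n1\ttest3\n2\ttest1\n' | sum_by_key
```

Unlike datamash grouping, the awk array does not need its input sorted by key, but it does hold one entry per distinct key in memory.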

Right now, data arrives at the datamash command above (datamash -g2 sum 1 --filler 0) much faster than datamash can process it. Is there any way to parallelize the datamash command, or that part of the pipe? My system is bottlenecking at that step, and I'm trying to figure out whether there's anything I can do to make it go faster.

Any suggestions or ideas on how I can do this with datamash (or anything else for that matter) would be greatly appreciated. 

Thank you,

