Re: datamash performance question

From: Jake VanEck
Subject: Re: datamash performance question
Date: Fri, 25 Jun 2021 17:36:26 -0400

So far, this option seems to be putting the data into memory, and I will far exceed what I have available. After just a few minutes, mawk is using over 3 GB of memory and has returned nothing, which matches your comment that it keeps the running sums in memory and writes them out when the input is exhausted. So I guess my problem is that the input won't be exhausted for many GB of data, which is also why datamash was working so wonderfully.
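One likely reason datamash streams well here: unless told to sort, it groups consecutive lines that share a key, so it can emit each group's result as soon as the key changes. A minimal Python sketch of that idea follows, assuming the input is already ordered by key and is tab-separated with the key in the first column and the value in the second. The function name `grouped_sums` and the column layout are illustrative assumptions, not anything taken from datamash itself.

```python
import io
from itertools import groupby


def grouped_sums(lines, key_col=0, val_col=1, sep="\t"):
    """Yield (key, sum) for each run of consecutive lines sharing a key.

    Only the current run is held at any time, so memory use does not
    grow with the size of the input. Assumes the input is already
    ordered by key; an unordered key would show up as several runs.
    """
    rows = (line.rstrip("\n").split(sep) for line in lines)
    for key, group in groupby(rows, key=lambda row: row[key_col]):
        yield key, sum(float(row[val_col]) for row in group)


# Small in-memory stand-in for a large, key-ordered input stream.
sample = io.StringIO("a\t1\na\t3\nb\t2\n")
result = list(grouped_sums(sample))
for key, total in result:
    print(key, total, sep="\t")
```

Because each result is yielded when its run ends, output starts appearing long before the input is exhausted, at the cost of requiring ordered input.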

Any way to run datamash in parallel?


On Fri, Jun 25, 2021 at 4:43 PM Dima Kogan <dima@secretsauce.net> wrote:
Jake VanEck <jake.vaneck@gmail.com> writes:

> I've tried similar commands but doesn't awk need to put the entire dataset
> into memory for this?

No. Absolutely not. It will read the input one line at a time, keeping
the running sums in memory, and it will write out the sums when the
input is exhausted.
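The behaviour described above can be sketched in Python rather than awk. This is only an illustration under assumptions: tab-separated input, key in the first column, numeric value in the second, and a hypothetical function name `running_sums`. It reads one line at a time and keeps a single running total per distinct key, so memory grows with the number of distinct keys rather than with the number of lines, and nothing is printed until the input ends.

```python
import io
from collections import defaultdict


def running_sums(lines, key_col=0, val_col=1, sep="\t"):
    """Sum val_col per key_col while reading the input line by line.

    The lines themselves are never stored; only one float per distinct
    key is kept, and the totals are returned once the input ends.
    """
    sums = defaultdict(float)
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        sums[fields[key_col]] += float(fields[val_col])
    return dict(sums)


# Small in-memory stand-in for a large input stream (keys need not be ordered).
sample = io.StringIO("a\t1\nb\t2\na\t3\n")
totals = running_sums(sample)
for key, total in totals.items():
    print(key, total, sep="\t")
```

If the data has a very large number of distinct keys, the table of totals itself can become large, even though the raw lines are discarded as they are read.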

If you care about performance, try out mawk specifically. It's a bit
snappier than other implementations.
