Re: [Bug-datamash] histograms and/or CDFs

From: Assaf Gordon
Subject: Re: [Bug-datamash] histograms and/or CDFs
Date: Wed, 13 Aug 2014 15:20:29 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0

Hello Miah,

On 08/13/2014 01:48 PM, Miah Ness wrote:
> Thanks for developing this nice tool. I've been looking for something
> like this for years, and always resorted to perl/awk one-liners.

Thank you for your kind words.
I'm glad to see others are finding 'datamash' useful; I had been looking for
such a tool for a long time myself :)

I should note that 'datamash' is not only shorter and more concise than those
one-liners, it also performs stricter input validation (awk and perl silently
ignore malformed input unless you test for it explicitly),
which is why I prefer 'datamash' for scripting and automation.

I'm planning to write a web-page to illustrate this issue soon.

> Would you be interested in integrating support for histograms and/or
> cumulative distribution functions?

< snip >

> The hist operator takes three arguments:
>
> <start> is the starting bucket value, <size> is the size of each
> bucket, and <count> is the number of buckets.


This is a terrific idea, which fits exactly where I'd imagine datamash to help.

There is currently a technical limitation: each operator takes just a
single parameter
(the column number, or the column name in the coming version).
I have been considering accepting multiple parameters, and your syntax is a
good suggestion.

I'll try to add a histogram operation, and see how it works out.

In the meantime, though it is still a bit cumbersome,
you could try:

   $ cat data | \
       awk '{ $1 = int(($1/10))*10 ; print $0 }' | \
       ./datamash -g 1 count 1
   0    3
   10   2
   30   4

The 'awk' takes care of the binning, and datamash does the counting.
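
A minimal sketch of the same binning step with a configurable bucket size, in case it is useful. The function name `bin_first_column` is hypothetical and not part of datamash. It assumes tab-separated input (datamash's default field separator) and rounds negative values down rather than toward zero, which plain int() does not do.

```shell
# Hypothetical helper (not part of datamash): round column 1 of
# tab-separated stdin down to a multiple of SIZE (default 10),
# leaving the other columns unchanged.
bin_first_column() {
    awk -F '\t' -v OFS='\t' -v size="${1:-10}" '
    {
        b = int($1 / size)
        if (b * size > $1) b--    # int() truncates toward zero; adjust negatives
        $1 = b * size
        print
    }'
}

# Intended use (requires datamash):
#   bin_first_column 10 < data | datamash -s -g 1 count 1
```

The `-s` option asks datamash to sort the input before grouping, since `-g` expects rows with the same key to be adjacent.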

Regarding your syntax:
How do you think it should handle values smaller than "start" or larger than
the upper bound of the last bucket?
It could fail with an error (easier to program, but bad for the user);
It could bin the outlier values into the first/last bins (Excel's histogram
does this, but it might be confusing to users);

Or it could take an 'offset' (defaulting to zero) instead of 'start', plus a
'bucket size'. Then there is no "first" or "last" bucket: 'datamash' will
simply output as many buckets/bins as the input file requires:
    $ cat file | datamash bin:10 1
will bin the values in the first column to buckets of size 10.
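
The three ways of handling out-of-range values discussed above can be sketched in awk. This is only an illustration under stated assumptions: the function `bin_column` and its mode names (`error`, `clamp`, `open`) are hypothetical and do not correspond to any existing datamash option.

```shell
# Hypothetical helper: bin column 1 of stdin into buckets.
# Usage: bin_column START SIZE COUNT MODE
#   MODE=error  fail on values outside [START, START + SIZE*COUNT)
#   MODE=clamp  fold outliers into the first or last bucket
#   MODE=open   ignore COUNT; emit whatever buckets the data needs
bin_column() {
    awk -v start="$1" -v size="$2" -v count="$3" -v mode="$4" '
    {
        # Bucket index relative to start, rounded down (also for negatives).
        idx = int(($1 - start) / size)
        if (start + idx * size > $1) idx--

        if (mode != "open" && (idx < 0 || idx >= count)) {
            if (mode == "error") {
                printf "value %s is outside the bucket range\n", $1 > "/dev/stderr"
                exit 1
            }
            # mode == "clamp"
            idx = (idx < 0) ? 0 : count - 1
        }
        # Print the lower bound of the chosen bucket.
        print start + idx * size
    }'
}
```

With start 0, size 10 and count 3, the value 47 is rejected in `error` mode, lands in bucket 20 in `clamp` mode, and lands in bucket 40 in `open` mode.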

Additionally, what do you think of splitting this into two operations: 'bin'
and 'count' (reusing the existing 'count')?

Such as:
   $ cat data | datamash bin:0:10:5 1 | datamash -g 1 count 1
   0    3
   10   2
   30   4

 - Assaf
