|Subject:||Re: [Bug-datamash] Feature request: percentiles|
|Date:||Tue, 14 Mar 2017 22:28:33 -0700|
Sorry for the delayed response.
> On Mar 6, 2017, at 02:57, Barry Nisly <address@hidden> wrote:
> I just found out about datamash and I want to thank you for creating such a useful tool.
Thank you for your kind words.
> My request is to add percentile in addition to the quartile calculations.
> I typically deal with latencies and am interested in 90, 95, or 99 percentiles. Arbitrary percentiles would be great but, in looking at the code, it doesn’t seem easy to implement. Creating hardcoded percentile calculations (e.g., 90, 95, 99) would be simple (adding the opcodes and connecting them to percentile_value() in src/utils.c).
> Ideally, I could specify an arbitrary percentile, e.g., ‘percentile_93’ and have the parser parse out the percentile and pass it along with the ‘percentile’ opcode.
> I may take a crack at implementing this as time permits and if there is any interest in the feature.
I like this idea very much.
If I may suggest:
There are already two operations that accept a parameter: 'bin' and 'strbin'.
In their case the optional parameter determines the bucket size.
e.g. default bucket size of 100:
seq 1 500 | datamash --full bin 1
vs bucket size of 10:
seq 1 500 | datamash --full bin:10 1
The parser (in op-parser.c) already takes the value after a ':' and uses it as a parameter.
The function op-parser.c:set_op_params() checks if the parameter can be used with the requested operation.
I would try to implement a 'percentile' operation exactly in that way (in terms of parsing).
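A minimal sketch of the parsing side, to illustrate the 'name:param' idea. This is not the code in op-parser.c; the function name parse_percentile_spec, the caller-supplied default, and the accepted range of 1 to 100 are all assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not datamash's actual parser): accept
   "percentile" or "percentile:N" and store N in *out.
   When no ':' is present, default_value is used instead.
   Returns 0 on success, -1 on an unknown name or a bad parameter. */
int
parse_percentile_spec (const char *spec, long default_value, long *out)
{
  assert (spec != NULL && out != NULL);

  const char *colon = strchr (spec, ':');
  size_t name_len = colon ? (size_t) (colon - spec) : strlen (spec);

  /* The operation name must be exactly "percentile". */
  if (name_len != strlen ("percentile")
      || strncmp (spec, "percentile", name_len) != 0)
    return -1;

  /* No parameter given: fall back to the default. */
  if (colon == NULL)
    {
      *out = default_value;
      return 0;
    }

  /* Parse the text after ':' as a base-10 integer, rejecting
     empty input and trailing garbage. */
  char *end = NULL;
  errno = 0;
  long value = strtol (colon + 1, &end, 10);
  if (errno != 0 || end == colon + 1 || *end != '\0')
    return -1;

  /* Assumed valid range for this sketch. */
  if (value < 1 || value > 100)
    return -1;

  *out = value;
  return 0;
}
```

In datamash itself the split at ':' is already done by the parser, so the real change would mostly be teaching set_op_params() to accept a parameter for the new operation.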
In terms of processing, it should probably be a case very similar to OP_QUARTILE_1/3/IQR/MEDIAN
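For the processing side, a rough sketch of an arbitrary-percentile calculation on already-sorted values. It assumes linear interpolation between the two nearest ranks; percentile_value() in src/utils.c may use a different definition, and the name percentile_sorted is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: percentile of an ascending-sorted array, using linear
   interpolation between the two closest ranks. 'percentile' is given
   on a 0..100 scale. */
double
percentile_sorted (const double *sorted, size_t n, double percentile)
{
  assert (sorted != NULL && n > 0);
  assert (percentile >= 0.0 && percentile <= 100.0);

  /* Fractional position of the requested percentile in the array. */
  double pos = (percentile / 100.0) * (double) (n - 1);

  /* pos is non-negative, so truncation is the same as floor. */
  size_t lo = (size_t) pos;
  size_t hi = (lo + 1 < n) ? lo + 1 : lo;
  double frac = pos - (double) lo;

  return sorted[lo] + frac * (sorted[hi] - sorted[lo]);
}
```

With this shape, the median and the quartiles are just the 50, 25 and 75 cases of the same function, which is why the new operation should fit next to the existing quartile cases.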
Please do try your hand at it; I'm happy to help make it work. Also feel free to send partial patches and we'll discuss and improve them.
I apologize in advance if my replies are a bit delayed - a bit hectic at work at the moment.