bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: efficient version of 'sort | uniq -c | sort -n'?


From: Matthew Woehlke
Subject: Re: efficient version of 'sort | uniq -c | sort -n'?
Date: Mon, 21 May 2007 16:27:33 -0500
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.10) Gecko/20070221 Thunderbird/1.5.0.10 Mnenhy/0.7.4.0

Philip Rowlands wrote:
On Mon, 21 May 2007, Matthew Woehlke wrote:
I thought about that, but /maximum/ efficiency is only achievable doing everything in one go. Anyway I think 'countitems' would still be a big improvement; I would do that as 'sort --unique-with-count' (preferably aliased 'sort -U') since IMO this is a missing feature of 'sort -u'.

You don't really want to do the first sort at all - it's just a convenient way of creating the buckets. The relative order of each bucket is unimportant, but that's what sort spends a long time calculating.

Right. That's why the current solution 'sort | uniq -c | sort -n' sucks. :-)

Actually IIRC the replacement I wrote does exactly this, it builds an AVL tree of "keys" that tracks the frequency (although the AVL tree has the side effect of also sorting, the point is to speed key lookup; a hash could also be used, although I suspect for my particular use case the hash might actually be /slower/).

The trailing "sort" could be done inside perl, but it doesn't help the (algorithmic) efficiency, and we're not playing perl golf...

Doesn't help? Not in the strict sense (in that bucket-distribution is of a higher order of complexity), no, but it removes the need for a int -> string -> int conversion which certainly has some cost, especially if there are many "buckets". Is it really, really important to avoid this cost? Well... possibly not (although I would want to have hard numbers before making that call). As I said, "I think 'countitems' would still be a big improvement".

Anyway, neither of my original questions has been answered yet :-). I'm guessing the answer to the first is "no".

--
Matthew
When in doubt, duct tape!





reply via email to

[Prev in Thread] Current Thread [Next in Thread]