bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Support in sort for human-readable numbers


From: Pádraig Brady
Subject: Re: Support in sort for human-readable numbers
Date: Tue, 6 Jan 2009 15:37:06 +0000
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Vitali Lovich wrote:
> I've read the proposed patches that have been batted around on the mailing
> list (after coming up with my own implementation :D of course).  My proposed
> solution is less generic, but I believe more robust, than the other
> approaches.
> 
> I've proposed my reasoning below, but I've posted it as a bug on launchpad
> to track this issue 313152 <https://bugs.launchpad.net/bugs/313152>.  The
> patch is against 6.10 instead of trunk mainly because I was too lazy to get
> the build-system set up on Ubuntu.  That being said, I'm pretty sure the
> patch should still work against the trunk.  In any case, if it's necessary,
> I could also do the diff against the trunk.
> 
> Code review?
> What would I need to do to get this mainlined (aside from adding the
> documentation changes)?
> 
> REASONING:
> 
> One of my major assumption is that all the numbers are well formatted.
> 
> In other words, there's an explicit demarcation in the number line (at least
> internal to the input being sorted) after which the suffix increases and the
> number starts again near 0.  For instance, if M represents 1050 Kilobytes,
> then there's no 1051K - it's represented as 1.001M or something along those
> lines.  Again, this would only rely on the input being internally consistent
> - sort needs no knowledge or hints of what those suffixes represent.

I like the idea.

So it doesn't support sorting these correctly for example:

999MB
998MiB
1GiB
1030MiB

I.E. a mixture of ^2 and ^10 are not supported,
nor overlapping number ranges.

That's probably OK though since a tool shouldn't
generate mixtures like that on a particular run.

You had a few questions in your patch...

+ /* FIXME: add support for aliases (i.e. k/K, m/M, g/G) */

Not necessary I think. You already handle case aliases with toupper().

+  /* FIXME: maybe add option to check for longer suffixes (i.e. gigabyte) */

You should allow at least G, GiB and GB formats.
Probably should print error if more than one of those
formats used, since that's not supported.

+  /* FIXME: maybe add option to allow for spacing between number and suffix */

I wouldn't bother. If we supported general input maybe, but not otherwise.

+  /* FIXME: maybe not use exponent if one of the strings uses an
+     unrecognized suffix that's not a blank */

I would flag that as an error

+  /* FIXME:  a_order - b_order || raw_comparison can be used - would that
+     be faster? */

Yep if you're not supporting overlapping number ranges then
you can skip the number comparison totally if the suffixes don't match.

cheers,
Pádraig.

Received: from tombstone.lincor.com ([84.203.137.218]) by mail1.mxsweep.com 
with Microsoft SMTPSVC(6.0.3790.3959);
         Tue, 6 Jan 2009 15:37:32 +0000
Received: from [192.168.2.25] (crom.labs.lincor.com [192.168.2.25])
        (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
        (No client certificate requested)
        by tombstone.lincor.com (Postfix) with ESMTP id B1C25600C209;
        Tue,  6 Jan 2009 15:37:06 +0000 (GMT)
Message-ID: <address@hidden>
Date: Tue, 6 Jan 2009 15:37:06 +0000
From: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?= <address@hidden>
User-Agent: Thunderbird 2.0.0.6 (X11/20071008)
MIME-Version: 1.0
To: Vitali Lovich <address@hidden>
CC: address@hidden
Subject: Re: Support in sort for human-readable numbers
References: <address@hidden>
In-Reply-To: <address@hidden>
X-Enigmail-Version: 0.95.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Return-Path: address@hidden
X-OriginalArrivalTime: 06 Jan 2009 15:37:34.0701 (UTC) 
FILETIME=[B22229D0:01C97014]
x-MXSweep-KeywordsCount: 0
x-MXSweep-MetaScanResult: Clean
x-MXSweep-MetaScanThreat:
x-MXPurifier-SpamScore: 0
x-MXPurifier-VirusScore: 0
X-MXSweep-Threat: Clean
X-MXUniqueID: 738389df-d34d-4bcb-b40d-2038ba960ee6

Vitali Lovich wrote:
> I've read the proposed patches that have been batted around on the mailing
> list (after coming up with my own implementation :D of course).  My proposed
> solution is less generic, but I believe more robust, than the other
> approaches.
> 
> I've proposed my reasoning below, but I've posted it as a bug on launchpad
> to track this issue 313152 <https://bugs.launchpad.net/bugs/313152>.  The
> patch is against 6.10 instead of trunk mainly because I was too lazy to get
> the build-system set up on Ubuntu.  That being said, I'm pretty sure the
> patch should still work against the trunk.  In any case, if it's necessary,
> I could also do the diff against the trunk.
> 
> Code review?
> What would I need to do to get this mainlined (aside from adding the
> documentation changes)?
> 
> REASONING:
> 
> One of my major assumption is that all the numbers are well formatted.
> 
> In other words, there's an explicit demarcation in the number line (at least
> internal to the input being sorted) after which the suffix increases and the
> number starts again near 0.  For instance, if M represents 1050 Kilobytes,
> then there's no 1051K - it's represented as 1.001M or something along those
> lines.  Again, this would only rely on the input being internally consistent
> - sort needs no knowledge or hints of what those suffixes represent.

I like the idea.

So it doesn't support sorting these correctly for example:

999MB
998MiB
1GiB
1030MiB

I.E. a mixture of ^2 and ^10 are not supported,
nor overlapping number ranges.

That's probably OK though since a tool shouldn't
generate mixtures like that on a particular run.

You had a few questions in your patch...

+ /* FIXME: add support for aliases (i.e. k/K, m/M, g/G) */

Not necessary I think. You already handle case aliases with toupper().

+  /* FIXME: maybe add option to check for longer suffixes (i.e. gigabyte) */

You should allow at least G, GiB and GB formats.
Probably should print error if more than one of those
formats used, since that's not supported.

+  /* FIXME: maybe add option to allow for spacing between number and suffix */

I wouldn't bother. If we supported general input maybe, but not otherwise.

+  /* FIXME: maybe not use exponent if one of the strings uses an
+     unrecognized suffix that's not a blank */

I would flag that as an error

+  /* FIXME:  a_order - b_order || raw_comparison can be used - would that
+     be faster? */

Yep if you're not supporting overlapping number ranges then
you can skip the number comparison totally if the suffixes don't match.

cheers,
Pádraig.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]