bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] Makes sort create random order


From: Paul Eggert
Subject: Re: [PATCH] Makes sort create random order
Date: Fri, 28 Jan 2005 12:00:40 -0800
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)

Frederik Eaton <address@hidden> writes:

> You don't get exactly the same behavior, e.g. the final distinction
> between 3 and 3.0 disappears, but I think this is OK.

But this would cause the randomization option to not be orthogonal to
the rest of the options.  The resulting behavior would be hard to
document, and it would be hard to handle bug reports when it didn't
behave the way users expected.

We should have a simple way to explain the proposed behavior, a way
that works in general.

> c) whether the sorting should depend on the order of the input keys or
> the keys themselves
>
> I want it to depend on the keys themselves.

No, this issue should be orthogonal to the rest of "sort".  In the
rest of "sort", input order matters when you specify the -s (--stable)
option; otherwise it doesn't matter.  Randomness shouldn't change this.

> Why don't you think I want to retain ties? If two keys are equal, and
> you take their hash, they are still equal.

But you appended a different random number to each key, right?  So
they won't be equal after that.

Perhaps I didn't understand your proposed algorithm.  Did you intend
to (say) append the same random number to each key, compute an MD5
checksum of the result, and then sort based on the checksum?

> As for the nature of the investigations, well, anything for which
> one needs a random permutation, I suppose. Also, random sampling
> with sort -R | head, though somewhat inefficient, but convenient,

But these uses should not attempt to sort ties together.  They should
attempt to sort them separately.

Perhaps it would be better to think of things in a completely
different way.  Currently "sort" lets you specify keys that can be any
contiguous set of input columns.  Perhaps we should extend the syntax
for keys, so that the option "-k random" means to use an imaginary
extra column of completely random values.  That would be easy to
explain to people.  It would also let people decide for themselves
what to do with ties.

One problem with implementing this idea efficiently is that no matter
what size you use for the random values, it's wrong.  If the size is
too small, you may get collisions.  If it's too large, "sort" will be
too inefficient.  Besides, you shouldn't have to store the random
values at all -- they're just an internal concept.  For example, when
the sort is so large that you need temporary files, there should be no
need to save the random numbers in those files, as you can always
generate new ones later.  But doing this correctly, so that the result
is still a uniform distribution of permutations, will be a bit tricky.

> would something like md5 be too slow?  Would something much simpler
> be insufficiently random?

Why not let the user specify a source for randomness?  The
sufficiently paranoid can specify /dev/random.  Those who want
reproducible results can specify a simple pseudorandom generator.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]