[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: sort --random-sort
From: |
Jim Meyering |
Subject: |
Re: sort --random-sort |
Date: |
Sat, 26 Nov 2005 16:03:10 +0100 |
Frederik Eaton <address@hidden> wrote:
...
>> This is my second (or third? or fourth?) attempt at a patch to sort to
>> introduce shuffling behavior. However, I encourage people to take a
Thanks for persevering.
>> good look at it because it's much more "polished" than the others.
>> I've included everything I think should be included - documentation,
>> ChangeLog entries, etc., so hopefully there won't have to be many more
>> changes.
>>
>> I'm sorry it's taken me so long to return to this - I've moved
>> countries, been occupied with grad school, etc.
>>
>> There are two patches:
>>
>> - shred-isaac-split.patch
>> splits ISAAC code out of shred.c into new files isaac.c, isaac.h
>>
>> - sort-with-isaac.patch
>> adds --sort-random etc. to sort.c, using ISAAC RNG.
>>
>> They should be applied in this order.
>>
>> Notes:
>>
>> - The randomization is "cryptographically secure" due to use of ISAAC,
>> as has been requested by Paul Eggert. Earlier versions used a separate
>> random number implementation but it was pointed out that shred.c
>> already contained the necessary code (based on a library from Bob
>> Jenkins called ISAAC), so the main change in this version has been to
>> separate out our version of ISAAC from shred.c and use it instead.
>> Ultimately I hope this can be moved to glibc, so that making things
>> cryptographically secure doesn't require any extra effort, but that's
>> a separate issue.
>>
>> - The second patch changes a constant ISAAC_LOG in isaac.h from 8 to
>> 3. The code explicitly says that this can be done, so, going by that
I'm leery of this change.
Won't such a reduction in the size of the state array
have a detrimental effect on quality of the random numbers?
Maybe we need an option to trade off speed for quality,
if it makes such a big difference.
I've included some comments below, mostly for improved
maintainability/readability, and to keep this code consistent
with existing style and practices used in coreutils.
isaac.c should eventually move to lib/, but for now
it's ok to leave it in src/. Even in src/, the file name
could be more evocative of its function. How about rand-isaac.c?
> --- coreutils-cvs-1-shred-isaac-split/src/sort.c 2005-10-07
> 19:48:28.000000000 +0100
> +++ coreutils-cvs-2-sort-with-isaac/src/sort.c 2005-11-26
> 07:19:31.000000000 +0000
> @@ -38,6 +38,8 @@
> #include "strnumcmp.h"
> #include "xmemcoll.h"
> #include "xstrtol.h"
> +#include "isaac.h"
> +#include "md5.h"
Alphabetize.
...
> @@ -1152,6 +1163,29 @@
> return 0;
> }
>
> +struct isaac_state *rand_state;
The above should be `static'.
> +#define HASH_WORDS 1
> +#define HASH_SIZE (HASH_WORDS * sizeof (uint32_t))
> +
> +static void
> +get_hash (char* text, int len, uint32_t resbuf[/*HASH_WORDS*/])
The `text' parameter should be `const'.
`len' should be of type size_t.
> +{
> + struct isaac_state *s = alloca (sizeof (struct isaac_state));
> + struct irand_state *r = alloca (sizeof (struct irand_state));
`r' is unused, so you can remove it.
Don't use alloca here. Instead, just declare `s' like this
struct isaac_state s;
and adjust the uses.
> + memcpy (s, rand_state, sizeof (struct isaac_state));
Please do this instead:
memcpy (s, rand_state, sizeof *s);
...
> + else if (key->random_hash)
> + {
> + char diga[HASH_SIZE];
> + char digb[HASH_SIZE];
If you declare these to be of type `uint32_t diga[HASH_WORDS]',
then you can remove the casts below:
> + get_hash (texta, lena, (uint32_t*) diga);
> + get_hash (textb, lenb, (uint32_t*) digb);
> + diff = memcmp (diga, digb, sizeof (diga));
> + }
...
> @@ -2109,6 +2155,8 @@
> int c = 0;
> bool checkonly = false;
> bool mergeonly = false;
> + char *randseed = 0;
randseed can be (and hence should be) `char const *'
...
> + case SEED_OPTION:
> + randseed = strdupa (optarg);
Using strdupa is not portable.
Use xstrdup instead.
> - for (key = keylist; key; key = key->next)
...
> + for (key = keylist; key; key = key->next)
I wondered why the preceding lines appeared in the diff output.
Then saw that you'd added a trailing blank.
Please remove it.
> + {
> + need_rand |= key->random_hash;
> + if (! (key->ignore || key->translate
> + || (key->skipsblanks | key->reverse
> + | key->skipeblanks | key->month | key->numeric
> + | key->general_numeric
> + | key->random_hash)))
> + {
> + key->ignore = gkey.ignore;
> + key->translate = gkey.translate;
> + key->skipsblanks = gkey.skipsblanks;
> + key->skipeblanks = gkey.skipeblanks;
> + key->month = gkey.month;
> + key->numeric = gkey.numeric;
> + key->general_numeric = gkey.general_numeric;
> + key->random_hash = gkey.random_hash;
> + key->reverse = gkey.reverse;
> + }
> + }
> +
> + if (need_rand) {
Put each opening curly brace on its own line.
Same for the others.
> + rand_state = xmalloc (sizeof (struct isaac_state));
> + memset (rand_state, 0, sizeof (struct isaac_state));
Use xzalloc (from lib/xalloc.h), instead of the two preceding lines:
rand_state = xzalloc (sizeof *rand_state);
> + if(randseed) {
Add a space after the `if'
> + isaac_seed_start (rand_state);
> + isaac_seed_data (rand_state, randseed, strlen (randseed));
> + isaac_seed_finish (rand_state);
> + } else
Don't cuddle the `else'.
> + isaac_seed (rand_state);
> + }
- sort --random-sort, Frederik Eaton, 2005/11/26
- Re: sort --random-sort, Frederik Eaton, 2005/11/26
- Re: sort --random-sort,
Jim Meyering <=
- Re: sort --random-sort, Andreas Schwab, 2005/11/26
- Re: sort --random-sort, Frederik Eaton, 2005/11/26
- Re: sort --random-sort, Jim Meyering, 2005/11/29
- Re: sort --random-sort, Paul Eggert, 2005/11/29
- Re: sort --random-sort, Frederik Eaton, 2005/11/30
- Re: sort --random-sort, Jim Meyering, 2005/11/30
- Re: sort --random-sort, Frederik Eaton, 2005/11/30
- Re: sort --random-sort, Paul Eggert, 2005/11/28
- Re: sort --random-sort, Jim Meyering, 2005/11/28
- Re: sort --random-sort, Frederik Eaton, 2005/11/28