Re: sort --random-sort

bug-coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: sort --random-sort

From:	Jim Meyering
Subject:	Re: sort --random-sort
Date:	Sat, 26 Nov 2005 16:03:10 +0100

Frederik Eaton <address@hidden> wrote:
...
>> This is my second (or third? or fourth?) attempt at a patch to sort to
>> introduce shuffling behavior. However, I encourage people to take a

Thanks for persevering.

>> good look at it because it's much more "polished" than the others.
>> I've included everything I think should be included - documentation,
>> ChangeLog entries, etc., so hopefully there won't have to be many more
>> changes.
>>
>> I'm sorry it's taken me so long to return to this - I've moved
>> countries, been occupied with grad school, etc.
>>
>> There are two patches:
>>
>> - shred-isaac-split.patch
>>     splits ISAAC code out of shred.c into new files isaac.c, isaac.h
>>
>> - sort-with-isaac.patch
>>     adds --sort-random etc. to sort.c, using ISAAC RNG.
>>
>> They should be applied in this order.
>>
>> Notes:
>>
>> - The randomization is "cryptographically secure" due to use of ISAAC,
>> as has been requested by Paul Eggert. Earlier versions used a separate
>> random number implementation but it was pointed out that shred.c
>> already contained the necessary code (based on a library from Bob
>> Jenkins called ISAAC), so the main change in this version has been to
>> separate out our version of ISAAC from shred.c and use it instead.
>> Ultimately I hope this can be moved to glibc, so that making things
>> cryptographically secure doesn't require any extra effort, but that's
>> a separate issue.
>>
>> - The second patch changes a constant ISAAC_LOG in isaac.h from 8 to
>> 3. The code explicitly says that this can be done, so, going by that

I'm leery of this change.
Won't such a reduction in the size of the state array
have a detrimental effect on quality of the random numbers?

Maybe we need an option to trade off speed for quality,
if it makes such a big difference.

I've included some comments below, mostly for improved
maintainability/readability, and to keep this code consistent
with existing style and practices used in coreutils.

isaac.c should eventually move to lib/, but for now
it's ok to leave it in src/.  Even in src/, the file name
could be more evocative of its function.  How about rand-isaac.c?


> --- coreutils-cvs-1-shred-isaac-split/src/sort.c      2005-10-07 
> 19:48:28.000000000 +0100
> +++ coreutils-cvs-2-sort-with-isaac/src/sort.c        2005-11-26 
> 07:19:31.000000000 +0000
> @@ -38,6 +38,8 @@
>  #include "strnumcmp.h"
>  #include "xmemcoll.h"
>  #include "xstrtol.h"
> +#include "isaac.h"
> +#include "md5.h"

Alphabetize.

...
> @@ -1152,6 +1163,29 @@
>    return 0;
>  }
>
> +struct isaac_state *rand_state;

The above should be `static'.

> +#define HASH_WORDS 1
> +#define HASH_SIZE (HASH_WORDS * sizeof (uint32_t))
> +
> +static void
> +get_hash (char* text, int len, uint32_t resbuf[/*HASH_WORDS*/])

The `text' parameter should be `const'.
`len' should be of type size_t.

> +{
> +  struct isaac_state *s = alloca (sizeof (struct isaac_state));
> +  struct irand_state *r = alloca (sizeof (struct irand_state));

`r' is unused, so you can remove it.
Don't use alloca here.  Instead, just declare `s' like this

  struct isaac_state s;

and adjust the uses.

> +  memcpy (s, rand_state, sizeof (struct isaac_state));

Please do this instead:
  memcpy (s, rand_state, sizeof *s);

...
> +      else if (key->random_hash)
> +        {
> +          char diga[HASH_SIZE];
> +          char digb[HASH_SIZE];

If you declare these to be of type `uint32_t diga[HASH_WORDS]',
then you can remove the casts below:

> +          get_hash (texta, lena, (uint32_t*) diga);
> +          get_hash (textb, lenb, (uint32_t*) digb);
> +          diff = memcmp (diga, digb, sizeof (diga));
> +        }

...
> @@ -2109,6 +2155,8 @@
>    int c = 0;
>    bool checkonly = false;
>    bool mergeonly = false;
> +  char *randseed = 0;

randseed can be (and hence should be) `char const *'

...
> +        case SEED_OPTION:
> +          randseed = strdupa (optarg);

Using strdupa is not portable.
Use xstrdup instead.

> -  for (key = keylist; key; key = key->next)
...
> +  for (key = keylist; key; key = key->next) 

I wondered why the preceding lines appeared in the diff output.
Then saw that you'd added a trailing blank.
Please remove it.

> +    {
> +      need_rand |= key->random_hash;
> +      if (! (key->ignore || key->translate
> +             || (key->skipsblanks | key->reverse
> +                 | key->skipeblanks | key->month | key->numeric
> +                 | key->general_numeric
> +                 | key->random_hash)))
> +        {
> +          key->ignore = gkey.ignore;
> +          key->translate = gkey.translate;
> +          key->skipsblanks = gkey.skipsblanks;
> +          key->skipeblanks = gkey.skipeblanks;
> +          key->month = gkey.month;
> +          key->numeric = gkey.numeric;
> +          key->general_numeric = gkey.general_numeric;
> +          key->random_hash = gkey.random_hash;
> +          key->reverse = gkey.reverse;
> +        }
> +    }
> +
> +  if (need_rand) {

Put each opening curly brace on its own line.
Same for the others.

> +    rand_state = xmalloc (sizeof (struct isaac_state));
> +    memset (rand_state, 0, sizeof (struct isaac_state));

Use xzalloc (from lib/xalloc.h), instead of the two preceding lines:

    rand_state = xzalloc (sizeof *rand_state);

> +    if(randseed) {

Add a space after the `if'

> +      isaac_seed_start (rand_state);
> +      isaac_seed_data (rand_state, randseed, strlen (randseed));
> +      isaac_seed_finish (rand_state);
> +    } else

Don't cuddle the `else'.

> +      isaac_seed (rand_state);
> +  }

[Prev in Thread]

Current Thread

[Next in Thread]

sort --random-sort, Frederik Eaton, 2005/11/26
- Re: sort --random-sort, Frederik Eaton, 2005/11/26
  - Re: sort --random-sort, Jim Meyering <=
    - Re: sort --random-sort, Andreas Schwab, 2005/11/26
    - Re: sort --random-sort, Frederik Eaton, 2005/11/26
    - Re: sort --random-sort, Jim Meyering, 2005/11/29
    - Re: sort --random-sort, Paul Eggert, 2005/11/29
    - Re: sort --random-sort, Frederik Eaton, 2005/11/30
    - Re: sort --random-sort, Jim Meyering, 2005/11/30
    - Re: sort --random-sort, Frederik Eaton, 2005/11/30
    - Re: sort --random-sort, Paul Eggert, 2005/11/28
    - Re: sort --random-sort, Jim Meyering, 2005/11/28
    - Re: sort --random-sort, Frederik Eaton, 2005/11/28

Prev by Date: Re: bugs in dirname module - coreutils portion
Next by Date: Re: sort --random-sort
Previous by thread: Re: sort --random-sort
Next by thread: Re: sort --random-sort
Index(es):
- Date
- Thread