Re: new coreutil? shuffle - randomize file contents

bug-coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: new coreutil? shuffle - randomize file contents

From:	Frederik Eaton
Subject:	Re: new coreutil? shuffle - randomize file contents
Date:	Sat, 16 Jul 2005 07:01:53 -0700
User-agent:	Mutt/1.5.9i

> Also, each external symbol (function, macro, variable) should have a comment
> explaining what it does.  Currently I'm at a bit of a loss trying to
> figure out what things do, so my comments will be limited.
> 
> +#ifndef _CHECKSUM_H
> +#define _CHECKSUM_H 1
> +
> +#include <sys/types.h>
> +#include <config.h>
> +#include "sha1.h"
> +#include "md5.h"
> 
> This seems overkill to me.  sort is just using md5, right?

I wanted an easy way to switch between them. Perhaps I should have
gotten rid of this macro stuff from md5sum.c as well, but at least
what I added lets someone else take that on in the future:

#define DIGEST_TYPE_STRING(Alg) ((Alg) == ALG_MD5 ? "MD5" : "SHA1")
#define DIGEST_STREAM(Alg) ((Alg) == ALG_MD5 ? md5_stream : sha_stream)

#define DIGEST_BITS(Alg) ((Alg) == ALG_MD5 ? 128 : 160)
#define DIGEST_HEX_BYTES(Alg) (DIGEST_BITS (Alg) / 4)
#define DIGEST_BIN_BYTES(Alg) (DIGEST_BITS (Alg) / 8)

> +  int len = strlen(str);
> 
> There should be a space before the "(".  There are several other instances
> of that.
>
> +  return (len*2) >= ops->bits;
> 
> Please put spaces around the "*".  The parentheses aren't needed here.

It surprises me that this seems so important to you guys, but
whatever.

> Also, we prefer <= to >=.  This is a programming style rule that I
> learned from D. Val Schorre.  It's an application of Leibnitz's
"Leibniz's"
> criterion for notation: textual order should reflect numeric order.

Interesting. 

> +  void *ctx = alloca(digops.ctx_size);
> 
> What are the bounds on digops.ctx_size?  If it can be large (more than
> a few kilobytes) then we shouldn't rely on alloca, due to stack-overflow
> detection issues.

It's only going to be a few bytes.

> +      else if (key->random_hash)
> +     {
> +       int dig_bytes = digops.bits/8;
> +       char diga[dig_bytes];
> +       char digb[dig_bytes];
> +       get_hash(texta, lena, diga);
> +       get_hash(textb, lenb, digb);
> +       diff = memcmp(diga, digb, sizeof(diga));
> +     }
> 
> It should be possible to combine -R with -b, -d, -f, -g, -i, -M, -n,
> and -r.  For example, sort -nR should compute the same hash for 1.0
> and 01.0 that it computes for 1.00.  Currently, though, the -R is
> silently ignored in this case.  Conversely, sort -MR should compute
> the same hash for " Jan" that it does for "Jan"; but here, the "-M" is
> silently ignored.

I disagree that this is important.

> +  -R, --random                sort by random hash of keys\n\
> 
> I would prefer the long option name --random-hash, since it's not a
> truly random sort.

What, in the sense that it pays attention to the keys, or that the
numbers are pseudorandom? I might call something which pays attention
to keys a random sort, and something which doesn't a random shuffle. I
don't like your suggestion because I don't think the implementation
should show up as part of the API.

If it's the pseudorandomness, I think mentioning that is redundant,
and the same thing I said about not wanting implementation in the API
applies - a good pseudorandom number generator should be externally
indistinguishable from an actual random number generator.

> There is a key problem here: one of documentation.  doc/coreutils.texi
> and NEWS need to be modified to say exactly what's going on, and why
> sorting based on a random hash is not the same as generating a random
> permutation, so that people who are doing security-relevant stuff
> won't rely on something that they shouldn't.

Yes, I mentioned that I haven't written documentation yet. I will keep
in mind that these files need to be modified.

> Also, I can't easily tell from the code how many random bits are
> currently acquired from the environment, and whether there are enough
> bits to actually satisfy the claim that we are using a random hash.

It should be 128 bits for MD5 and 160 for SHA1. Do you think I need
more?

Frederik

[Prev in Thread]

Current Thread

[Next in Thread]

Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/07/14
- Re: new coreutil? shuffle - randomize file contents, Jim Meyering, 2005/07/15
  - Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/07/15
    - Re: new coreutil? shuffle - randomize file contents, Paul Eggert, 2005/07/15
    - autoreconf vs bootstrap script (was: new coreutil? shuffle - randomize file contents), Bob Proulx, 2005/07/15
    - Re: autoreconf vs bootstrap script, Paul Eggert, 2005/07/16
    - Re: new coreutil? shuffle - randomize file contents, Frederik Eaton <=
    - Re: new coreutil? shuffle - randomize file contents, David Feuer, 2005/07/16
- Re: new coreutil? shuffle - randomize file contents, Eric Blake, 2005/07/15
  - Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/07/15

Prev by Date: Re: more gcc warnings
Next by Date: cp command interactive mode priority
Previous by thread: Re: autoreconf vs bootstrap script
Next by thread: Re: new coreutil? shuffle - randomize file contents
Index(es):
- Date
- Thread