[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: new coreutil? shuffle - randomize file contents
From: |
Frederik Eaton |
Subject: |
Re: new coreutil? shuffle - randomize file contents |
Date: |
Sat, 16 Jul 2005 07:01:53 -0700 |
User-agent: |
Mutt/1.5.9i |
> Also, each external symbol (function, macro, variable) should have a comment
> explaining what it does. Currently I'm at a bit of a loss trying to
> figure out what things do, so my comments will be limited.
>
> +#ifndef _CHECKSUM_H
> +#define _CHECKSUM_H 1
> +
> +#include <sys/types.h>
> +#include <config.h>
> +#include "sha1.h"
> +#include "md5.h"
>
> This seems overkill to me. sort is just using md5, right?
I wanted an easy way to switch between them. Perhaps I should have
gotten rid of this macro stuff from md5sum.c as well, but at least
what I added lets someone else take that on in the future:
#define DIGEST_TYPE_STRING(Alg) ((Alg) == ALG_MD5 ? "MD5" : "SHA1")
#define DIGEST_STREAM(Alg) ((Alg) == ALG_MD5 ? md5_stream : sha_stream)
#define DIGEST_BITS(Alg) ((Alg) == ALG_MD5 ? 128 : 160)
#define DIGEST_HEX_BYTES(Alg) (DIGEST_BITS (Alg) / 4)
#define DIGEST_BIN_BYTES(Alg) (DIGEST_BITS (Alg) / 8)
> + int len = strlen(str);
>
> There should be a space before the "(". There are several other instances
> of that.
>
> + return (len*2) >= ops->bits;
>
> Please put spaces around the "*". The parentheses aren't needed here.
It surprises me that this seems so important to you guys, but
whatever.
> Also, we prefer <= to >=. This is a programming style rule that I
> learned from D. Val Schorre. It's an application of Leibnitz's
"Leibniz's"
> criterion for notation: textual order should reflect numeric order.
Interesting.
> + void *ctx = alloca(digops.ctx_size);
>
> What are the bounds on digops.ctx_size? If it can be large (more than
> a few kilobytes) then we shouldn't rely on alloca, due to stack-overflow
> detection issues.
It's only going to be a few bytes.
> + else if (key->random_hash)
> + {
> + int dig_bytes = digops.bits/8;
> + char diga[dig_bytes];
> + char digb[dig_bytes];
> + get_hash(texta, lena, diga);
> + get_hash(textb, lenb, digb);
> + diff = memcmp(diga, digb, sizeof(diga));
> + }
>
> It should be possible to combine -R with -b, -d, -f, -g, -i, -M, -n,
> and -r. For example, sort -nR should compute the same hash for 1.0
> and 01.0 that it computes for 1.00. Currently, though, the -R is
> silently ignored in this case. Conversely, sort -MR should compute
> the same hash for " Jan" that it does for "Jan"; but here, the "-M" is
> silently ignored.
I disagree that this is important.
> + -R, --random sort by random hash of keys\n\
>
> I would prefer the long option name --random-hash, since it's not a
> truly random sort.
What, in the sense that it pays attention to the keys, or that the
numbers are pseudorandom? I might call something which pays attention
to keys a random sort, and something which doesn't a random shuffle. I
don't like your suggestion because I don't think the implementation
should show up as part of the API.
If it's the pseudorandomness, I think mentioning that is redundant,
and the same thing I said about not wanting implementation in the API
applies - a good pseudorandom number generator should be externally
indistinguishable from an actual random number generator.
> There is a key problem here: one of documentation. doc/coreutils.texi
> and NEWS need to be modified to say exactly what's going on, and why
> sorting based on a random hash is not the same as generating a random
> permutation, so that people who are doing security-relevant stuff
> won't rely on something that they shouldn't.
Yes, I mentioned that I haven't written documentation yet. I will keep
in mind that these files need to be modified.
> Also, I can't easily tell from the code how many random bits are
> currently acquired from the environment, and whether there are enough
> bits to actually satisfy the claim that we are using a random hash.
It should be 128 bits for MD5 and 160 for SHA1. Do you think I need
more?
Frederik