coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] shuf: use reservoir-sampling when possible


From: Pádraig Brady
Subject: Re: [PATCH] shuf: use reservoir-sampling when possible
Date: Thu, 07 Mar 2013 01:24:55 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 03/06/2013 11:50 PM, Assaf Gordon wrote:
> Hello,
> 
> Attached is a suggestion to implement reservoir-sampling in shuf:
> When the expected output of lines is known, it will not load the entire file 
> into memory - allowing shuffling very large inputs.
> 
> I've seen this mentioned once:
>  http://lists.gnu.org/archive/html/coreutils/2012-11/msg00079.html
> but no follow-up discussion.

Very nice, thanks!

>>From b64d5063e26c0f3485d8342a2d5501f655f1063e Mon Sep 17 00:00:00 2001
> From: Assaf Gordon <address@hidden>
> Date: Wed, 6 Mar 2013 18:25:49 -0500
> Subject: [PATCH] shuf: use reservoir-sampling when possible
> 
> * src/shuf.c: Use reservoir-sampling when the number of output lines
> is known (by using '-n X' parameter).
> read_input_reservoir_sampling() - read lines from input file, and keep
> only K lines in memory, replacing lines with decreasing probability.
> prepare_shuf_lines() - convert reservoir lines to a usable structure.
> main() - if the number of lines is known, use reservoir-sampling
> instead of reading entire input file.

It might be worth adding a link to:
http://en.wikipedia.org/wiki/Reservoir_sampling

> ---
>  src/shuf.c |  171 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 files changed, 167 insertions(+), 4 deletions(-)
> 
> diff --git a/src/shuf.c b/src/shuf.c

>    {"zero-terminated", no_argument, NULL, 'z'},
> +  {"-debug", no_argument, NULL, DEV_DEBUG_OPTION},

no need to keep this, for final commit.

>  }
>  
> +/* Covnerts a 'struct linebuffer[]' into a shuf-compatible '**lines' buffer.

/converts/

> +
> +   The 'in_lines[]' is an array of n elements of 'struct linebuffer' .
> +
> +   'out_lines' will be initialized as so:
> +     *outlines        - points to an array of '*char'
> +     (*outlines)[0]   - points to the entire buffer (containing all lines).
> +     (*outlines)[1]   - points to the beginning of the second line
> +                        (inside the buffer of (*outlines)[0] ).
> +     (*outlines)[K-1] - points to the beginning of the last line
> +                        (inside the buffer of (*outlines)[0] ).
> +     (*outlines)[K]   - points to the position one byte past the end
> +                        of the last line.
> +
> +   All lines include the line-terminator character (either NUL or \n).
> +   Length of strings should be decuded by subtracting pointers, not with 
> strlen.
> +   (see write_permuted_output() for the expected usage).
> + */
> +static void
> +prepare_shuf_lines (struct linebuffer *in_lines, size_t n, char ***out_lines,
> +                    char eolbyte)
> +{
> +  size_t i;
> +  size_t size = 0;
> +  char* p;
> +  char* buf;
> +
> +  for (i = 0; i < n; ++i)
> +    size += in_lines[i].length;
> +
> +  p = buf = xmalloc (size);
> +  *out_lines = xnmalloc (n+1, sizeof (**out_lines));
> +  for (i = 0; i < n; ++i)
> +    {
> +        (*out_lines)[i] = p;
> +        memcpy (p, in_lines[i].buffer, in_lines[i].length);
> +        p += in_lines[i].length;
> +    }
> +  (*out_lines)[i] = p;

I've not looked into the details, but it would
be nice to avoid the memcpy/conversion here

> +static size_t
> +read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline, size_t 
> k,
> +                               struct randint_source *s)
> +{
> +  size_t i;
> +  size_t n_lines=0;
> +  struct linebuffer line;
> +  struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init 
> reservoir*/

Since this change is mainly about efficient mem usage we should probably handle
the case where we have small inputs but large k.  This will allocate (and zero)
memory up front. The zeroing will defeat any memory overcommit configured on the
system, but it's probably better to avoid the large initial commit and realloc
as required (not per line, but per 1K lines maybe).

> +
> +  devmsg ("--reservoir_sampling--\n");
> +
> +  initbuffer (&line);
> +  while (readlinebuffer_delim (&line, in, eolbyte)!=NULL)

...

> +
> +  return MIN (k, n_lines);
> +}
> +

> @@ -341,8 +485,16 @@ main (int argc, char **argv)
>  
>        fadvise (stdin, FADVISE_SEQUENTIAL);
>  
> -      n_lines = read_input (stdin, eolbyte, &input_lines);
> -      line = input_lines;
> +      if (head_lines != SIZE_MAX)
> +        {
> +          use_reservoir_sampling = true;
> +          n_lines = SIZE_MAX; /* unknown number of input lines, for now */
> +        }
> +      else
> +        {
> +          n_lines = read_input (stdin, eolbyte, &input_lines);
> +          line = input_lines;
> +        }
>      }
>  
>    head_lines = MIN (head_lines, n_lines);
> @@ -352,6 +504,17 @@ main (int argc, char **argv)
>    if (! randint_source)
>      error (EXIT_FAILURE, errno, "%s", quotearg_colon (random_source));
>  
> +  if (use_reservoir_sampling)
> +    {
> +      /* Instead of reading the entire file into 'line',
> +         use reservoir-sampling to store just "head_lines" random lines. */
> +      n_lines = read_input_reservoir_sampling (stdin, eolbyte,
> +                                               &input_lines, head_lines,
> +                                               randint_source);
> +      line = input_lines;
> +      head_lines = MIN (head_lines, n_lines);

Is this MIN redundant as done within read_input_reservoir_sampling()?

thanks,
Pádraig.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]