coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[PATCH] shuf: use reservoir-sampling when possible


From: Assaf Gordon
Subject: [PATCH] shuf: use reservoir-sampling when possible
Date: Wed, 06 Mar 2013 18:50:30 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.4) Gecko/20120510 Icedove/10.0.4

Hello,

Attached is a suggestion to implement reservoir-sampling in shuf:
When the expected output of lines is known, it will not load the entire file 
into memory - allowing shuffling very large inputs.

I've seen this mentioned once:
 http://lists.gnu.org/archive/html/coreutils/2012-11/msg00079.html
but no follow-up discussion.

There is no change in the usage of shuf (barring unexpected bugs...).

Example (with debug messages):
===
  $ seq 10000 | ./src/shuf ---debug -n 5
  --reservoir_sampling--
  filling reservoir, input line 1 of 5: '1'
  filling reservoir, input line 2 of 5: '2'
  filling reservoir, input line 3 of 5: '3'
  filling reservoir, input line 4 of 5: '4'
  filling reservoir, input line 5 of 5: '5'
  Replacing reservoir sample 4 with line 7 '7'
  Replacing reservoir sample 4 with line 8 '8'
  Replacing reservoir sample 3 with line 9 '9'
  Replacing reservoir sample 2 with line 10 '10'
  Replacing reservoir sample 4 with line 11 '11'
  Replacing reservoir sample 4 with line 16 '16'
  Replacing reservoir sample 4 with line 17 '17'
  Replacing reservoir sample 4 with line 20 '20'
  Replacing reservoir sample 2 with line 22 '22'
  Replacing reservoir sample 0 with line 31 '31'
  Replacing reservoir sample 1 with line 52 '52'
  Replacing reservoir sample 4 with line 55 '55'
  Replacing reservoir sample 3 with line 61 '61'
  Replacing reservoir sample 4 with line 76 '76'
  Replacing reservoir sample 2 with line 169 '169'
  Replacing reservoir sample 2 with line 187 '187'
  Replacing reservoir sample 0 with line 216 '216'
  Replacing reservoir sample 1 with line 340 '340'
  Replacing reservoir sample 4 with line 431 '431'
  Replacing reservoir sample 1 with line 524 '524'
  Replacing reservoir sample 2 with line 942 '942'
  Replacing reservoir sample 1 with line 1096 '1096'
  Replacing reservoir sample 2 with line 1627 '1627'
  Replacing reservoir sample 4 with line 1763 '1763'
  Replacing reservoir sample 2 with line 2679 '2679'
  Replacing reservoir sample 3 with line 4382 '4382'
  Replacing reservoir sample 2 with line 4439 '4439'
  Replacing reservoir sample 3 with line 7748 '7748'
  Replacing reservoir sample 2 with line 9902 '9902'
  -- reservoir lines (begin)--
  216
  1096
  9902
  7748
  1763
  -- reservoir lines (end)--
  216
  1763
  7748
  1096
  9902
===

The last 5 lines are the final output (the rest is STDERR debug messages).
After the input is read completely, the lines are still re-permuted (using the 
existing shuf code), to accommodate cases like:

===
  $ seq 6 | ./src/shuf ---debug -n 5
  --reservoir_sampling--
  filling reservoir, input line 1 of 5: '1'
  filling reservoir, input line 2 of 5: '2'
  filling reservoir, input line 3 of 5: '3'
  filling reservoir, input line 4 of 5: '4'
  filling reservoir, input line 5 of 5: '5'
  Replacing reservoir sample 2 with line 6 '6'
  -- reservoir lines (begin)--
  1
  2
  6
  4
  5
  -- reservoir lines (end)--
  4
  2
  1
  6
  5
===


Comments are welcomed,
 -gordon

Attachment: 0001-shuf-use-reservoir-sampling-when-possible.patch
Description: Text Data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]