[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: new coreutil? shuffle - randomize file contents
From: |
Frederik Eaton |
Subject: |
Re: new coreutil? shuffle - randomize file contents |
Date: |
Sat, 4 Jun 2005 20:19:47 -0700 |
User-agent: |
Mutt/1.5.9i |
On Fri, Jun 03, 2005 at 06:27:18AM +0000, Davis Houlton wrote:
> > One possibility for an efficient random permutation of a large file is
> > a program which scans the file once and records the location of each
> > line in an index, then applies a uniformly distributed random
> > permutation to this line index by stepping through it in order and
> > swapping the current entry with an entry chosen at random from the set
> > of later entries, and finally steps through the index again and
> > dereferences each line entry and prints out the corresponding line in
> > the original file.
>
> Interesting points Frederik! Glossing over some trivial special case details
> and taking a few syntax liberties, a 'back-of-napkin' style shuffle algorithm
> supporting N sized data sets might look like....
>
> struct index
> index is a structure that allows us to fseek and read a given segement of
> data
> from the given file into memory
> file
> start
> len
>
> calc_lines(temp_file_num): Lets assume indexes are stored not in memory, but
> in temp files 0...X, and each temp file has a maximum amount of index
> structures INDEX_MAX. This function, given a temp file counter, calculates
> the number of indexes in all previous temp files. I'm using this here
> primarily as a convention to simplify the psuedo-code below.
I don't see why you wouldn't just use one file.
> random(data_size): Makes data_size/RAND_MAX calls to rand to make sure our
> random number generator covers the entire data set.
>
> then the shuffle routine:
> while input_size>=0
> random_line=random(input_size)
>
> //Figure out which temp files have our data
> input_size_file=input_size/INDEX_MAX
> random_line_file=random_line/INDEX_MAX
>
> //Figure out how many actual indexes we need to skip forward
> //For example, if we can store 10 indexes per temp file, and we
> //want index 18, we'd only skip over 7 lines in file 1.
> input_size_pos =
> sizeof(struct index) *
> (input_size-calc_lines(input_size_file-1))
> random_line_pos=
> sizeof(struct index)*
> (random_line-calc_lines(random_line_file-1))
>
> //Grab last index
> open file input_size_file
> seek forward input_size_pos
> read in struct index last_line_index
> close input_size_file
>
> //Grab random line index, and replace
> //With last index
> open file random_line_file
> seek forward random_line_pos
> read in struct index random_line_index
> seek backward sizeof(struct index)
> write last_line_index
> close random_line_file
>
> //Write out random line
> open file index.source
> seek index.begin
> print (read in index.len bytes)
> close file index.source
>
> input_size--
>
> The primary advantage above is only two passes are made; one to create the
> indexes, and one to randomize the output. Even for very large files, I don't
> know what the end efficiency is (I suspect not much), but I guess every
> little bit helps.
I didn't check the code but you're right that only two passes are
necessary. It's almost certainly a lot more efficient than 'sort
--random' would be.
> Obviously, there are some gross simplifications...for example, most numbers
> would be composed of two longs; I'm assuming we'll have to deal with line
> counts > than LONG_MAX, so each line position would be composed of one long
> for a page offset (for files > LONG_MAX in data-set size) and one for the
> actual offset (when on the current page). That may be overkill, and if it is,
> using just longs would greatly simplify implementation. Even then, we can't
> handle infinitely large files--is that a true requirement?
See /usr/include/features.h. If you define _LARGEFILE_SOURCE then you
get functions for manipulating files with 64-bit sizes and offsets. 64
bits is definitely enough.
Frederik
--
http://ofb.net/~frederik/
- Re: new coreutil? shuffle - randomize file contents, (continued)
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents, Paul Eggert, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents, Paul Eggert, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents, Paul Eggert, 2005/06/04
- Re: new coreutil? shuffle - randomize file contents, David Feuer, 2005/06/04
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/04
- Re: new coreutil? shuffle - randomize file contents, James Youngman, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents, Davis Houlton, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents,
Frederik Eaton <=
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/05
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/05
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/06
- Re: new coreutil? shuffle - randomize file contents, Jim Meyering, 2005/06/07
Re: new coreutil? shuffle - randomize file contents, Jim Meyering, 2005/06/02
Re: new coreutil? shuffle - randomize file contents, David Feuer, 2005/06/02