bug-coreutils

Re: new coreutil? shuffle - randomize file contents


From: Frederik Eaton
Subject: Re: new coreutil? shuffle - randomize file contents
Date: Sat, 4 Jun 2005 20:19:47 -0700
User-agent: Mutt/1.5.9i

On Fri, Jun 03, 2005 at 06:27:18AM +0000, Davis Houlton wrote:
> > One possibility for an efficient random permutation of a large file is
> > a program which scans the file once and records the location of each
> > line in an index, then applies a uniformly distributed random
> > permutation to this line index by stepping through it in order and
> > swapping the current entry with an entry chosen at random from the set
> > of later entries, and finally steps through the index again and
> > dereferences each line entry and prints out the corresponding line in
> > the original file.
> 
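To make that concrete, here is a rough in-memory sketch in C of what I was
describing (no error handling, plain rand() with its modulo bias, long
offsets; it only shows the shape of the algorithm, not a real implementation):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        FILE *fp;
        long *start = NULL;          /* byte offset of each line */
        size_t n = 0, cap = 0;
        long pos = 0;
        int c, at_bol = 1;

        if (argc < 2 || !(fp = fopen(argv[1], "rb")))
            return 1;
        srand((unsigned) time(NULL));  /* sketch only; a real tool needs a better seed */

        /* Pass 1: record where each line begins. */
        while ((c = getc(fp)) != EOF) {
            if (at_bol) {
                if (n == cap) {
                    cap = cap ? 2 * cap : 1024;
                    start = realloc(start, cap * sizeof *start);
                }
                start[n++] = pos;
            }
            at_bol = (c == '\n');
            pos++;
        }

        /* Shuffle the index: step through it in order, swapping the
           current entry with one chosen uniformly from itself and the
           later entries (Fisher-Yates).  rand() % k is slightly biased
           for large k; see the note further down. */
        for (size_t i = 0; i + 1 < n; i++) {
            size_t j = i + rand() % (n - i);
            long tmp = start[i]; start[i] = start[j]; start[j] = tmp;
        }

        /* Pass 2: dereference the shuffled index and print each line. */
        for (size_t i = 0; i < n; i++) {
            fseek(fp, start[i], SEEK_SET);
            while ((c = getc(fp)) != EOF && c != '\n')
                putchar(c);
            putchar('\n');
        }
        fclose(fp);
        return 0;
    }
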
> Interesting points Frederik! Glossing over some trivial special-case details 
> and taking a few syntax liberties, a 'back-of-napkin' style shuffle algorithm 
> supporting arbitrarily sized data sets might look like....
> 
> struct index: a structure that lets us fseek to and read a given segment 
> of data from the given file into memory
>     file   -- source file the line lives in
>     start  -- byte offset of the line in that file
>     len    -- length of the line in bytes
> 
> calc_lines(temp_file_num): Let's assume indexes are stored not in memory, but 
> in temp files 0...X, and each temp file holds at most INDEX_MAX index 
> structures.  This function, given a temp file counter, calculates the 
> number of indexes in all previous temp files. I'm using it here primarily 
> as a convention to simplify the pseudo-code below.

I don't see why you wouldn't just use one file.
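
With a single index file the bookkeeping collapses to one multiplication
(a sketch; `index_fp` is a hypothetical handle to that one file):

    /* byte position of index entry i in the single index file */
    off_t entry_pos = (off_t)i * sizeof(struct index);
    fseeko(index_fp, entry_pos, SEEK_SET);

and calc_lines() disappears entirely.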

> random(data_size): Combines as many calls to rand as needed to make sure our 
> random number generator can cover the entire data set.  
> 
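One portable way to do that without bias (a sketch, using only the 15
low-order bits the C standard guarantees from rand(); I'll call the
function random_upto here): accumulate bits into a 64-bit value, mask down
to the smallest power-of-two range covering the bound, and reject overshoots:

    #include <stdint.h>
    #include <stdlib.h>

    /* Uniform value in [0, n], built from rand().  Accumulates 15 bits
       per call (the C standard only guarantees RAND_MAX >= 32767),
       masks down to the smallest power-of-two range covering n, and
       rejects values above n so every result is equally likely. */
    static uint64_t random_upto(uint64_t n)
    {
        uint64_t mask = n;
        mask |= mask >> 1;  mask |= mask >> 2;  mask |= mask >> 4;
        mask |= mask >> 8;  mask |= mask >> 16; mask |= mask >> 32;

        uint64_t r;
        do {
            r = 0;
            for (int i = 0; i < 5; i++)   /* 5 * 15 = 75 bits; extras shift out */
                r = (r << 15) | ((uint64_t)rand() & 0x7fff);
            r &= mask;
        } while (r > n);
        return r;
    }
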
> then the shuffle routine:
> // input_size starts at total_lines - 1 (the position of the last index)
> while input_size >= 0
>    random_line = random(input_size)
> 
>    //Figure out which temp files have our data
>    input_size_file = input_size/INDEX_MAX
>    random_line_file = random_line/INDEX_MAX
> 
>    //Figure out how many index entries we need to skip forward within
>    //that file. For example, if we can store 10 indexes per temp file,
>    //and we want index 18, we'd skip over 8 entries in file 1 (18 - 10).
>    input_size_pos =
>        sizeof(struct index) *
>        (input_size - calc_lines(input_size_file))
>    random_line_pos =
>        sizeof(struct index) *
>        (random_line - calc_lines(random_line_file))
> 
>    //Grab the last index
>    open file input_size_file
>    seek forward input_size_pos
>    read in struct index last_line_index
>    close input_size_file
> 
>    //Grab the random line's index, and replace it
>    //with the last index
>    open file random_line_file
>    seek forward random_line_pos
>    read in struct index random_line_index
>    seek backward sizeof(struct index)
>    write last_line_index
>    close random_line_file
> 
>    //Write out the random line
>    open file random_line_index.file
>    seek forward random_line_index.start
>    print (read in random_line_index.len bytes)
>    close file random_line_index.file
> 
>    input_size--
> 
> The primary advantage above is that only two passes are made: one to create 
> the indexes, and one to randomize the output. Even for very large files, I 
> don't know what the end efficiency is (I suspect not much), but I guess 
> every little bit helps.

I didn't check the code but you're right that only two passes are
necessary. It's almost certainly a lot more efficient than 'sort
--random' would be.

> Obviously, there are some gross simplifications...for example, most numbers 
> would be composed of two longs; I'm assuming we'll have to deal with line 
> counts greater than LONG_MAX, so each line position would be composed of one 
> long for a page offset (for files larger than LONG_MAX bytes) and one for 
> the actual offset (within the current page). That may be overkill, and if it 
> is, using just longs would greatly simplify implementation.  Even then, we 
> can't handle infinitely large files--is that a true requirement?

See /usr/include/features.h. If you define _LARGEFILE_SOURCE then you
get functions for manipulating files with 64-bit sizes and offsets. 64
bits is definitely enough.
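
For example (glibc; the macros have to come before any #include, and
seek_line is just an illustrative name):

    #define _LARGEFILE_SOURCE        /* exposes fseeko() and ftello() */
    #define _FILE_OFFSET_BITS 64     /* makes off_t 64 bits even on 32-bit hosts */
    #include <stdio.h>
    #include <sys/types.h>

    /* One off_t now addresses any byte of the file, so a line position
       can be a single field instead of a page/offset pair. */
    static int seek_line(FILE *fp, off_t start)
    {
        return fseeko(fp, start, SEEK_SET);
    }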

Frederik

-- 
http://ofb.net/~frederik/



