
Re: Use R to manage results from GNU Parallel


From: Ole Tange
Subject: Re: Use R to manage results from GNU Parallel
Date: Mon, 6 Jan 2014 03:12:21 +0100

On Sun, Jan 5, 2014 at 9:43 PM, David Rosenberg <david.davidr@gmail.com> wrote:

> So here are two versions for reading into a data.frame.  The first one
> actually reads into a data.table, and uses a data.table approach.  You can
> convert to a data.frame with as.data.frame.  The second approach uses plyr
> and returns a data.frame.  The data.table approach is much faster than the
> plyr approach, at least for the data I was testing on. (And in my experience
> this is always the case.)

Both the data.table and plyr approaches require a library that is not
installed by default. That is strictly forbidden in the core of GNU
Parallel (since GNU Parallel is actively used on equipment and
software you would long since have scrapped), but since this is for
post-processing I tend to be somewhat more relaxed - especially if we
cannot find a default library that will work.

Is it possible to fall back automatically to, say, read.csv if
data.table or plyr is not installed?
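
Something like this, perhaps (an untested sketch; read_stdout is a
hypothetical helper name, not from David's code):

  # Sketch: prefer data.table's fread when the package is available,
  # otherwise fall back to base R's read.csv.
  read_stdout = function(text, ...) {
    if (suppressWarnings(require(data.table, quietly = TRUE))) {
      # Trailing "\n" makes fread treat the input as data, not a file name.
      as.data.frame(fread(paste0(text, "\n"), ...))
    } else {
      # textConnection() lets read.csv parse an in-memory string.
      read.csv(textConnection(text), ...)
    }
  }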

> Here's some trickiness that I've accounted for, and likely there are still
> subtle things I'm not handling correctly:
>
> 1) For the data.table approach, I use fread, rather than read.table, because
> it's much faster.  It tries to figure out whether you passed it a filename
> or a string to read from.  To ensure that it knows it's reading a string, I
> append a newline to the end of every stdout.

Good call: stdout could just contain a string with no newline.
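
For anyone following along, my understanding of the trick (stdout_text
is a placeholder for the captured output):

  # fread guesses its input type: a string containing a newline is
  # parsed as data, otherwise it is treated as a file name. Appending
  # "\n" forces the data interpretation even for one-line stdout.
  raw = data.table::fread(paste0(stdout_text, "\n"))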

Why:
  rownames(raw) = 1:nrow(raw)

Why not:
  rownames(raw) = NULL
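
As far as I can tell the two give the same visible result, and NULL
skips building the index vector:

  raw = data.frame(x = c(10, 20, 30))
  rownames(raw) = NULL           # resets to the default 1..nrow numbering
  # ... which prints the same as:
  rownames(raw) = 1:nrow(raw)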

> 2) When stdout is empty, I don't include any entries.  Another possibility
> would be to include NAs, but that would take a few more lines of code.

I am not sure what the correct R approach is. The UNIX approach would
be no entries. So only if there is an R tradition of returning NAs
should you consider changing that.
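
If the NA route were taken, I imagine it would look roughly like this
(untested sketch; the variable and column names are made up):

  # One all-NA placeholder entry for a job whose stdout was empty:
  if (nchar(out) == 0) {
    chunk = data.frame(V1 = NA)
  }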

> 5) One should be able to specify the separator character using the ...
> parameters, which are passed on to fread and read.table

Yep.

> 6) I still think the speediest solution would be to put all the data
> together outside of R, and then read it in with a single read.table from a
> pipe, or an fread from a temporary file.

One of the things that convinces me is reproducible
measurements/timings. I have too many times been tricked by common
wisdom that used to be true but no longer is (a recent example is
UUOC: http://oletange.blogspot.dk/2013/10/useless-use-of-cat.html).

To see how large the penalty is, we should look at (at least) the
following variables:

* Data in disk cache: yes/no
* Number of files - We should optimize for a situation with 100k-1m files.
* Number of subdirs - I would think most will have fewer than 1000
values per parameter
* Size of files - If we can deal with 1GB filesize I would say we are covered.

My gut feeling is that if the data is not in disk cache, then disk I/O
will be the limiting factor, but I would love to see numbers to
(dis)prove this.
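
A minimal harness for such numbers could look like this (a sketch; it
assumes whitespace-separated, headerless result files, and a GNU/Linux
box where the disk cache can be dropped between runs):

  # Sketch: many small reads vs. one concatenated read.
  # For the cold-cache case, run as root between timings:
  #   sync; echo 3 > /proc/sys/vm/drop_caches
  files = list.files("results", recursive = TRUE, full.names = TRUE)
  t.many = system.time(lapply(files, read.table))
  t.one  = system.time(read.table(pipe(paste(c("cat", shQuote(files)),
                                             collapse = " "))))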

I have commented the code and checked it in:

  git clone git://git.savannah.gnu.org/parallel.git

/Ole


