
Re: Use R to manage results from GNU Parallel


From: David Rosenberg
Subject: Re: Use R to manage results from GNU Parallel
Date: Mon, 6 Jan 2014 00:26:48 -0500

> Is it possible to do an automatic fall back onto, say, read.csv if
> data.table or plyr is not installed?

It could certainly be done...  There are the apply functions, and especially tapply. There's also "by", but it's pretty slow.  But I don't think any of these is quite a drop-in replacement.  
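Roughly, I imagine something like this (loadResults and `files' are just placeholder names for the sketch, not what's in the patch):

  loadResults <- function(files) {
    if (requireNamespace("data.table", quietly = TRUE)) {
      # Fast path: fread each file and stack the pieces.
      data.table::rbindlist(lapply(files, data.table::fread))
    } else {
      # Base-R fallback: slower, but needs no extra packages.
      do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
    }
  }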

Another approach, which would probably be fast, but wouldn't have the nice automatic column type casting and error protection and such, would be to read a single nonempty file to figure out how many columns there are, and then just split all the strings on all the separator characters and reform them into matrices by row.  Combine this with something like the approach used for the newline version to generate a final matrix.  I think something like that could work.  
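In rough outline, assuming tab-separated fields and that `files' holds the result file names (both assumptions for the sketch):

  lines  <- unlist(lapply(files, readLines))
  lines  <- lines[nzchar(lines)]                  # drop empty output
  ncols  <- length(strsplit(lines[1], "\t", fixed = TRUE)[[1]])
  fields <- unlist(strsplit(lines, "\t", fixed = TRUE))
  raw    <- matrix(fields, ncol = ncols, byrow = TRUE)

No type conversion and no protection against ragged rows, but it avoids a per-file read.table call.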

 
 
> Why:
>   rownames(raw) = 1:nrow(raw)
>
> Why not:
>   rownames(raw) = NULL

That line can be removed entirely.  It's a remnant from another version of the function.

 
>> 2) When stdout is empty, I don't include any entries.  Another possibility
>> would be to include NAs, but that would take a few more lines of code.
>
> I am not sure what the correct R approach is. The UNIX approach would
> be no entries. So only if there is an R tradition of returning NAs should
> you consider changing that.

Eh... if you want to represent missing data, you typically use NAs.  But I don't think it's necessary in this case.  It wouldn't be hard to put the NAs back in after the fact.
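For example, if the final table has a column with the job sequence number, something along these lines would do it (res, Seq, and all_seq are assumed names, just for illustration):

  missing <- setdiff(all_seq, res$Seq)
  if (length(missing) > 0) {
    na_rows <- res[rep(NA_integer_, length(missing)), ]  # all-NA rows, same columns
    na_rows$Seq <- missing
    res <- rbind(res, na_rows)
  }
  res <- res[order(res$Seq), ]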
 

> One of the things that convinces me is reproducible
> measurements/timings. I have too many times been tricked by common
> wisdom that used to be true but no longer is (a recent example is UUOC:
> http://oletange.blogspot.dk/2013/10/useless-use-of-cat.html).

Yes, I've run that experiment as well.  I've also yet to see any speedup from writing a big temporary file to "ram disk" rather than /tmp, and then reading it back in, even when the file is pretty big. So I agree that testing is the only way to go.  
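(By "testing" I mean nothing fancier than, say, the following, with the paths obviously just illustrative:

  system.time(a <- read.csv("/tmp/big_result.csv"))        # plain /tmp
  system.time(b <- read.csv("/dev/shm/big_result.csv"))    # same file on a ram disk

and repeating it a few times so the disk-cache effects are at least visible.)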
 
> My gut feeling is that if the data is not in disk cache, then disk I/O
> will be the limiting factor, but I would love to see numbers to
> (dis)prove this.

Maybe I'll have some spare time for this later in the week...  The initial 'out-of-R' approach I had in mind would start an awk program for every file.  awk starts pretty fast, but if this is scaling to a million files, that's probably not a great approach.  So the awk script gets more complicated...

 
> I have commented the code and checked it in:
>
>   git clone git://git.savannah.gnu.org/parallel.git

Great.


David
 


> /Ole
