Re: Use R to manage results from GNU Parallel


From: Ole Tange
Subject: Re: Use R to manage results from GNU Parallel
Date: Sun, 5 Jan 2014 16:38:08 +0100

On Sun, Jan 5, 2014 at 2:13 PM, David Rosenberg <david.davidr@gmail.com> wrote:
>> But I would appreciate help with:
>>
>>   load_parallel_results_split_on_newline(filenametable)

I have this working now. See below.

>>   load_parallel_results_split_to_columns(filenametable)
>
> I'm happy to write these, though I'm limited on time.  Could you write
> a generator for test data?

parallel --results my/results/dir --header : echo FOO={foo} \
  BAR={bar}';'seq {bar} :::: <(echo foo; seq 1000) <(echo bar; seq 10)
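
For testing without GNU Parallel at hand, a small tree of the same shape can be built directly in R. This is only a sketch: make_fake_results is a hypothetical helper, and it assumes the resdir/foo/<value>/bar/<value>/stdout layout that the loader code in this mail splits on "/".

```r
## Hypothetical helper (not part of the loader code): write a small tree
## shaped like resdir/foo/<foo>/bar/<bar>/{stdout,stderr}
make_fake_results <- function(resdir, foos=1:3, bars=1:2) {
  for (foo in foos) {
    for (bar in bars) {
      d <- file.path(resdir, "foo", as.character(foo),
                     "bar", as.character(bar))
      dir.create(d, recursive=TRUE, showWarnings=FALSE)
      ## Same text as: echo FOO={foo} BAR={bar}; seq {bar}
      writeLines(c(paste0("FOO=", foo, " BAR=", bar),
                   as.character(seq_len(bar))),
                 file.path(d, "stdout"))
      ## The command prints nothing on stderr, so that file stays empty
      file.create(file.path(d, "stderr"))
    }
  }
  invisible(resdir)
}

resdir <- make_fake_results(tempfile())
files <- list.files(resdir, recursive=TRUE)
```

The defaults give 3 x 2 jobs, which is enough to exercise the loaders quickly before running them on the full 1000 x 10 grid.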

> R has limited options for reading data with a non-newline record separator
> character. My first approach here would be to pipe the data through  tr or
> sed to swap the desired record separator character with "\n", so that we can
> read things into R with the usual commands.  I'm assuming we're on a POSIX
> system, or something where we can do that.  Otherwise, I think we'd have to
> read each file as a giant string (as you're doing for 'raw'), and then parse
> things ourselves, which I'd suspect would be much slower.

I do not like the idea of shelling out simply to read a file. If we
are talking tons of small files then spawning a shell will slow it
down tremendously.

I read that anything you can do on a connection (i.e. R's filehandle)
you can also do on a string using textConnection. So I would suggest
we make an efficient raw reader and use that, then use
a=gsub(newlinesep,"\n",a,fixed=TRUE) to replace every custom
newline/tab separator, and finally use R's builtin reader on a
textConnection.
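
A minimal sketch of that idea, assuming a single-character record separator and column separator. The name read_with_separators and its arguments are made up for illustration, and it assumes the data itself contains no "\n".

```r
## Hypothetical helper: read one file whose records are separated by
## `recsep` (instead of "\n") and whose columns are separated by `colsep`
read_with_separators <- function(filename, recsep, colsep) {
  ## Read the whole file as one string
  a <- readChar(filename, file.info(filename)$size)
  ## fixed=TRUE treats the separator literally, not as a regexp
  a <- gsub(recsep, "\n", a, fixed=TRUE)
  con <- textConnection(a)
  on.exit(close(con))
  ## R's builtin reader works on the string via the connection
  read.table(con, sep=colsep, header=FALSE, stringsAsFactors=FALSE)
}

## Example: records separated by ";" and columns by ","
f <- tempfile()
cat("1,a;2,b;3,c", file=f)
d <- read_with_separators(f, ";", ",")
```

No shell is spawned here, so the cost per file is one readChar plus one gsub.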

> BTW, for 'raw', it might be worth comparing the performance of using
> readLines, followed by collapsing the newlines, to the following approach:
>
> readChar(fileName, file.info(fileName)$size)

Good call. It is way easier to read, so even if the performance is the
same I would still use it.
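
For anyone who wants to compare the two, here is a rough sketch on a throwaway file. It assumes a POSIX system and plain ASCII content; the timings will vary with machine and file size, so no numbers are claimed.

```r
f <- tempfile()
cat("line1\nline2\n", file=f)

## Variant 1: readLines, then glue the lines back together.
## Note that this drops the trailing newline.
via_lines <- paste(readLines(f), collapse="\n")

## Variant 2: one call that reads the whole file as a single string
via_char <- readChar(f, file.info(f)$size)

## Rough timing of both variants
t_lines <- system.time(for (i in 1:1000) paste(readLines(f), collapse="\n"))
t_char  <- system.time(for (i in 1:1000) readChar(f, file.info(f)$size))
```

The two results differ only in the trailing newline, which readChar keeps.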


/Ole

load_parallel_results_filenames <- function(resdir) {
  ## Find files called .../stdout
  stdoutnames <- list.files(path=resdir, pattern="^stdout$", recursive=TRUE);
  ## Find files called .../stderr
  stderrnames <- list.files(path=resdir, pattern="^stderr$", recursive=TRUE);
  if(length(stdoutnames) == 0) {
    ## Return empty data frame if no files found
    return(data.frame());
  }
  ## One row per job; columns alternate header/value and end in "stdout"
  m <- matrix(unlist(strsplit(stdoutnames, "/")),
              nrow=length(stdoutnames), byrow=TRUE);
  ## Keep the even columns (the values); drop=FALSE keeps it a matrix
  filenametable <- as.table(m[,c(FALSE,TRUE),drop=FALSE]);
  ## Append the stdout and stderr filenames
  filenametable <- cbind(filenametable,
                         paste(resdir,unlist(stdoutnames),sep="/"),
                         paste(resdir,unlist(stderrnames),sep="/"));
  ## Odd path components are the headers, ending in "stdout"
  colnames(filenametable) <-
    c(strsplit(stdoutnames[1],"/")[[1]][c(TRUE,FALSE)],"stderr");
  return(filenametable);
}
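
The header/value split above relies on R recycling a short logical index. A small illustration on a single path, assuming the foo/<value>/bar/<value>/stdout layout produced by the test command:

```r
## One relative path as returned by list.files(..., recursive=TRUE)
path <- "foo/1/bar/2/stdout"
parts <- strsplit(path, "/")[[1]]

## Logical indexes are recycled along the vector, so c(TRUE,FALSE)
## picks elements 1,3,5 and c(FALSE,TRUE) picks elements 2,4
header <- parts[c(TRUE, FALSE)]
values <- parts[c(FALSE, TRUE)]
```

Here header holds the column names (the last one being "stdout") and values holds the argument values for that job.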

load_parallel_results_raw <- function(filenametable) {
  ## Read the files given in column stdout
  stdoutcontents <-
    lapply(filenametable[,c("stdout")],
           function(filename) {
             return(readChar(filename, file.info(filename)$size));
           } );
  ## Read the files given in column stderr
  stderrcontents <-
    lapply(filenametable[,c("stderr")],
           function(filename) {
             return(readChar(filename, file.info(filename)$size));
           } );
  ## Replace filenames with file contents
  filenametable[,c("stdout","stderr")] <-
    c(as.character(stdoutcontents),as.character(stderrcontents));
  return(filenametable);
}

load_parallel_results_split_on_newline <- function(filenametable) {
  raw <- load_parallel_results_raw(filenametable);

  ## The argument columns are all columns except the last two (stdout, stderr)
  arg_indexes <- 1:(dim(raw)[2]-2);
  return(t(as.data.frame(row.names=c(""),
    apply(raw, 1, function(row) {
      return(sapply(unlist(strsplit(row[c("stdout")], "\n")),
                    function(line) {
                      return(c(row[arg_indexes], line));
                    }
                    ));
    })
    )));
}
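
To make the intended result concrete, here is an alternative sketch of the same split on a hand-made matrix. It is not the function above: the raw matrix is a stand-in for what load_parallel_results_raw returns, and the sketch assumes the last two columns are stdout and stderr.

```r
## Hand-made stand-in for the matrix returned by load_parallel_results_raw
raw <- cbind(foo=c("1", "2"), bar=c("1", "2"),
             stdout=c("a\n", "b\nc\n"), stderr=c("", ""))

## Everything except the last two columns is an argument column
nargs <- ncol(raw) - 2

## One output row per line of stdout, with the argument values repeated
rows <- lapply(seq_len(nrow(raw)), function(i) {
  lines <- strsplit(raw[i, "stdout"], "\n")[[1]]
  cbind(raw[rep(i, length(lines)), seq_len(nargs), drop=FALSE],
        stdout=lines)
})
result <- do.call(rbind, rows)
```

The second job printed two lines, so its argument values appear twice in result, once per line.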


