help-octave
Re: Import large field-delimited file with strings and numbers


From: Ben Abbott
Subject: Re: Import large field-delimited file with strings and numbers
Date: Sat, 06 Sep 2014 14:04:35 -0400

On Sep 6, 2014, at 10:19 AM, João Rodrigues <address@hidden> wrote:

> 
> I need to import a large CSV file with multiple columns with mixed string and 
> number entries, such as:
> 
> field1, field2, field3, field4
> A,        a,        1,       1.0,
> B,        b,        2,        2.0,
> C,        c,        3,        3.0,
> 
> and I want to pass this on to something like
> 
> cell1 ={[1,1] = A; [2,1] = B; [3,1] = C};
> cell2 ={[1,1] = a; [2,1] = b; [3,1] = c};
> arr3 =[1 2 3]';
> arr4 =[1.0 2.0 3.0]';
> 
> furthermore, some columns can be ignored, the total number of entries is 
> known and there is a header.
> 
> How can I perform the import within a reasonable time and with little memory 
> overhead? Below are a few of my attempts.
> 
> Octave offers a wide range of functions to import files (csvread, dlmread, 
> textscan, textread, fscanf, fgetl) but as far as I can tell none seems to 
> get the job done.
> 
> csvread and dlmread don't work because they only handle numerical data.
> 
> textscan works but eats up all the memory (the file is 200 MB; textscan's 
> memory usage ran into the GBs). It doesn't allow the size of the object to 
> be specified a priori.
> 
> fid = fopen(fstr,"r");
> [tmp] = textscan(fid,'%s %s %d %f','delimiter', ',', 'headerlines', 1);
> fclose(fid);
> 
> fgetl allows the size of the object to be defined a priori but requires a loop:
> 
> v = cell(nrow,4);
> fid = fopen(fstr,"r");
> tmp = fgetl(fid);
> for irow = 1 : nrow
>    tmp = fgetl(fid);
>    v(irow,:) = strsplit(tmp,",");
> endfor
> fclose(fid);
> 
> Any suggestions? (I searched Google and the only suggestion I found was to use 
> fgetl, but this is too slow. It takes 30 s to read 1% of the full dataset.)
> 
> Thanks

Assuming your file is always as simple as your example ... you may get improved 
speed by just avoiding loops.

I copied your example to a file foo.txt and modified it to avoid commas at the 
end of each line.

field1, field2, field3, field4
A,        a,        1,       1.0
B,        b,        2,        2.0
C,        c,        3,        3.0

The code below avoids loops.

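% Read the whole file at once and split it into lines.
str = strsplit ((fileread ('foo.txt')), '\n');
str(end) = []; % strip empty string due to newline at EOF
% Split each line into fields on spaces and commas.
table = cellfun (@(str) strsplit (str, {' ',','}), str, 'uniformoutput', false);
% Stack the data rows (skipping the header) into an N-by-4 cell array.
table = vertcat (table(2:end){:});
cell1 = table(:,1);
cell2 = table(:,2);
% Convert the numeric columns; cell2mat needs all strings in a column to have equal length.
arr3 = str2num (cell2mat (table(:,3)));
arr4 = str2num (cell2mat (table(:,4)));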
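
The str2num/cell2mat conversion above relies on every string in a numeric column having the same length, which holds for the example but may not hold for real data (e.g. "2" next to "20"). The following is only a sketch of one possible variant, not part of the original reply: it assumes the same file layout, swaps in str2double (which accepts a cell array of strings directly), and drops every empty line rather than only the last one. The temporary file name and the sample values are made up for illustration.

```octave
% Hypothetical sample file, written to a temporary location for illustration.
fname = [tempname() '.txt'];
fid = fopen (fname, 'w');
fprintf (fid, 'field1, field2, field3, field4\n');
fprintf (fid, 'A,        a,        1,       1.0\n');
fprintf (fid, 'B,        b,        20,      2.5\n');
fprintf (fid, 'C,        c,        300,     3.0\n');
fclose (fid);

% Read the file in one go and split it into lines.
lines = strsplit (fileread (fname), "\n");
lines = lines(~cellfun (@isempty, lines));   % drop empty lines (e.g. after the final newline)

% Split each data line (header skipped) on spaces and commas.
rows = cellfun (@(s) strsplit (s, {' ', ','}), lines(2:end), 'uniformoutput', false);
tbl = vertcat (rows{:});                     % N-by-4 cell array of strings

cell1 = tbl(:,1);
cell2 = tbl(:,2);
% str2double converts each cell on its own, so the strings may differ in length.
arr3 = str2double (tbl(:,3));
arr4 = str2double (tbl(:,4));

delete (fname);                              % remove the temporary file
```

Columns that are not needed can simply be left unextracted from tbl. Note that this variant still holds the whole file in memory as strings, so it does not by itself address the memory overhead mentioned in the question.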

Ben



