bug-apl
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bug-apl] Performance problems when constructing large(ish) arrays


From: Nick Lobachevsky
Subject: Re: [Bug-apl] Performance problems when constructing large(ish) arrays
Date: Tue, 17 Jan 2017 19:33:38 +0100

Elias,

In just about any language, iterative string catenation is a real
performance killer owing to the repeated growing memory allocation
which just gets bigger with every iteration.

I would restructure the algorithm to work more like this:  (sorry, on
a Windoze box and no APL, so metacode only)

Open the file, read the contents of the entire file, close the file
Change NL to separator, then partition the whole thing (Z)
Check to see that the length of the partitioned string divides evenly
by pattern length
Create a temporary of only the numeric items.  As they should all be
scalar, you should be able to do it in a single execute and get a
numeric vector of the correct length.
Assign the numeric vector back into Z.  You may want to do a ravel
each if you would rather have one element vectors
Reshape Z to have the same number columns as in pattern

Dyalog and some other APLs (APLX?) have a []CSV function.  The problem
gets a bit harder when you have optional quotes around strings,
embedded separators, and have to figure out yourself whether something
is numeric or char.



On 1/17/17, Elias Mårtenson <address@hidden> wrote:
> I wanted to use GNU APL to work on a dataset of star data. The file
> consists of 34030 lines of the following form:
>
>   892376 3813 4.47 0.4699  1.532  0.007    7306.69 0.823 0.4503 0 ---
>  1026146 4261 4.57 0.6472 14.891  0.12    11742.56 1.405 0.7229 0 ---
>  1026474 4122 4.56 0.5914  1.569  0.006   30471.8  1.204 0.6061 0 ---
>  1162635 3760 4.77 0.4497 15.678  0.019   10207.47 0.978 0.5445 1 ---
>
> I wrote a generic CSV loader to handle this (source code at the end of this
> email), and loaded the data like so:
>
> *    z ← 'nnnnnnnnnns' read_csv 'apjs492452t1_mrt.txt'*
>
> This took many minutes to load, which in my opinion shouldn't happen.
>
> Now, I have a few questions:
>
>
>    1. Is there a way to speed up this code?
>    2. Is there something that could be done on the GNU APL implementation
>    side to make this faster?
>    3. Shouldn't we have a generic ⎕CSV function or something like that
>    which would be able to load CSV files in milliseconds regardless of
> size?
>    This should be trivial to do in C++.
>
> Here's the code in question:
>
> ∇Z ← type convert_entry value
>   →('n'≡type)/numeric
>   →('s'≡type)/string
>   ⎕ES 'Illegal conversion type'
> numeric:
>   Z←⍎value
>   →end
> string:
>   Z←value
> end:
>
>
> ∇Z ← pattern read_csv filename ;fd;line;separator
>   separator ← ' '
>   Z ← 0 (↑⍴pattern) ⍴ ⍬
>   fd ← 'r' FIO∆fopen filename
>
> next:
>   line ← FIO∆fgets fd           ⍝ Read one line from the file
>   →(⍬≡line)/end
>   →(10≠line[⍴line])/skip_nl     ⍝ If the line ends in a newline
>   line ← line[⍳¯1+⍴line]        ⍝ Remove the newline
> skip_nl:
>   line ← ⎕UCS line
>   Z ← Z⍪ pattern convert_entry¨ (line≠separator) ⊂ line
>   →next
> end:
>
>   FIO∆fclose fd
>
>



reply via email to

[Prev in Thread] Current Thread [Next in Thread]