Re: [Bug-apl] Performance problems when constructing large(ish) arrays

From: Juergen Sauermann

Date: Wed, 18 Jan 2017 14:46:41 +0100

User-agent: Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

Hi,

As a start, I have added ⎕FIO[49] in SVN 851. It reads an entire UTF-8 encoded file and puts every line of the
file into one nested item of the result. Trailing CR and LF characters are removed in the process.

The next step is to turn ⎕FIO[49] into an operator so that you can give it an APL function that converts every line into the
desired result. Until then, you can use it like this:

Z←CONVERT¨Z←⎕FIO[49] 'filename'
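
A minimal sketch of what such a CONVERT function might look like. It assumes that each item returned by ⎕FIO[49] is a character vector, that the fields are separated by blanks, and that every line has two numeric fields followed by one string field. The pattern 'nns' is purely illustrative, and convert_entry is the helper from the quoted message below:

```apl
∇Z ← CONVERT line
 ⍝ Hypothetical per-line converter. Assumes blank-separated fields and the
 ⍝ illustrative pattern 'nns' (two numeric fields, then one string field).
 ⍝ (line≠' ')⊂line partitions the line at blanks, as in read_csv_n below.
 Z ← 'nns' convert_entry¨ (line≠' ') ⊂ line
∇
```

With this definition, Z holds one nested item per line. If every line has the same number of fields, ⊃Z should turn the result into a matrix.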

/// Jürgen

On 01/18/2017 11:17 AM, Elias Mårtenson wrote:

You've all made good points, and I changed the code slightly to provide the initial array size in order to avoid the recreation of the array on each iteration. This brought down the loading time to a much more bearable 14 seconds. I rewrote the Lisp code to be compatible with the APL code and the time was 1.46 seconds. This suggests that GNU APL is consistently about 10 times slower than non-optimised Lisp code. To me, this is not unexpected given the fact that GNU APL isn't designed to be high-performance.

However, while 14 seconds for 30k rows is manageable, I have had the need to work with arrays of over a million rows. Extrapolating this linearly suggests that it would take almost 8 minutes to load such a file. Thus, unless GNU APL can magically improve overall performance by at least 10 times, I still think we need a native CSV loading function.
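
The estimate can be checked with a one-line calculation, assuming the loading time scales linearly with the number of rows (an assumption, not something that was measured):

```apl
⍝ 14 seconds per 30000 rows, scaled to 1000000 rows, converted to minutes
(14 × 1000000 ÷ 30000) ÷ 60     ⍝ about 7.8
```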

Regards,

Elias

For reference, here is the APL code:

∇Z ← type convert_entry value
→('n'≡type)/numeric
→('s'≡type)/string
⎕ES 'Illegal conversion type'
numeric:
Z←⍎value
→end
string:
Z←value
end:
∇

∇Z ← pattern read_csv_n[n] filename ;fd;line;separator;i
separator ← ' '
Z ← n (↑⍴pattern) ⍴ 0
fd ← 'r' FIO∆fopen filename
i ← ⎕IO

next:
line ← FIO∆fgets fd           ⍝ Read one line from the file
→(⍬≡line)/end
→(10≠line[⍴line])/skip_nl     ⍝ If the line ends in a newline
line ← line[⍳¯1+⍴line]        ⍝ Remove the newline
skip_nl:
line ← ⎕UCS line
Z[i;] ← pattern convert_entry¨ (line≠separator) ⊂ line
i ← i+1
→next
end:

FIO∆fclose fd
∇

And here is the Lisp code (the test case was run on SBCL). It requires the QL packages SPLIT-SEQUENCE and PARSE-NUMBER:

(defparameter *result*
           (time
            (with-open-file (s "apjs492452t1_mrt.txt")
              (let ((res (make-array '(34030 11))))
                (dotimes (i (array-dimension res 0))
                  (let* ((line (read-line s))
                         (parts (split-sequence:split-sequence #\Space line :remove-empty-subseqs t)))
                    (loop
                      for ii from 0 below 10
                      for p in parts
                      do (setf (aref res i ii) (parse-number:parse-number p)))
                    (setf (aref res i 10) (nth 10 parts))))
                res))))

On 18 January 2017 at 09:57, Blake McBride <address@hidden> wrote:

On Tue, Jan 17, 2017 at 7:39 PM, Xiao-Yong Jin <address@hidden> wrote:

I have always felt that GNU APL is kind of slow compared to Dyalog, but I have never really compared the two on a large dataset.
I'm mostly using J now for large datasets.
If Elias has optimized code for GNU APL and a reproducible way to measure timing, I'd like to compare it with Dyalog and J.

I think that's actually a good idea. It would be a good comparison. It would really make it clear if there is a glaring problem. But first the APL code should be optimized a bit (but nothing crazy like reading it all into memory right now).

--blake