
Re: [Bug-apl] Spell corrector - APL


From: Juergen Sauermann
Subject: Re: [Bug-apl] Spell corrector - APL
Date: Mon, 12 Sep 2016 12:10:43 +0200
User-agent: Mozilla/5.0 (X11; Linux i686; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

Hi Ala'a,

you can use ⎕FIO ¯1 to find out where the time is spent, e.g.:

T←⎕FIO ¯1
file ← 'test.txt'
'T1:' ((T←⎕FIO ¯1)-T)
⎕ ← ⍴w ← words ftxt file
'T2:' ((T←⎕FIO ¯1)-T)
⎕ ← ⍴u ← ∪w
'T3:' ((T←⎕FIO ¯1)-T)
desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
'T4:' ((T←⎕FIO ¯1)-T)

Your downcase function fails on my machine:

      ⎕ ← ⍴w ← words ftxt file
INDEX ERROR+
λ1[1]  λ←(a,⎕AV)[(A,⎕AV)⍳⍵]
         ^      ^

      )MORE
⎕IO=1 offending index=282 max index=282


probably due to a character in my test file that is not contained in ⎕AV.
You should use ⎕UCS instead of ⎕AV to avoid that:

      downcase←{ ⎕UCS (32×(T≥65)∧T≤90)+T←⎕UCS ⍵ }
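A quick sanity check (note that the assignment must be T←⎕UCS ⍵ so that T
is defined; this version folds only ASCII A-Z, and any other character,
including ones outside ⎕AV, passes through unchanged):

      downcase 'Hello, APL World!'
hello, apl world!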

/// Jürgen


On 09/11/2016 08:23 PM, Ala'a Mohammad wrote:
Just an update for reference: I'm now able to parse the big.txt file
(without a WS FULL or a killed process), but it takes around 2 hours and
20 minutes (±10 minutes) for roughly 1M words, about 30K of them unique.
The process reaches 1 GiB (after parsing the words) and peaks about
100 MiB above that during the sequential 'Each', so a maximum of about
1.1 GiB.

The only change is scanning each unique word against the whole word vector.
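One possible way to speed that scan up (an untested sketch, following the
same shape as the earlier hist suggested by Kacper): map each word to its
index in the unique list with ⍳, so the outer product compares small
integers with ∘.= instead of nested strings with ∘.≡:

hist ← { (⍪∪⍵), +⌿((∪⍵)⍳⍵) ∘.= ⍳⍴∪⍵ }

The boolean matrix is the same size, so memory stays on the order of
unique-length × words-length, but the numeric comparison avoids
re-matching every character of every word. (∪⍵ is recomputed here for
brevity; computing it once outside the expression would be cheaper.)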

Below is the code with a sample timed run.

Regards,

Ala'a

⍝ fhist.apl
a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
cr ← ⎕UCS 13 ◊ nl ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
alphamask ← { ~ ⍵ ∊ nonalpha }
words ← { (alphamask ⍵) ⊂ downcase ⍵ }
desc ← {⍵[⍒⍵[;2];]}
ftxt ← { ⎕FIO[26] ⍵ }

file ← '/misc/big.txt' ⍝ ~ 6.2M
⎕ ← ⍴w ← words ftxt file
⎕ ← ⍴u ← ∪w
desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
)OFF

: time apl -s -f fhist.apl
1098281
30377
 the            80003
 of             40025
 to             28760
 in             22048
 for             6936
 by              6736
 be              6154
 or              5349
 all             4141
 this            4058
 are             3627
 other           1488
 before          1363
 should          1297
 over            1282
 your            1276
 any             1204
 our             1065
 holmes           450
 country          417
 world            355
 project          286
 gutenberg        262
 laws             233
 sir              176
 series           128
 sure             123
 sherlock         101
 ebook             85
 copyright         69
 changing          44
 check             38
 arthur            30
 adventures        17
 redistributing     7
 header             7
 doyle              5
 downloading        5
 conan              4

apl -s -f fhist.apl  8901.96s user 5.78s system 99% cpu 2:28:38.61 total

On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <address@hidden> wrote:
Thanks to all for the input,

Replacing the Find and Each-OR combination with Match helped; I'm now
parsing a 159K (~1545-line) text file (a sample chunk from big.txt).

The strange thing I'm trying to understand is that the APL process (when
fed the 159K text file) keeps allocating memory until it reaches 2.7 GiB,
then settles down to 50 MiB after printing the result. Why does it need
2.7 GiB? Is there a memory utility (e.g. a garbage-collection facility)
that could be used to mitigate this?

Here is the updated code:

a ← 'abcdefghijklmnopqrstuvwxyz'
A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
cr ← ⎕UCS 13 ◊ nl ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
alphamask ← { ~ ⍵ ∊ nonalpha }
words ← { (alphamask ⍵) ⊂ downcase ⍵ }
hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
desc ← {⍵[⍒⍵[;2];]}
ftxt ← { ⎕FIO[26] ⍵ }
fhist ← { hist words ftxt ⍵ }

file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
⎕ ← ⍴w ← words ftxt file
⎕ ← ⍴u ← ∪w
desc 39 2 ⍴ fhist file

And here is a sample run
: apl -s -f fhist.apl
30186
4155
 the            1560
 to              804
 of              781
 in              493
 for             219
 be              173
 holmes          164
 your            132
 this            114
 all              99
 by               97
 are              97
 or               73
 other            56
 over             51
 our              48
 should           47
 before           43
 sherlock         39
 any              35
 sir              26
 sure             13
 country           9
 project           6
 gutenberg         6
 ebook             5
 adventures        5
 world             5
 arthur            4
 conan             4
 doyle             4
 series            2
 copyright         2
 laws              2
 check             2
 header            2
 changing          1
 downloading       1
 redistributing    1

The sample input file is also attached.

Regards,

On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <address@hidden> wrote:
On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
the errors happened inside the 'hist' function, and I presume mostly due
to the jot-dot find (if I understand correctly, it operates on a matrix
of size unique-length × words-length)
Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.

-k


