Hi, doing
some experiments in learning APL I was writing a
word frequency count program that takes in a
document, identifies unique words and then outputs
the top 'N' occurring words.
The most
straightforward solution, to me, seems to be ∘.≡
which works up to a certain dataset size. The main
limiting statement in my program is
wordcounts←+⌿
(wl ∘.≡ uniqwords)
.. which
generates a large boolean array which is then
tallied up for each unique word.
I seem to
run into a limit in GNU APL. I do not see an obvious
⎕SYL parameter to increase the limit and could not
find any obvious reference in the docs either. What
are the absolute max rows/columns of a matrix, and
can the limit be increased? Are they separate or a
combined limit?
5 wcOuterProd
'corpus/135-0-5000.txt' ⍝⍝ 5000-line document
Time: 26419 ms
the of a and to
2646 1348 978 879 858
⍴wl
36564
⍴ uniqwords
5695
5 wcOuterProd
'corpus/135-0-7500.txt' ⍝⍝ 7500-line document
WS FULL+
wcOuterProd[8] wordcounts←+⌿(wl∘.≡uniqwords)
^ ^
⍴ wl
58666
⍴ uniqwords
7711
I have an
iterative solution which doesn't use a boolean
matrix to count the words, rather looping through
using pick/take and so can handle much larger
documents, but it takes roughy 2x the execution
time.
Relating
to this, does GNU APL optimize boolean arrays to
minimize storage (ie., using larger bit vectors
rather than entire ints per bool) and is there any
clever technique other experience APLers could
suggest to maintain the elegant 'loop-free' style of
computing but avoid generating such large bool
matrices? I thought of perhaps a hybrid approach
where I iterate through portions of the data and do
partial ∘.≡ passes but of course that complicates
the algorithm.
[my 'outer
product' and 'iterative' versions of the code are
below]
Thanks,
-Russ
---
#!/usr/local/bin/apl
--script
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
⍝
⍝
⍝ wordcount.apl 2021-12-26
20:07:07 (GMT-8) ⍝
⍝
⍝
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
⍝ function edif has ufun1 pointer 0!
∇r ← timeMs; t
t ← ⎕TS
r ←
(((t[3]×86400)+(t[4]×3600)+(t[5]×60)+(t[6]))×1000)+t[7]
∇
∇r ← lowerAndStrip s;stripped;mixedCase
stripped ← '
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
mixedCase ← ⎕av[11],'
,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
r ← stripped[mixedCase ⍳ s]
∇
∇c ← h wcIterative fname
⍝⍝;D;WL;idx;len;word;wc;wcl;idx
⍝⍝ Return ⍒-sorted count of unique words in
string vector D, ignoring case and punctuation
⍝⍝ @param h(⍺) - how many top word counts to
return
⍝⍝ @param D(⍵) - vector of words
⍝⍝⍝⍝
D ← lowerAndStrip (⎕fio['read_file'] fname) ⍝
raw text with newlines
timeStart ← timeMs
D ← (~ D ∊ ' ') ⊂ D ⍝ make into a vector of
words
WL ← ∪D
⍝⍝#DEBUG# ⎕ ← 'unique words:',WL
wcl ← 0⍴0
idx ← 1
len ← ⍴WL
count:
⍝⍝#DEBUG# ⎕ ← idx
→(idx>len)/done
word ← ⊂idx⊃WL
⍝⍝#DEBUG# ⎕ ← word
wc ← +/(word≡¨D)
wcl ← wcl,wc
⍝⍝#DEBUG# ⎕ ← wcl
idx ← 1+idx
→ count
done:
c ← h↑[2] (WL)[⍒wcl],[0.5]wcl[⍒wcl]
timeEnd ← timeMs
⎕ ← 'Time:',(timeEnd-timeStart),'ms'
∇
∇r ← n wcOuterProd fname
⍝⍝ ;D;wl;uniqwords;wordcounts;sortOrder
D ← lowerAndStrip (⎕fio['read_file'] fname) ⍝
raw text with newlines
timeStart ← timeMs
wl ← (~ D ∊ ' ') ⊂ D
⍝⍝#DEBUG# ⎕ ← '⍴ wl:', ⍴ wl
uniqwords ← ∪wl
⍝⍝#DEBUG# ⎕ ← '⍴ uniqwords:', ⍴ uniqwords
wordcounts ← +⌿(wl ∘.≡ uniqwords)
sortOrder ← ⍒wordcounts
r ← n↑[2]
uniqwords[sortOrder],[0.5]wordcounts[sortOrder]
timeEnd ← timeMs
⎕ ← 'Time:',(timeEnd-timeStart),'ms'
∇