Hi, doing some
experiments in learning APL I was writing a word frequency
count program that takes in a document, identifies unique
words and then outputs the top 'N' occurring words.
The most
straightforward solution, to me, seems to be ∘.≡ which works
up to a certain dataset size. The main limiting statement in
my program is
wordcounts←+⌿ (wl ∘.≡
uniqwords)
.. which generates a
large boolean array which is then tallied up for each unique
word.
I seem to run into a
limit in GNU APL. I do not see an obvious ⎕SYL parameter to
increase the limit and could not find any obvious reference in
the docs either. What are the absolute max rows/columns of a
matrix, and can the limit be increased? Are they separate or a
combined limit?
5 wcOuterProd 'corpus/135-0-5000.txt' ⍝⍝ 5000-line
document
Time: 26419 ms
the of a and to
2646 1348 978 879 858
⍴wl
36564
⍴ uniqwords
5695
5 wcOuterProd 'corpus/135-0-7500.txt' ⍝⍝ 7500-line
document
WS FULL+
wcOuterProd[8] wordcounts←+⌿(wl∘.≡uniqwords)
^ ^
⍴ wl
58666
⍴ uniqwords
7711
I have an iterative
solution which doesn't use a boolean matrix to count the
words, rather looping through using pick/take and so can
handle much larger documents, but it takes roughy 2x the
execution time.
Relating to this, does
GNU APL optimize boolean arrays to minimize storage (ie.,
using larger bit vectors rather than entire ints per bool) and
is there any clever technique other experience APLers could
suggest to maintain the elegant 'loop-free' style of computing
but avoid generating such large bool matrices? I thought of
perhaps a hybrid approach where I iterate through portions of
the data and do partial ∘.≡ passes but of course that
complicates the algorithm.
[my 'outer product' and
'iterative' versions of the code are below]
Thanks,
-Russ
---
#!/usr/local/bin/apl
--script
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
⍝
⍝
⍝ wordcount.apl 2021-12-26 20:07:07
(GMT-8) ⍝
⍝
⍝
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
⍝ function edif has ufun1 pointer 0!
∇r ← timeMs; t
t ← ⎕TS
r ←
(((t[3]×86400)+(t[4]×3600)+(t[5]×60)+(t[6]))×1000)+t[7]
∇
∇r ← lowerAndStrip s;stripped;mixedCase
stripped ← '
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
mixedCase ← ⎕av[11],'
,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
r ← stripped[mixedCase ⍳ s]
∇
∇c ← h wcIterative fname
⍝⍝;D;WL;idx;len;word;wc;wcl;idx
⍝⍝ Return ⍒-sorted count of unique words in string vector
D, ignoring case and punctuation
⍝⍝ @param h(⍺) - how many top word counts to return
⍝⍝ @param D(⍵) - vector of words
⍝⍝⍝⍝
D ← lowerAndStrip (⎕fio['read_file'] fname) ⍝ raw text
with newlines
timeStart ← timeMs
D ← (~ D ∊ ' ') ⊂ D ⍝ make into a vector of words
WL ← ∪D
⍝⍝#DEBUG# ⎕ ← 'unique words:',WL
wcl ← 0⍴0
idx ← 1
len ← ⍴WL
count:
⍝⍝#DEBUG# ⎕ ← idx
→(idx>len)/done
word ← ⊂idx⊃WL
⍝⍝#DEBUG# ⎕ ← word
wc ← +/(word≡¨D)
wcl ← wcl,wc
⍝⍝#DEBUG# ⎕ ← wcl
idx ← 1+idx
→ count
done:
c ← h↑[2] (WL)[⍒wcl],[0.5]wcl[⍒wcl]
timeEnd ← timeMs
⎕ ← 'Time:',(timeEnd-timeStart),'ms'
∇
∇r ← n wcOuterProd fname
⍝⍝ ;D;wl;uniqwords;wordcounts;sortOrder
D ← lowerAndStrip (⎕fio['read_file'] fname) ⍝ raw text
with newlines
timeStart ← timeMs
wl ← (~ D ∊ ' ') ⊂ D
⍝⍝#DEBUG# ⎕ ← '⍴ wl:', ⍴ wl
uniqwords ← ∪wl
⍝⍝#DEBUG# ⎕ ← '⍴ uniqwords:', ⍴ uniqwords
wordcounts ← +⌿(wl ∘.≡ uniqwords)
sortOrder ← ⍒wordcounts
r ← n↑[2] uniqwords[sortOrder],[0.5]wordcounts[sortOrder]
timeEnd ← timeMs
⎕ ← 'Time:',(timeEnd-timeStart),'ms'
∇