bug-apl
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Absolute limits of rank 2 bool matrix size in GNU APL?


From: Dr . Jürgen Sauermann
Subject: Re: Absolute limits of rank 2 bool matrix size in GNU APL?
Date: Tue, 28 Dec 2021 18:34:12 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0

Hi Russ,

it has turned out to be very difficult to find a reliable way to figure the memory that is available for
a process like the GNU APL interpreter. One (of several) reasons for that is that GNU linux tells you
that it has more memory than it really has (so-called over-commitment). You can malloc() more memory
than you have and malloc will return a virtual memory (and no error) when you ask for it. However, if you
access the virtual memory at a later point in time (i.e. requesting real memory for it) then the process
may crash badly if that real memory is not available.

GNU APL deals with this by assuming (by default) that the total memory available for the GNU APL process is about 2 GB,

You can increase that amount (outside apl) be setting a larger memory limit, probably with:

ulimit --virtual-memory-size or ulimit -v (depending on platform).

in the shell or script before starting apl. However, note that:

* ulimit -v ulimited will not work because this is the default and GNU APL will apply its default of 2 GB in this case,
* the WS FULL behaviour of GNU APL (ans ⎕WA) will become unreliable. You must ensure that GNU APL will get
  as much memory as you promised it with ulimit, and
* all GNU APL cells have the same size (e.g. 24 byte on a 64-bit CPU, see apl -l37) even if the cells contain only
 Booleans.

The workaround for really large Boolean arrays is to pack them into the 64-bits of a GNU APL integer
and maybe use the bitwise functions (⊤∧, ⊤∨, ...) of GNU APL to access them group-wise.

Best Regards,
Jürgen


On 12/28/21 3:53 AM, Russtopia wrote:
Hi, doing some experiments in learning APL I was writing a word frequency count program that takes in a document, identifies unique words and then outputs the top 'N' occurring words.

The most straightforward solution, to me, seems to be ∘.≡ which works up to a certain dataset size. The main limiting statement in my program is

wordcounts←+⌿ (wl ∘.≡ uniqwords)

.. which generates a large boolean array which is then tallied up for each unique word.

I seem to run into a limit in GNU APL. I do not see an obvious ⎕SYL parameter to increase the limit and could not find any obvious reference in the docs either. What are the absolute max rows/columns of a matrix, and can the limit be increased? Are they separate or a combined limit?

      5 wcOuterProd 'corpus/135-0-5000.txt'    ⍝⍝ 5000-line document
Time: 26419 ms
  the   of   a and  to
 2646 1348 978 879 858
      ⍴wl
36564
      ⍴ uniqwords
5695

      5 wcOuterProd 'corpus/135-0-7500.txt'   ⍝⍝ 7500-line document
WS FULL+
wcOuterProd[8]  wordcounts←+⌿(wl∘.≡uniqwords)
                              ^             ^
      ⍴ wl
58666
      ⍴ uniqwords
7711


I have an iterative solution which doesn't use a boolean matrix to count the words, rather looping through using pick/take and so can handle much larger documents, but it takes roughy 2x the execution time.

Relating to this, does GNU APL optimize boolean arrays to minimize storage (ie., using larger bit vectors rather than entire ints per bool) and is there any clever technique other experience APLers could suggest to maintain the elegant 'loop-free' style of computing but avoid generating such large bool matrices? I thought of perhaps a hybrid approach where I iterate through portions of the data and do partial ∘.≡ passes but of course that complicates the algorithm.

[my 'outer product' and 'iterative' versions of the code are below]

Thanks,
-Russ

---
#!/usr/local/bin/apl --script
 ⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
⍝                                                                    ⍝
⍝ wordcount.apl                        2021-12-26  20:07:07 (GMT-8)  ⍝
⍝                                                                    ⍝
 ⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝

⍝ function edif has ufun1 pointer 0!

∇r ← timeMs; t
  t ← ⎕TS
  r ← (((t[3]×86400)+(t[4]×3600)+(t[5]×60)+(t[6]))×1000)+t[7]


∇r ← lowerAndStrip s;stripped;mixedCase
 stripped ← '               abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
 mixedCase ← ⎕av[11],' ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
 r ← stripped[mixedCase ⍳ s]


∇c ← h wcIterative fname
  ⍝⍝;D;WL;idx;len;word;wc;wcl;idx
  ⍝⍝ Return ⍒-sorted count of unique words in string vector D, ignoring case and punctuation
  ⍝⍝ @param h(⍺) - how many top word counts to return
  ⍝⍝ @param D(⍵) - vector of words
  ⍝⍝⍝⍝
  D ← lowerAndStrip (⎕fio['read_file'] fname)  ⍝ raw text with newlines
  timeStart ← timeMs
  D ← (~ D ∊ ' ') ⊂ D ⍝ make into a vector of words
  WL ← ∪D
  ⍝⍝#DEBUG# ⎕ ← 'unique words:',WL
  wcl ← 0⍴0
  idx ← 1
  len ← ⍴WL
count:
  ⍝⍝#DEBUG# ⎕ ← idx
  →(idx>len)/done
  word ← ⊂idx⊃WL
  ⍝⍝#DEBUG# ⎕ ← word
  wc ← +/(word≡¨D)
  wcl ← wcl,wc
  ⍝⍝#DEBUG# ⎕ ← wcl
  idx ← 1+idx
  → count
done:
  c ← h↑[2] (WL)[⍒wcl],[0.5]wcl[⍒wcl]
  timeEnd ← timeMs
  ⎕ ← 'Time:',(timeEnd-timeStart),'ms'


∇r ← n wcOuterProd fname
  ⍝⍝ ;D;wl;uniqwords;wordcounts;sortOrder
  D ← lowerAndStrip (⎕fio['read_file'] fname)  ⍝ raw text with newlines
  timeStart ← timeMs
  wl ← (~ D ∊ ' ') ⊂ D
  ⍝⍝#DEBUG# ⎕ ← '⍴ wl:', ⍴ wl
  uniqwords ← ∪wl
  ⍝⍝#DEBUG# ⎕ ← '⍴ uniqwords:', ⍴ uniqwords
  wordcounts ← +⌿(wl ∘.≡ uniqwords)
  sortOrder ← ⍒wordcounts
  r ← n↑[2] uniqwords[sortOrder],[0.5]wordcounts[sortOrder]
  timeEnd ← timeMs
  ⎕ ← 'Time:',(timeEnd-timeStart),'ms'





reply via email to

[Prev in Thread] Current Thread [Next in Thread]