bug-apl

Re: [Bug-apl] Spell corrector - APL


From: Xiao-Yong Jin
Subject: Re: [Bug-apl] Spell corrector - APL
Date: Fri, 9 Sep 2016 22:58:23 -0500

This seems like good motivation for supporting quad equal: ⌸
See the key operator in Dyalog:
http://help.dyalog.com/15.0/Content/Language/Primitive%20Operators/Key.htm
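
As a small illustration of what key provides (Dyalog only; the sample
data are made up): the monadic form applies its operand once per unique
item, with the item as ⍺ and the positions where it occurs as ⍵, so a
frequency table becomes one short expression.

```apl
      ⍝ Dyalog APL sketch: ⍺ is each unique item, ⍵ the indices where it occurs
      {⍺,≢⍵}⌸'mississippi'      ⍝ one row per letter: m 1, i 4, s 4, p 2
      w←'the' 'cat' 'the'       ⍝ hypothetical word list
      {⍺,≢⍵}⌸w                  ⍝ one row per unique word, with its count
```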

Alternatively, having the interpreter pattern-match A[n]←x and
perform the update in place seems like a good way to go.
I'm not sure whether that is possible in GNU APL.
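
In the meantime, one way to get the counts without the (∪⍵)∘.⍷⍵ outer
product in the quoted hist function is to look each word up in the
unique list and then sort the resulting indices, so that equal words
become adjacent and each count is the length of a run. This is only a
sketch on a made-up word list; I have not timed it on big.txt, and how
fast u⍳w is on a nested vector depends on the interpreter.

```apl
⎕IO←1
w←'the' 'cat' 'the' 'an' 'cat' 'the'   ⍝ made-up sample; real input would be the word vector
u←∪w                 ⍝ unique words, in order of first appearance
i←u⍳w                ⍝ index in u of every word in w
s←i[⍋i]              ⍝ sorted indices: equal words are now adjacent
b←1,(1↓s)≠¯1↓s       ⍝ 1 where a new run starts
p←b/⍳⍴s              ⍝ start position of each run
n←(1↓p,1+⍴s)-p       ⍝ run lengths = counts, aligned with u
u,[1.5]n             ⍝ two-column table: the 3, cat 2, an 1
```

Memory use here stays proportional to ⍴w rather than (⍴u)×⍴w. As far as
I can tell, ⍷ in the quoted hist also matches substrings, so 'the' would
be counted inside 'there'; the lookup above compares whole words.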

> On Sep 9, 2016, at 10:27 PM, Christian Robert <address@hidden> wrote:
> 
> 
> I got to maybe 2% of the work with this:
> 
> alpha_only←{(⍵∊'abcdefghijklmnopqrstuvwxyz ')/⍵←tolower ⍵}
> remove_blank_lines←{(∊0≠⍴¨⍵)/⍵}
> tolower←{('abcdefghijklmnopqrstuvwxyz',⎕av)[('ABCDEFGHIJKLMNOPQRSTUVWXYZ',⎕av)⍳⍵]}
> 
> 
> 
>      )sic
>      )erase readfile_fast
>      ∇z←readfile_fast name;fd;lines;⎕io
> ⎕io←1 ⍝ Bring a file into a vector of strings, utf8 aware for both name and contents.
> →(0≠"r" ⎕fio[31] 18 ⎕cr name)/Error           ⍝ Can not read file ? → Error
> z←⎕fio[26] 18 ⎕cr name                        ⍝ First pass, read the whole file
> lines←⍳+/((↑"\n")=z)                          ⍝ Compute the iota for each line
> z←(⍴lines)⍴⍬                                  ⍝ Preallocate "z" to the right size
> fd←⎕fio[3] 18 ⎕cr name                        ⍝ Open the file
> ⊣ {⊣z[⍵]←⊂19 ⎕cr ⎕ucs ¯1↓⎕fio[8] fd} ⍤0 lines ⍝ Put each line in the preallocated "z"
> ⊣ ⎕fio[4] fd ⋄ →0                             ⍝ Close the file and return
> Error: ⎕ES ∊'Error on file "',name,'": ',⎕fio[2] | ⎕fio[1] ''
> ∇
> 
> 
> alpha_only←{(⍵∊'abcdefghijklmnopqrstuvwxyz ')/⍵←tolower ⍵}
> remove_blank_lines←{(∊0≠⍴¨⍵)/⍵}
> tolower←{('abcdefghijklmnopqrstuvwxyz',⎕av)[('ABCDEFGHIJKLMNOPQRSTUVWXYZ',⎕av)⍳⍵]}
> vertical←{,[⍳0]⍵}
> words_only←{(⍵∊'abcdefghijklmnopqrstuvwxyz ')/⍵←tolower ⍵}
> 
>      ⍝ then ...
> 
>      z←remove_blank_lines alpha_only ¨ tolower ¨ readfile_fast 'big.txt'
> 
>      ⍴ z
> 103561
>      ⍝ Here you have 103,561 lines, none empty, stripped of special
>      ⍝ characters (though there may be several blanks between words).
> 
>      ⌊/⍴¨z  ⍝ minimum line length, probably "I"
> 1
> 
>      ⌈/⍴¨z  ⍝ maximum line length: 2488 characters, perhaps 400 to 600 words on one line
> 2488
> 
>      ⍝ At this point you have to iterate (rank operator?) over those
>      ⍝ 103,561 lines to extract all the words in each line, saving the
>      ⍝ unique ones and counting the occurrences of each word.
> 
>      ⍝ Since APL can't do things like count['abc'] = 0 or count['abc'] += 1
>      ⍝ (indexing a vector with a string), this is close to a dead end
>      ⍝ (very difficult to do, though not impossible).
> 
>      ⍝ You will NEVER win a race against a language like "awk", which has
>      ⍝ string-indexed arrays as *part* of the basic language.
> 
> my 2 cents,
> 
> Xtian.
> 
> On 2016-09-09 17:39, Ala'a Mohammad wrote:
>> Hi,
>> 
>> I'm trying to create a simple spell corrector (Norvig at
>> http://norvig.com/spell-correct.html) in APL.
>> I tried but stumbled at the frequency/count stage and could not move
>> further. The blocker was either WS FULL or the apl process being killed.
>> I'm assuming the main issue is my lack of experience with APL, and thus
>> inefficient coding.
>> 
>> ftxt ← { ⎕FIO[26] ⍵ }
>> a ← 'abcdefghijklmnopqrstuvwxyz'
>> A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
>> cr ← ⎕UCS 13
>> nl ← ⎕UCS 10
>> tab ← ⎕UCS 9
>> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
>> alphamask ← { ~ ⍵ ∊ nonalpha }
>> hist ← { (⍪∪⍵),+/∨/¨(∪⍵)∘.⍷⍵ }
>> fhist ← { hist (alphamask txt) ⊂ downcase txt ← ftxt ⍵ }
>> ⍝ file ← '/misc/small.txt' ~ 28K
>> ⍝ file ← '/misc/xaa' ~ 1.3M
>> file ← '/misc/big.txt' ⍝ ~ 6.2M
>> ⍝ following 2 lines for debugging
>> ⎕ ← ⍴w ← (alphamask txt) ⊂ downcase txt ← ftxt file
>> ⎕ ← ⍴u ← ∪w
>> fhist file
>> 
>> The errors happened inside the 'hist' function, I presume mostly due
>> to the jot-dot find (if I understand correctly, it operates on a matrix
>> of size unique-length × words-length).
>> 
>> Is there any way to fix the issue, so that I can proceed to complete the solution?
>> 
>> Also, is this the way to create a simple spell corrector in APL (that is,
>> one that capitalizes on APL's strength as an array language)?
>> 
>> I'm using
>> LinuxMint 17.1 (kernel 3.13.0-37-generic #64-Ubuntu)
>> GNU APL 1.6 (794)
>> Zsh 5.0.2
>> Emacs 25.1.50.1
>> 
>> Best,
>> 
>> Ala'a
>> 
>> P.S.: I hoped that I could create the solution in APL and then get some
>> whacks on the head from fellow experienced APL programmers before
>> submitting it as 'another solution in X language', but the hope
>> stopped short before even reaching the probability stage.
>> 
>> 
> 



