Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
Fri, 20 Jul 2012 22:44:28 +0900
On Fri, Jul 20, 2012 at 8:56 PM, Дмитрий <address@hidden> wrote:
> As for the character classes, they can be generated quite easily from the
> UnicodeData.txt file. We can get a general category from this file
> by something like (string->symbol (caddr (string-split line ";"))); then we just
> need to map the categories into appropriate character classes (e.g. Lu
> belongs to upper, alpha, alphanum, graph), etc., and merge characters of the
> same category if they have adjacent codes.
> It's quite easy to do. If I'm not lazy I'll do this this weekend.
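For reference, the parse-and-merge step described above might
look roughly like the sketch below. It is only an illustration:
the names split-fields, parse-line and merge-ranges are made up
here, the input is assumed to be already sorted by code point
(as UnicodeData.txt is), and the mapping from general categories
to character classes is left out. Note that UnicodeData.txt
fields are separated by semicolons and some fields are empty, so
the sketch uses its own splitter that keeps empty fields.

```scheme
;; Split STR on the character DELIM, keeping empty fields.
(define (split-fields str delim)
  (let loop ((i (- (string-length str) 1))
             (end (string-length str))
             (fields '()))
    (cond ((< i 0) (cons (substring str 0 end) fields))
          ((char=? (string-ref str i) delim)
           (loop (- i 1) i (cons (substring str (+ i 1) end) fields)))
          (else (loop (- i 1) end fields)))))

;; Parse one UnicodeData.txt line into a (code-point . category) pair.
;; Field 0 is the hexadecimal code point, field 2 the general category.
(define (parse-line line)
  (let ((fields (split-fields line #\;)))
    (cons (string->number (car fields) 16)
          (string->symbol (caddr fields)))))

;; Merge a list of (code-point . category) pairs, sorted by code point,
;; into (first last category) ranges. Adjacent code points with the
;; same category end up in the same range.
(define (merge-ranges entries)
  (let loop ((entries entries) (ranges '()))
    (cond ((null? entries) (reverse ranges))
          ((and (pair? ranges)
                (eq? (caddr (car ranges)) (cdar entries))
                (= (+ 1 (cadr (car ranges))) (caar entries)))
           ;; Extend the most recent range by one code point.
           (loop (cdr entries)
                 (cons (list (caar ranges) (caar entries) (cdar entries))
                       (cdr ranges))))
          (else
           ;; Start a new single-character range.
           (loop (cdr entries)
                 (cons (list (caar entries) (caar entries) (cdar entries))
                       ranges))))))

;; Three lines copied from UnicodeData.txt, used as sample input.
(define sample-lines
  '("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
    "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;"
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"))

(define sample-ranges (merge-ranges (map parse-line sample-lines)))
```

With the sample lines, A and B collapse into one Lu range and
a stays on its own as an Ll range.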
Full Unicode character classes and case handling
are already in the utf8 egg.
These are not yet integrated with irregex because
irregex is written to be portable across any Scheme,
and so it uses its own char-set implementation. When
R7RS is released I'll re-package irregex accordingly.
Unfortunately, while the utf8 char-sets are very
compact, the DFA conversion of large, sparse Unicode
char-sets is quite large. I'd like eventually to make
a non-backtracking NFA regex matcher which only
compiles to DFA when you really need the speed.
In the meantime, a fast lookup table for the
script of a character would be nice, and this could
be used to tokenize a string of mixed-language text.
I thought I had this and can't seem to find it anywhere...
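As a rough idea of what such a table could look like (this is
not the lost code, just a sketch under some assumptions): a
sorted vector of (first last script) ranges searched by binary
search, plus a tokenizer that groups consecutive code points of
the same script. The names script-ranges, code-point-script and
script-runs are hypothetical, and the handful of ranges below is
a hand-picked illustrative subset, not a table generated from
the Unicode Scripts.txt data. The functions work on integer code
points, so decoding a string into code points is assumed to
happen elsewhere.

```scheme
;; Illustrative subset only; ranges must be sorted and non-overlapping.
(define script-ranges
  (vector '(#x0041 #x005A latin)      ; A-Z
          '(#x0061 #x007A latin)      ; a-z
          '(#x03B1 #x03C9 greek)      ; small alpha to omega
          '(#x0410 #x044F cyrillic)   ; basic Russian alphabet
          '(#x3041 #x3096 hiragana)
          '(#x30A1 #x30FA katakana)
          '(#x4E00 #x9FCC han)))

;; Binary search for the range containing code point CP.
;; Returns the script symbol, or 'unknown if no range matches.
(define (code-point-script cp)
  (let loop ((lo 0) (hi (- (vector-length script-ranges) 1)))
    (if (> lo hi)
        'unknown
        (let* ((mid (quotient (+ lo hi) 2))
               (range (vector-ref script-ranges mid)))
          (cond ((< cp (car range)) (loop lo (- mid 1)))
                ((> cp (cadr range)) (loop (+ mid 1) hi))
                (else (caddr range)))))))

;; Group a list of code points into runs of the same script.
;; Each run is (script code-point ...), in the original order.
(define (script-runs cps)
  (let loop ((cps cps) (runs '()))
    (cond ((null? cps)
           ;; Runs were accumulated in reverse; restore the order.
           (reverse (map (lambda (run)
                           (cons (car run) (reverse (cdr run))))
                         runs)))
          ((and (pair? runs)
                (eq? (caar runs) (code-point-script (car cps))))
           ;; Same script as the current run: add to it.
           (loop (cdr cps)
                 (cons (cons (caar runs) (cons (car cps) (cdar runs)))
                       (cdr runs))))
          (else
           ;; Script changed: start a new run.
           (loop (cdr cps)
                 (cons (list (code-point-script (car cps)) (car cps))
                       runs))))))
```

For example, (script-runs (list #x48 #x69 #x43F #x440)) splits
"Hi" followed by two Cyrillic letters into one latin run and one
cyrillic run. Spaces and punctuation come out as unknown here; a
real tokenizer would probably want to attach them to a
neighbouring run.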