From: Дмитрий
Subject: Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
Date: Fri, 20 Jul 2012 15:56:59 +0400

Hello again. :)

> I don't know what Alex's plan is for UTF8 support, but if you're willing
> to put in the effort to define character classes for the ranges you
> mentioned, possibly you could contribute them to the (upstream) irregex
> project. If the definition of these sets are big, maybe we could turn it
> into an optional add-in library.
  Well, the problem is that Unicode is not laid out as logically as ASCII is,
so there will be lots of very small subranges, and matching against these will
probably be inefficient.



  As for the character classes, they can be generated quite easily from the
UnicodeData.txt[1] file. We can get the general category[2] of each character
from this file with something like
(string->symbol (caddr (string-split line ";"))); then we just need to map the
categories onto the appropriate character classes (e.g. Lu belongs to upper,
alpha, alphanum, graph, etc.) and merge characters of the same category into
ranges when they have adjacent codes.
  It's quite easy to do. If I'm not lazy, I'll do this this weekend.
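
  A minimal sketch of the steps described above, assuming CHICKEN 5 (on
CHICKEN 4, string-split comes from the data-structures unit instead). The
names parse-line, category-classes, merge-adjacent and class-ranges are
hypothetical, and the mapping table is illustrative only: the Lu entry is the
one given above, while the Ll and Nd entries are assumptions. The sketch reads
a few sample lines rather than the whole file, relies on the lines being
sorted by code point (as they are in UnicodeData.txt), and does not handle the
"<..., First>" / "<..., Last>" range entries.

```scheme
;; CHICKEN 5 import; on CHICKEN 4 use (use data-structures) instead.
(import (chicken string))

;; Parse one UnicodeData.txt line into (code-point . category).
;; Fields are separated by semicolons: field 0 is the hexadecimal
;; code point, field 2 is the general category.
(define (parse-line line)
  (let ((fields (string-split line ";")))
    (cons (string->number (car fields) 16)
          (string->symbol (caddr fields)))))

;; Illustrative, deliberately incomplete category -> classes table.
;; Only the Lu row comes from the message; Ll and Nd are assumptions.
(define category-classes
  '((Lu upper alpha alphanum graph)
    (Ll lower alpha alphanum graph)
    (Nd digit alphanum graph)))

;; Collapse an ascending list of code points into (low . high) ranges.
(define (merge-adjacent codes)
  (let loop ((codes codes) (ranges '()))
    (cond ((null? codes) (reverse ranges))
          ((and (pair? ranges)
                (= (car codes) (+ 1 (cdar ranges))))
           ;; The code point extends the most recent range by one.
           (loop (cdr codes)
                 (cons (cons (caar ranges) (car codes)) (cdr ranges))))
          (else
           ;; Otherwise start a new single-element range.
           (loop (cdr codes)
                 (cons (cons (car codes) (car codes)) ranges))))))

;; Ranges of all code points in LINES whose category maps to CLASS.
(define (class-ranges lines class)
  (let loop ((lines lines) (codes '()))
    (if (null? lines)
        (merge-adjacent (reverse codes))
        (let* ((entry (parse-line (car lines)))
               (classes (cond ((assq (cdr entry) category-classes) => cdr)
                              (else '()))))
          (loop (cdr lines)
                (if (memq class classes)
                    (cons (car entry) codes)
                    codes))))))

;; A few lines in UnicodeData.txt format, in code-point order.
(define sample-lines
  '("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
    "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;"
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"
    "00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;"))

(class-ranges sample-lines 'upper)  ; => ((65 . 66) (192 . 192))
(class-ranges sample-lines 'alpha)  ; => ((65 . 66) (97 . 97) (192 . 192))
```

  Splitting on ";" rather than "," matters here, because some names in the
file (the range entries mentioned above) contain commas themselves.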

> This could be due to terminal and locale settings.
Well, UTF-8 in the Windows console is known to be seriously broken. If I
needed a UTF-8 console, I would install the Cygwin terminal; but right now
I'm mostly happy with cp866.

[1] http://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt
[2] http://www.unicode.org/reports/tr44/#General_Category_Values

 -- 
Yours sincerely,
Dmitry Kushnariov


