Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
Sat, 21 Jul 2012 14:34:17 +0400
> As I said, I'm a neophyte. My "character classes" were based around
> [a-zA-z] etc. So you can readily see why the pattern would have
> quickly become unreasonably complex.
If you don't need any exotic characters, just ASCII (and, probably, a small
subset of Unicode), character classes would be extremely simple:
(use irregex utf8)
; Cyrillic letter range (Cyrillic and Cyrillic Supplement blocks, U+0400 to U+052F):
(define cyrl '(/ #\u0400 #\u052F))
(define (split-into-classes s)
  (irregex-extract `(or (+ (or alpha ,cyrl)) (+ num)
                        (+ punct) (+ white)
                        (+ (~ alpha num punct white ,cyrl))) s))
Note that I'm also something of a neophyte, so there may be a better way to do
this.
Then you can use this procedure like this:
; In Linux/Cygwin you can input "Hello world! Да." directly, but not in Windows
(split-into-classes "Hello world! \u0414\u0430.")
=> ("Hello" " " "world" "!" " " "Да" ".")
But extending this procedure to cover all of Unicode would be tricky.
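One way to make the splitter a bit easier to extend is to pass the extra letter
ranges as arguments instead of hard-coding them. This is only a sketch, not a
tested solution: make-class-splitter and split-cyrl+greek are names made up for
illustration, the Cyrillic range is assumed to be U+0400 to U+052F, and the
Greek range is assumed to be the whole Greek and Coptic block (U+0370 to
U+03FF), which also contains a few code points that are not letters. It uses
the same (use ...) form as above, so it assumes Chicken 4.

```scheme
(use irregex utf8)

;; Hypothetical helper: build a splitter from any number of extra
;; letter ranges, each given as an SRE char range such as (/ #\a #\z).
(define (make-class-splitter . letter-ranges)
  ;; Compile the pattern once; the returned procedure reuses it.
  (let ((rx (irregex
             `(or (+ (or alpha ,@letter-ranges))   ; runs of letters
                  (+ num)                          ; runs of digits
                  (+ punct)                        ; runs of punctuation
                  (+ white)                        ; runs of whitespace
                  ;; runs of anything not covered above
                  (+ (~ alpha num punct white ,@letter-ranges))))))
    (lambda (s) (irregex-extract rx s))))

;; Assumed ranges (whole Unicode blocks, not exact letter sets):
(define cyrl  '(/ #\u0400 #\u052F))   ; Cyrillic + Cyrillic Supplement
(define greek '(/ #\u0370 #\u03FF))   ; Greek and Coptic

;; Hypothetical splitter that treats Cyrillic and Greek runs as words.
(define split-cyrl+greek (make-class-splitter cyrl greek))
```

Supporting another script is then a matter of passing one more range, although
this still relies on listing the ranges by hand, so it does not make full
Unicode coverage any less tricky.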
> I was planning on using Chicken to learn scheme, since R7SR is supposed
> to be based more on R5SR than on R6SR, but maybe it's better to learn
> using Racket.
It doesn't matter what tools you use as long as you have a desire to learn. I
was personally put off by Racket's extremely slow loading time.
Also note that I believe Racket doesn't have a built-in solution to split a
string into character classes either.
> (I *do* need to use utf-8 in lots of places, and an incomplete implementation
> while I was learning would be ... unpleasant. Particularly if the user
> documentation presumed that it *was* complete.)
What made you think it's incomplete? :o
The Windows console's UTF-8 support is incomplete, but on Chicken's side
everything is OK.