chicken-users

Re: German Umlauts / UTF8 with comparse


From: Christoph Lange
Subject: Re: German Umlauts / UTF8 with comparse
Date: Tue, 18 Feb 2020 12:54:06 +0100

Ah, right. Thanks! If I remember correctly, that was also discussed in the older mail thread about parsing Japanese, when Moritz said that he didn't want to make comparse users dependent on utf8.

Works well now, and also thanks for mentioning the ,d trick!

On Tue, Feb 18, 2020 at 12:44 PM <address@hidden> wrote:
Christoph Lange <address@hidden> wrote:
> Yes, this helps. Kind of ;-) ... using the character set
> char-set:alphabetic, my umlauts are now parsed. But I don't get them back
> in my result, at least not as printable characters. Instead, the following
> happens, and utterly confuses me:

Hmm, indeed. From what I can see, the result of parse is not encoded in
UTF-8.

I looked at comparse’s code and found that the (as-string) combinator
uses (->string) internally. But since comparse doesn’t use the utf8 egg,
it gets the core version of (->string), which happens to encode #\ä in
Latin-1!

The only workaround I can think of right now is to move the conversion
back to a string out of the comparse egg and into your own utf8-aware
code.

This would look something like this:


(import comparse utf8 utf8-srfi-14 unicode-char-sets)

(define s "Gänsesäger 2,1")
(define s1 "Rotkehlchen 1,0")

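;; Parser that accepts a single character if it is a member of the
;; char-set cs, using the utf8-aware char-set-contains?.
(define (utf8-in cs)
  (satisfies (lambda (c) (char-set-contains? cs c))))

(define letter
  (utf8-in char-set:alphabetic))

(define letters
  (repeated letter 1 20))

;; Run the parser and turn the resulting list of characters back into a
;; string here, with the utf8 egg's list->string, instead of relying on
;; comparse's as-string.
(define (parse-as-string parser input)
  (list->string (parse parser input)))

(define p1 (parse-as-string letters (string->list s1)))
(define p (parse-as-string letters (string->list s)))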


PS: a trick I used to check the encoding of the strings is the ,d csi
command, which prints the contents of a string byte by byte. That makes
it easy to see whether non-ASCII characters take more than one byte, as
they should in UTF-8.
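
Outside csi, the same byte-level check can be sketched in code. This is only a minimal sketch, assuming CHICKEN 5 (where core strings are plain byte sequences) and a source file saved as UTF-8. The names string-bytes and byte-count are hypothetical helpers, not part of comparse or the utf8 egg.

```scheme
(import (chicken blob) srfi-4)

;; Hypothetical helper: the raw bytes of a string as a list of integers.
(define (string-bytes str)
  (u8vector->list (blob->u8vector (string->blob str))))

;; Hypothetical helper: how many bytes the string occupies.
(define (byte-count str)
  (blob-size (string->blob str)))

(print (string-bytes "abc"))       ; ASCII: one byte per character
(print (string-bytes "ä"))         ; two bytes (195 164) if this file is UTF-8
(print (byte-count "Gänsesäger"))  ; larger than the number of characters
```

A Latin-1 encoded ä would show up as the single byte 228 instead, which is the symptom to look for in a parse result.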


--
Christoph Lange
Lotsarnas Väg 8
430 83 Vrångö
