Re: German Umlauts / UTF8 with comparse

On Tue, Feb 18, 2020 at 12:44 PM <address@hidden> wrote:

Christoph Lange <address@hidden> wrote:
> Yes, this helps. Kind of ;-) ... using the character set
> char-set:alphabetic, my umlauts are now parsed. But I don't get them back
> in my result, at least not as printable characters. Instead, the following
> happens, and utterly confuses me:

Hmm, indeed. From what I can see, the result of parse is not encoded in
UTF-8.

I had a look at comparse’s code and found that the (as-string) combiner
uses (->string) internally. But since comparse doesn’t use the utf8 egg,
it picks up the core version of (->string), which happens to encode #\ä
in latin-1!

The only workaround I can think of right now is to do the conversion
back to a string yourself, in your own utf8-aware code, instead of
inside the comparse egg.

This would look something like this:

(import comparse utf8 utf8-srfi-14 unicode-char-sets)

(define s "Gänsesäger 2,1")
(define s1 "Rotkehlchen 1,0")

(define (utf8-in cs)
  (satisfies (lambda (c) (char-set-contains? cs c))))

(define letter
  (utf8-in char-set:alphabetic))

(define letters
  (repeated letter 1 20))

(define (parse-as-string parser input)
  (list->string (parse parser input)))

(define p1 (parse-as-string letters (string->list s1)))
(define p (parse-as-string letters (string->list s)))

PS: a trick I used to check the encoding of the strings was the ,d csi
command, which prints the contents of a string byte by byte. That makes
it easy to see whether non-ASCII characters really take more than one
byte each, as they should in UTF-8.
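
The same check can also be done from code by counting bytes. This is
only a rough sketch, assuming CHICKEN 5 (where core strings are plain
byte sequences) and a source file saved as UTF-8; byte-count is a
hypothetical helper name, not something from comparse or the utf8 egg:

```scheme
;; Assumption: CHICKEN 5, source file saved as UTF-8.
(import (chicken blob))

;; Hypothetical helper: number of bytes the string occupies in memory.
(define (byte-count str)
  (blob-size (string->blob str)))

(print (byte-count "Rotkehlchen")) ; ASCII only: one byte per character
(print (byte-count "Gänsesäger"))  ; in UTF-8 each ä should take two bytes
```

If the second string is really UTF-8, its byte count should come out as
12 rather than 10, since it has ten characters and two of them are ä.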

From:	Christoph Lange
Subject:	Re: German Umlauts / UTF8 with comparse
Date:	Tue, 18 Feb 2020 12:54:06 +0100