
[Chicken-users] Strange output from unit irregex


From: Henry Hu
Subject: [Chicken-users] Strange output from unit irregex
Date: Tue, 3 Jul 2018 17:40:13 -0400

Hello,

I am getting a strange result from unit irregex when matching negated character sets.

I recently upgraded to 4.13.0 to get the bug fix for an extra empty list in the SRE: https://github.com/ashinn/irregex/pull/18.  I was happy to find that "[]" bracketed character sets without "^" are working beautifully!  I am, however, seeing strange behavior with the "^" exclusion character.

The ⾀ character is three bytes long in UTF-8 and, displayed in byte form, looks like `\342\276\200`:

INPUT:
(use irregex) ; Not doing (use utf8) because I want start-index and end-index to function correctly
(irregex-match-substring (irregex-search (irregex "[^⾀]" 'utf8) "⾀⾀⾀"))

EXPECTED OUTPUT:
If a UTF-8 character is treated as a single character everywhere it appears:  `#f`
If a UTF-8 character is treated as a single character in some places and as a byte string in others:  `<the first byte of ⾀>` (displayed as `\342`), or `#f`
If a UTF-8 character is always treated as a byte string:  `#f`

OUTPUT:
`<the first byte of ⾀><the second byte of ⾀>` (looks like `\342\276`)

EVEN WORSE:
(irregex-match-substring (irregex-search (irregex "[^Ç]" 'utf8) "Ç")) ---> "Ç" ; A two-byte character
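
For anyone who wants to check the byte counts, here is a minimal sketch for listing the bytes of a string in the same octal notation as above. `string->octal-bytes` is a hypothetical helper of my own, not part of irregex, and it assumes the utf8 egg is not loaded, so that each Scheme char in a string is one byte and the source file is saved as UTF-8.

```scheme
;; Hypothetical helper (not part of irregex): return the bytes of a
;; string as a list of octal strings, e.g. "342" for the byte \342.
;; Assumes plain byte strings, i.e. (use utf8) has NOT been done.
(define (string->octal-bytes s)
  (map (lambda (c) (number->string (char->integer c) 8))
       (string->list s)))

;; The three-byte character from the first example:
(string->octal-bytes "⾀")   ; => ("342" "276" "200")

;; The two-byte character from the second example:
(string->octal-bytes "Ç")    ; => ("303" "207")
```

Passing the string returned by `irregex-match-substring` to this helper shows exactly how many bytes of the character came back.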

Am I doing something wrong?  Is "^" not designed to be used with multibyte characters?  Why would it return two bytes and not 0, 1, or 3?

Thank you!
