`regexp-exec' and non-ascii strings
From: Clinton Ebadi
Subject: `regexp-exec' and non-ascii strings
Date: Sun, 06 Mar 2011 14:52:41 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)
Greetings,
While debugging[0] an issue with Bobot++ (poor sneek!) aborting after
calling scm_regexp_exec on any utf-8 strings I eventually realized
that... the string was actually single-byte encoded internally. After
taking that down the wrong path I eventually tested `regexp-exec' with a
*valid* latin-1 string and that too aborted in `fixup_multibyte_match'.
I have attached a patch that I think is correct. Instead of
unconditionally calling `fixup_multibyte_match' when wchar_t is
available, it checks whether the scheme string being matched is
actually a multibyte string. This permits applications that set no
string encoding to match non-ascii strings.
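For readers without the patch at hand, here is a minimal standalone
sketch of the idea. It is not the attached patch and not Guile's code:
`bytes_to_char_offsets', `fixup_if_needed' and the `is_multibyte' flag
are hypothetical names, and the flag stands in for whatever check the
caller has for "this string is actually multibyte". The sketch assumes
the fixup step converts the byte offsets reported by regexec into
character offsets by walking the string with `mbrlen', and it returns an
error instead of calling abort() when `mbrlen' rejects a byte sequence.

```c
#include <limits.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: rewrite the byte offsets in MATCHES as character
   offsets by walking STR with mbrlen in the current locale.
   Returns 0 on success, or -1 if STR contains a sequence that mbrlen
   rejects; in that case MATCHES may be partially converted.  */
static int
bytes_to_char_offsets (regmatch_t *matches, size_t nmatches, const char *str)
{
  mbstate_t state;
  size_t char_idx = 0, byte_idx = 0;

  memset (&state, 0, sizeof state);

  for (;;)
    {
      /* Any offset that points at the current byte position becomes
         the current character position.  Unmatched groups (-1) never
         compare equal and are left alone.  */
      for (size_t i = 0; i < nmatches; i++)
        {
          if (matches[i].rm_so == (regoff_t) byte_idx)
            matches[i].rm_so = (regoff_t) char_idx;
          if (matches[i].rm_eo == (regoff_t) byte_idx)
            matches[i].rm_eo = (regoff_t) char_idx;
        }

      if (str[byte_idx] == '\0')
        break;

      size_t nbytes = mbrlen (str + byte_idx, MB_LEN_MAX, &state);
      if (nbytes == (size_t) -1 || nbytes == (size_t) -2)
        return -1;              /* invalid or truncated sequence */
      if (nbytes == 0)
        break;

      byte_idx += nbytes;
      char_idx++;
    }

  return 0;
}

/* Hypothetical wrapper: only run the conversion when the caller says
   the string is multibyte.  IS_MULTIBYTE is an assumption of this
   sketch, supplied by the caller.  */
static int
fixup_if_needed (regmatch_t *matches, size_t nmatches, const char *str,
                 int is_multibyte)
{
  if (!is_multibyte)
    return 0;                   /* byte offsets are character offsets */
  return bytes_to_char_offsets (matches, nmatches, str);
}
```

With the flag off, a single-byte string containing bytes above 0x7f
passes through with its offsets untouched, whatever the locale. With the
flag on, a string the current locale cannot decode yields -1, which the
caller can turn into an error rather than an abort.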
If you call `setlocale' with any locale things sort of work. In the case
of "C" non-ascii characters are escaped upon read, and in the case of
"latin1" `mbrlen' will not reject the char code (AFAICT, I'm not an
expert in this area).
Unfortunately this means I don't see an easy way to write a test for the
suite--it only happens in the case where the locale is "C" and no port
encoder is set. <http://paste.lisp.org/display/120245#5> is what I was
going for and will show the bug if run by hand.
I'm not entirely certain this is the *correct* solution, but I think it
should be--it seems bad to abort() applications that use regexps but
haven't set their locale yet!
(My papers for Guile are on file AFAIK FWIW)
[0] http://paste.lisp.org/display/120245
0001-2011-03-05-Clinton-Ebadi-clinton-unknownlamer.org.patch
Description: Text Data
--
Jessie: but today i was a nerd
Jessie: i even read slashdot.