guile-devel

`regexp-exec' and non-ascii strings


From: Clinton Ebadi
Subject: `regexp-exec' and non-ascii strings
Date: Sun, 06 Mar 2011 14:52:41 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)

Greetings,

While debugging[0] an issue with Bobot++ (poor sneek!) aborting after
calling scm_regexp_exec on any UTF-8 string, I eventually realized
that the string was actually single-byte encoded internally. After
following that down the wrong path, I tested `regexp-exec' with a
*valid* Latin-1 string, and that too aborted in `fixup_multibyte_match'.

I have attached a patch that I think is correct. Instead of
unconditionally calling `fixup_multibyte_match' whenever wchar_t is
available, it checks whether the Scheme string being matched is
actually a multibyte string. This lets applications that have not set
a string encoding match non-ASCII strings.
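
To illustrate the idea outside of Guile, here is a minimal standalone
sketch in C. It is not the attached patch and not Guile's code: the
names `byte_to_char_offset' and `fixup_match_offsets' and the
`is_multibyte' flag are made up for this example, and the flag merely
stands in for whatever check on the Scheme string the patch performs.
The sketch converts POSIX regex byte offsets to character offsets with
`mbrlen', but only when the caller says the subject is multibyte, and
it reports an undecodable byte to the caller rather than aborting.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in the first NBYTES bytes
   of S using the current locale's encoding.  Returns -1 if mbrlen
   rejects a byte sequence, leaving the decision to the caller.  */
long
byte_to_char_offset (const char *s, size_t nbytes)
{
  mbstate_t state;
  size_t pos = 0;
  long chars = 0;

  memset (&state, 0, sizeof state);
  while (pos < nbytes)
    {
      size_t len = mbrlen (s + pos, nbytes - pos, &state);
      if (len == (size_t) -1 || len == (size_t) -2)
        return -1;              /* invalid or truncated sequence */
      if (len == 0)
        len = 1;                /* an embedded NUL is one character */
      pos += len;
      chars++;
    }
  return chars;
}

/* Hypothetical guard: rewrite the offsets in MATCHES from bytes to
   characters only when IS_MULTIBYTE is nonzero.  Single-byte subjects
   are left alone, so mbrlen never sees them.  Returns 0 on success,
   -1 if the subject could not be decoded.  */
int
fixup_match_offsets (regmatch_t *matches, size_t nmatch,
                     const char *subject, int is_multibyte)
{
  size_t i;

  assert (matches != NULL && subject != NULL);
  if (!is_multibyte)
    return 0;                   /* byte offsets already equal char offsets */

  for (i = 0; i < nmatch; i++)
    {
      long so, eo;

      if (matches[i].rm_so < 0)
        continue;               /* this subexpression did not match */
      so = byte_to_char_offset (subject, (size_t) matches[i].rm_so);
      eo = byte_to_char_offset (subject, (size_t) matches[i].rm_eo);
      if (so < 0 || eo < 0)
        return -1;
      matches[i].rm_so = (regoff_t) so;
      matches[i].rm_eo = (regoff_t) eo;
    }
  return 0;
}
```

With `is_multibyte' set to 0, a subject containing a raw Latin-1 byte
such as 0xE9 keeps its byte offsets untouched, which is the behavior
the patch aims for in the unset-locale case.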

If you call `setlocale' with any locale, things sort of work. In the
case of "C", non-ASCII characters are escaped upon read, and in the case
of "latin1", `mbrlen' will not reject the char code (AFAICT, I'm not an
expert in this area).

Unfortunately this means I don't see an easy way to write a test for the
suite--it only happens in the case where the locale is "C" and no port
encoder is set. <http://paste.lisp.org/display/120245#5> is what I was
going for and will show the bug if run by hand.

I'm not entirely certain this is the *correct* solution, but I think it
should be--it seems bad to abort() applications that use regexps but
haven't set their locale yet!

(My papers for Guile are on file AFAIK FWIW)

[0] http://paste.lisp.org/display/120245

Attachment: 0001-2011-03-05-Clinton-Ebadi-clinton-unknownlamer.org.patch
Description: Text Data

-- 
Jessie: but today i was a nerd
Jessie: i even read slashdot.

Attachment: pgpZSA50iPkWU.pgp
Description: PGP signature

