guile-devel

Re: make check fails if no en_US.iso88591 locale


From: Mike Gran
Subject: Re: make check fails if no en_US.iso88591 locale
Date: Tue, 8 Sep 2009 18:28:52 -0700 (PDT)

> From: Neil Jerram <address@hidden>
> 
> make check fails for me in regexp.test:
> 
>   ...
>   Running regexp.test
>   guile: uncaught throw to unresolved: ()
> 
> because I don't have an en_US.iso88591 locale installed, and so
> 
>   (with-locale "en_US.iso88591" ...)
> 
> throws an 'unresolved exception.
> 

My bad.  I should have wrapped the 'with-locale' form in a 'pass-if',
which would have caught the exception.

> I can allow make check to complete by changing that line to
> 
>   (false-if-exception (with-locale "en_US.iso88591"
> 
> but I doubt that's the best fix.  Is the "en_US.iso88591" locale
> actually important for the enclosed tests?

It is important.  This is one of the problems with the whole Unicode
effort.  There is no Unicode-capable regex library.  The regexp.test
tries matching all bytes from 0 to 255.  It uses scm_to_locale_string
to prepare each string for dispatch to the libc regex calls, and
scm_from_locale_string to convert the results back.

If the current locale is C or ASCII, bytes above 127 will cause errors.
If the current locale is UTF-8, bytes above 127 will be converted into
multibyte sequences that won't be matched by the regular expression
being tested.  To pass the test in regexp.test, we need to use the
encoding that maps each of the code points 0 to 255 to a single byte,
which is ISO-8859-1.
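
The three cases above can be reproduced outside Guile.  The following is
a minimal sketch in Python, used here only for illustration: it does not
call scm_to_locale_string, and the variable names are made up for this
example.  It shows only how the three encodings treat the code points
0 to 255.

```python
# Code points 0..255, i.e. the characters the test tries to match.
text = "".join(chr(n) for n in range(256))

# ISO-8859-1 maps code point N to the single byte N, so the round trip
# is lossless and the encoded length equals the number of characters.
latin1_bytes = text.encode("latin-1")
latin1_round_trip_ok = latin1_bytes.decode("latin-1") == text

# ASCII has no representation for code points above 127, so encoding fails.
try:
    text.encode("ascii")
    ascii_fails = False
except UnicodeEncodeError:
    ascii_fails = True

# UTF-8 keeps 0..127 as one byte each but turns 128..255 into two bytes each.
utf8_bytes = text.encode("utf-8")

print(len(latin1_bytes), latin1_round_trip_ok, ascii_fails, len(utf8_bytes))
```

Only the Latin-1 case leaves every character as exactly one byte, which
is what a byte-oriented matcher needs for this test.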

So until a better regex comes along, wrapping the regex calls in an
8-bit-clean locale such as Latin-1 is necessary to avoid errors when
encoding arbitrary 8-bit data, as the test does.

This problem is cropping up now, and didn't occur before, because the
old scm_to_locale_string was just a stub that passed 8-bit data through
unmodified.

The libc regex library can actually be used with arbitrary Unicode data,
but it takes extra care.  UTF-8 can be used as the locale encoding, and
then the regular expressions must be written keeping in mind that each
non-ASCII character is really a multibyte sequence.
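
As a rough illustration of that extra care, here is a sketch that uses
Python's re module in bytes mode as a stand-in for a byte-oriented
matcher.  This is an assumption made for the example, not a description
of how libc's regex behaves in any particular locale, and the names are
hypothetical.

```python
import re

# U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
e_acute = "\u00e9".encode("utf-8")

# "." consumes one byte, so a single "." cannot match the whole character.
dot_matches_char = re.fullmatch(rb".", e_acute, re.DOTALL) is not None
two_dots_match_char = re.fullmatch(rb"..", e_acute, re.DOTALL) is not None

# A bracket expression built from the raw bytes is a set of single bytes,
# not a set containing one character.
bracket = re.compile(b"[" + e_acute + b"]")
bracket_matches_char = bracket.fullmatch(e_acute) is not None
bracket_matches_lead_byte = bracket.fullmatch(b"\xc3") is not None

# Treating the character as a grouped byte sequence gives the intended result.
group = re.compile(b"(?:" + re.escape(e_acute) + b")+")
group_matches_repeat = group.fullmatch(e_acute * 3) is not None

print(dot_matches_char, two_dots_match_char,
      bracket_matches_char, bracket_matches_lead_byte, group_matches_repeat)
```

In this model, single-character constructs such as "." and bracket
expressions have to be rewritten as explicit byte sequences whenever
they are meant to cover a non-ASCII character.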

> 
> Thanks,
>         Neil

Thanks,

Mike



