From: Jim Meyering
Subject: Re: grep-2.10 testing
Date: Mon, 21 Nov 2011 14:30:57 +0100

Bruno Haible wrote:
>> Stepping through that test [word-delim-multibyte] manually,
>> (and what I should have done in the first place)
>> I see this:
>>
>> openbsd$ e_acute=$(printf '\303\251')
>> openbsd$ echo "$e_acute" > in || framework_failure_
>> openbsd$ LC_ALL=en_US.UTF-8
>> -bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
>> openbsd$ export LC_ALL
>>
>> So the real problem lies elsewhere.
>> You could argue that the require_en_utf8_locale_
>> function, run just prior, should have detected that the
>> desired locale is not available.
>
> I'm not sure why bash prints this error message.
>
>> However, it runs a little helper program like this:
>>
>> openbsd$ tests/get-mb-cur-max en_US.UTF-8
>> 4
>>
>> which, since it prints a number in [3-6], does suggest that
>> the locale exists.
>
> There are generally two ways to check whether a UTF-8 locale is really
> available:
> 1) Does setlocale (LC_ALL, name) return non-NULL?
> The fact that tests/get-mb-cur-max succeeds implies that yes.
> 2) Does nl_langinfo (CODESET) return a sensible value?
> Yes, in OpenBSD 4.9 it returns the string "UTF-8". Perfect.
>
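A minimal sketch of those two checks, assuming a POSIX system that provides <langinfo.h>. The helper name locale_is_usable_utf8 is hypothetical and is not what tests/get-mb-cur-max does; note that it leaves the process locale changed when the first check succeeds.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Return true if locale NAME can be selected and reports "UTF-8"
   as its codeset.  Some systems spell the codeset differently
   (for example "utf8"); this sketch only accepts "UTF-8".  */
static bool
locale_is_usable_utf8 (const char *name)
{
  assert (name != NULL);

  /* Check 1: can the locale be selected at all?  */
  if (setlocale (LC_ALL, name) == NULL)
    return false;

  /* Check 2: does the selected locale report a sensible codeset?  */
  return strcmp (nl_langinfo (CODESET), "UTF-8") == 0;
}
```

Neither check says anything about how isalpha or isalnum classify bytes above 127, which is the part that differs between OpenBSD and glibc here.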
>> For the record, here are the <byte,boolean> isalpha pairs on OpenBSD 4.9.
>> Note how there are many '1's after 127. There are none with glibc.
>
> The OpenBSD 4.9 table apparently corresponds to ISO-8859-1.
>
> ISO C 99 section 7.4.(1) says:
> "In all cases the argument is an int, the value of which shall be
> representable as an unsigned char or shall equal the value of
> the macro EOF. If the argument has any other value, the behavior
> is undefined."
>
> Thus you are allowed to call isalnum ((unsigned char) '\303'),
> but the value will be implementation dependent.
>
> The behaviours of glibc and OpenBSD in their UTF-8 locales are
> therefore both valid.
>
>> /* Return non-zero if C is a `word-constituent' byte; zero otherwise. */
>> #define IS_WORD_CONSTITUENT(C) (isalnum(C) || (C) == '_')
>> ...
>>
>> if (! initialized)
>>   {
>>     initialized = 1;
>>     for (i = 0; i < NOTCHAR; ++i)
>>       if (IS_WORD_CONSTITUENT(i))
>>         setbit(i, letters);
>>     setbit(eolbyte, newline);
>>   }
>
> If you want to make this loop more robust, use
>
> if (btowc (i) != WEOF && IS_WORD_CONSTITUENT(i))
Good idea. Patch below.
> and, of course, use the gnulib module 'btowc' to get an ISO C 99
> compliant btowc function.
Thanks. We already use it.
> I don't know why the grep DFA splits the multibyte character "\303\251"
> into bytes, rather than treating it with functions for multibyte characters
> (mbrtowc, iswalpha, and so on). You know the idea of grep's algorithms better
> than I do. (I don't even know what this 'trans' table means.)
It's the DFA state transition table.
> If I change the definition of IS_WORD_CONSTITUENT in src/dfa.c to
>
> /* Return non-zero if C is a `word-constituent' byte; zero otherwise. */
> #define IS_WORD_CONSTITUENT(C) \
>   (btowc (C) != WEOF && (isalnum(C) || (C) == '_'))
>
> then the test behaves as expected:
>
> XFAIL: word-delim-multibyte
>
> But of course, on glibc systems, you don't want the performance penalty from
> the btowc call. Possibly you will want to cache the btowc results in an
> array of length 256.
That might be worthwhile, since it's called 256 times in these places:

  dfa initialization (several times)
  dfa analysis, with \w,\W in the pattern
  dfaexec

However, with a patch like this one, on glibc systems there will be no cost:

diff --git a/src/dfa.c b/src/dfa.c
index e28726d..8f79508 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1071,8 +1071,18 @@ parse_bracket_exp (void)
   return CSET + charclass_index(ccl);
 }
 
+/* Add this to the test for whether a byte is word-constituent, since on
+   BSD-based systems, many values in the 128..255 range are classified as
+   alphabetic, while on glibc-based systems, they are not.  */
+#ifdef __GLIBC__
+# define octet_valid_as_wide_char(c) 1
+#else
+# define octet_valid_as_wide_char(c) (MBS_SUPPORT && btowc (c) != WEOF)
+#endif
+
 /* Return non-zero if C is a `word-constituent' byte; zero otherwise.  */
-#define IS_WORD_CONSTITUENT(C) (isalnum(C) || (C) == '_')
+#define IS_WORD_CONSTITUENT(C) \
+  (octet_valid_as_wide_char(C) && (isalnum(C) || (C) == '_'))
 
 static token
 lex (void)
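
For comparison, the caching alternative suggested above (an array of length 256 holding the per-byte result) could look roughly like this. It is only a sketch and not part of the patch: the names NBYTES, word_byte, init_word_byte_table and is_word_constituent are hypothetical, and the table would have to be rebuilt if the locale changed after the first lookup.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical names; none of these exist in src/dfa.c.  */
enum { NBYTES = UCHAR_MAX + 1 };

static bool word_byte[NBYTES];
static bool word_byte_ready;

/* Classify every byte once under the current locale.  A byte counts
   as word-constituent only if it is a complete character on its own
   (btowc does not return WEOF) and is alphanumeric or '_'.  */
static void
init_word_byte_table (void)
{
  for (int i = 0; i < NBYTES; i++)
    word_byte[i] = btowc (i) != WEOF && (isalnum (i) || i == '_');
  word_byte_ready = true;
}

/* Table lookup replacing the per-call btowc/isalnum test.
   C must be in the range 0..UCHAR_MAX.  */
static bool
is_word_constituent (int c)
{
  assert (0 <= c && c < NBYTES);
  if (!word_byte_ready)
    init_word_byte_table ();
  return word_byte[c];
}
```

With such a table, each of the 256-iteration loops listed earlier would cost one array read per byte on every platform, at the price of 256 btowc calls during the first lookup.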