bug-grep

Re: grep-2.10 testing


From: Jim Meyering
Subject: Re: grep-2.10 testing
Date: Mon, 21 Nov 2011 14:30:57 +0100

Bruno Haible wrote:
>> Stepping through that test [word-delim-multibyte] manually,
>> (and what I should have done in the first place)
>> I see this:
>>
>>     openbsd$ e_acute=$(printf '\303\251')
>>     openbsd$ echo "$e_acute" > in || framework_failure_
>>     openbsd$ LC_ALL=en_US.UTF-8
>>     -bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
>>     openbsd$ export LC_ALL
>>
>> So the real problem lies elsewhere.
>> You could argue that the require_en_utf8_locale_
>> function, run just prior, should have detected that the
>> desired locale is not available.
>
> I'm not sure why bash prints this error message.
>
>> However, it runs a little helper program like this:
>>
>>     openbsd$ tests/get-mb-cur-max en_US.UTF-8
>>     4
>>
>> which, since it prints a number in [3-6], does suggest that
>> the locale exists.
>
> There are generally two ways to check whether a UTF-8 locale is really
> available:
>   1) Does setlocale (LC_ALL, name) return non-NULL?
>      The fact that tests/get-mb-cur-max succeeds implies that yes.
>   2) Does nl_langinfo (CODESET) return a sensible value?
>      Yes, in OpenBSD 4.9 it returns the string "UTF-8". Perfect.
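>
> Those two checks can be combined into a small standalone probe, roughly
> like this (the helper name is illustrative, not grep's actual
> require_en_utf8_locale_ implementation):
>
> ```c
> #include <langinfo.h>
> #include <locale.h>
> #include <stdbool.h>
> #include <stdio.h>
> #include <string.h>
>
> /* Hypothetical helper: report whether NAME names a usable UTF-8
>    locale, using the two checks described above.  */
> static bool
> utf8_locale_usable (const char *name)
> {
>   /* Check 1: does setlocale (LC_ALL, name) return non-NULL?  */
>   if (setlocale (LC_ALL, name) == NULL)
>     return false;
>   /* Check 2: does nl_langinfo (CODESET) report a UTF-8 codeset?  */
>   return strcmp (nl_langinfo (CODESET), "UTF-8") == 0;
> }
>
> int
> main (void)
> {
>   /* Whether this prints 1 or 0 depends on the locales installed.  */
>   printf ("en_US.UTF-8 usable: %d\n",
>           utf8_locale_usable ("en_US.UTF-8") ? 1 : 0);
>   return 0;
> }
> ```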
>
>> For the record, here are the <byte,boolean> isalpha pairs on OpenBSD 4.9.
>> Note how there are many '1's after 127.  There are none with glibc.
>
> The OpenBSD 4.9 table apparently corresponds to ISO-8859-1.
>
> ISO C 99 section 7.4.(1) says:
>   "In all cases the argument is an int, the value of which shall be
>    representable as an unsigned char or shall equal the value of
>    the macro EOF. If the argument has any other value, the behavior
>    is undeļ¬ned."
>
> Thus you are allowed to call isalnum ((unsigned char) '\303'),
> but the value will be implementation dependent.
>
> The behaviours of glibc and OpenBSD in their UTF-8 locales are
> therefore both valid.
>
>> /* Return non-zero if C is a `word-constituent' byte; zero otherwise.  */
>> #define IS_WORD_CONSTITUENT(C) (isalnum(C) || (C) == '_')
>> ...
>>
>>   if (! initialized)
>>     {
>>       initialized = 1;
>>       for (i = 0; i < NOTCHAR; ++i)
>>         if (IS_WORD_CONSTITUENT(i))
>>           setbit(i, letters);
>>       setbit(eolbyte, newline);
>>     }
>
> If you want to make this loop more robust, use
>
>           if (btowc (i) != WEOF && IS_WORD_CONSTITUENT(i))

Good idea.  Patch below.

> and, of course, use the gnulib module 'btowc' to get an ISO C 99
> compliant btowc function.

Thanks.  We already use it.

> I don't know why the grep DFA splits the multibyte character "\303\251"
> into bytes, rather than treating it with functions for multibyte characters
> (mbrtowc, iswalpha, and so on). You know the idea of grep's algorithms better
> than I do. (I don't even know what this 'trans' table means.)

It's the DFA state transition table.

> If I change the definition of IS_WORD_CONSTITUENT in src/dfa.c to
>
> /* Return non-zero if C is a `word-constituent' byte; zero otherwise.  */
> #define IS_WORD_CONSTITUENT(C) \
>   (btowc (C) != WEOF && (isalnum(C) || (C) == '_'))
>
> then the test behaves as expected:
>
>   XFAIL: word-delim-multibyte
>
> But of course, on glibc systems, you don't want the performance penalty from
> the btowc call. Possibly you will want to cache the btowc results in an
> array of length 256.
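>
> Such a cache could look roughly like this sketch (the names are
> illustrative; grep's actual patch takes the cheaper route of skipping
> the check entirely on glibc):
>
> ```c
> #include <assert.h>
> #include <ctype.h>
> #include <locale.h>
> #include <stdbool.h>
> #include <wchar.h>
>
> /* Hypothetical cache: for each of the 256 byte values, remember
>    whether btowc maps it to a valid wide character in the current
>    locale.  Must be rebuilt if the locale changes.  */
> static bool byte_is_valid_wide_char[256];
>
> static void
> init_byte_table (void)
> {
>   for (int i = 0; i < 256; i++)
>     byte_is_valid_wide_char[i] = btowc (i) != WEOF;
> }
>
> #define IS_WORD_CONSTITUENT(C) \
>   (byte_is_valid_wide_char[(unsigned char) (C)] \
>    && (isalnum (C) || (C) == '_'))
>
> int
> main (void)
> {
>   setlocale (LC_ALL, "C");
>   init_byte_table ();
>
>   /* ASCII letters, digits, and '_' are word constituents everywhere.  */
>   assert (IS_WORD_CONSTITUENT ('a'));
>   assert (IS_WORD_CONSTITUENT ('_'));
>   assert (!IS_WORD_CONSTITUENT (' '));
>   return 0;
> }
> ```
>
> After the one-time 256-iteration setup, each classification is a table
> lookup instead of a btowc call.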

That might be worthwhile, since it's called 256 times in these places:
  dfa initialization (several times)
  dfa analysis, with \w,\W in the pattern
  dfaexec

However, with a patch like this one, on glibc systems there will be no cost:

diff --git a/src/dfa.c b/src/dfa.c
index e28726d..8f79508 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1071,8 +1071,18 @@ parse_bracket_exp (void)
   return CSET + charclass_index(ccl);
 }

+/* Add this to the test for whether a byte is word-constituent, since on
+   BSD-based systems, many values in the 128..255 range are classified as
+   alphabetic, while on glibc-based systems, they are not.  */
+#ifdef __GLIBC__
+# define octet_valid_as_wide_char(c) 1
+#else
+# define octet_valid_as_wide_char(c) (MBS_SUPPORT && btowc (c) != WEOF)
+#endif
+
 /* Return non-zero if C is a `word-constituent' byte; zero otherwise.  */
-#define IS_WORD_CONSTITUENT(C) (isalnum(C) || (C) == '_')
+#define IS_WORD_CONSTITUENT(C) \
+  (octet_valid_as_wide_char(C) && (isalnum(C) || (C) == '_'))

 static token
 lex (void)


