bug-grep

Re: grep-2.10 testing


From: Bruno Haible
Subject: Re: grep-2.10 testing
Date: Mon, 21 Nov 2011 02:45:51 +0100
User-agent: KMail/1.13.6 (Linux/2.6.37.6-0.5-desktop; KDE/4.6.0; x86_64; ; )

Hi Jim,

> Stepping through that test [word-delim-multibyte] manually,
> (and what I should have done in the first place)
> I see this:
> 
>     openbsd$ e_acute=$(printf '\303\251')
>     openbsd$ echo "$e_acute" > in || framework_failure_
>     openbsd$ LC_ALL=en_US.UTF-8
>     -bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
>     openbsd$ export LC_ALL
> 
> So the real problem lies elsewhere.
> You could argue that the require_en_utf8_locale_
> function, run just prior, should have detected that the
> desired locale is not available.

I'm not sure why bash prints this warning.

> However, it runs a little helper program like this:
> 
>     openbsd$ tests/get-mb-cur-max en_US.UTF-8
>     4
> 
> which, since it prints a number in [3-6], does suggest that
> the locale exists.

There are generally two ways to check whether a UTF-8 locale is really
available:
  1) Does setlocale (LC_ALL, name) return non-NULL?
     The fact that tests/get-mb-cur-max succeeds implies that yes.
  2) Does nl_langinfo (CODESET) return a sensible value?
     Yes, in OpenBSD 4.9 it returns the string "UTF-8". Perfect.

> For the record, here are the <byte,boolean> isalpha pairs on OpenBSD 4.9.
> Note how there are many '1's after 127.  There are none with glibc.

The OpenBSD 4.9 table apparently corresponds to ISO-8859-1.

ISO C 99, section 7.4, paragraph 1, says:
  "In all cases the argument is an int, the value of which shall be
   representable as an unsigned char or shall equal the value of
   the macro EOF. If the argument has any other value, the behavior
   is undefined."

Thus you are allowed to call isalnum ((unsigned char) '\303'),
but the result depends on the implementation and on the locale.

The behaviours of glibc and OpenBSD in their UTF-8 locales are
therefore both valid.
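
To compare implementations without stepping into undefined behaviour,
one can count how many bytes above 127 isalpha accepts. This is only a
sketch; the function name count_high_alpha_bytes is hypothetical, and
it assumes 8-bit bytes:

```c
#include <ctype.h>

/* Count the bytes in 128..255 that isalpha accepts in the current
   locale.  Every argument is an int representable as an unsigned
   char, so each call is well defined under ISO C 99 section 7.4.  */
static int
count_high_alpha_bytes (void)
{
  int count = 0;
  for (int i = 128; i < 256; ++i)
    if (isalpha (i))
      ++count;
  return count;
}
```

In the "C" locale the count is 0, because ISO C restricts isalpha there
to the basic letters. After a successful setlocale into a UTF-8 locale,
the tables quoted above suggest it stays 0 with glibc and becomes
nonzero on OpenBSD 4.9.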

> /* Return non-zero if C is a `word-constituent' byte; zero otherwise.  */
> #define IS_WORD_CONSTITUENT(C) (isalnum(C) || (C) == '_')
> ...
> 
>   if (! initialized)
>     {
>       initialized = 1;
>       for (i = 0; i < NOTCHAR; ++i)
>         if (IS_WORD_CONSTITUENT(i))
>           setbit(i, letters);
>       setbit(eolbyte, newline);
>     }

If you want to make this loop more robust, use

          if (btowc (i) != WEOF && IS_WORD_CONSTITUENT(i))

and, of course, use the gnulib module 'btowc' to get an ISO C 99
compliant btowc function.
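
Spelled out as a self-contained function, the guarded loop might look
like the sketch below. NBYTES and the plain unsigned char array are
stand-ins (assumptions) for dfa.c's NOTCHAR and its bitset with
setbit; fill_word_bytes is a hypothetical name:

```c
#include <ctype.h>
#include <wchar.h>   /* btowc, WEOF */

enum { NBYTES = 256 };   /* stand-in for NOTCHAR; an assumption */

/* Mark in IS_WORD every byte that is a complete character in the
   current locale and is alphanumeric or an underscore.  Bytes for
   which btowc returns WEOF (for example, lead bytes of a multibyte
   sequence) are never marked.  */
static void
fill_word_bytes (unsigned char is_word[NBYTES])
{
  for (int i = 0; i < NBYTES; ++i)
    is_word[i] = (btowc (i) != WEOF && (isalnum (i) || i == '_'));
}
```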

> Debugging it, the first symptom I found is that
> grep's DFA transition table is different on *BSD systems.
> I first noticed in dfa.c's dfaexec:
> 
>     # With s=0 and *p=195 (aka \303)
>     (gdb-openbsd) p trans[s][*p]
>     $1 = 3
> 
> On other systems, that value is 0.
> 
> Why the difference?
> To answer that, you have to peer into build_state_zero->build_state->dfastate,

I don't know why the grep DFA splits the multibyte character "\303\251"
into bytes, rather than treating it with functions for multibyte characters
(mbrtowc, iswalpha, and so on). You know the idea of grep's algorithms better
than I do. (I don't even know what this 'trans' table means.)
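
For comparison, classifying the first character of a string with the
multibyte functions could look roughly like this. It is a sketch only,
not how grep does it; first_char_is_word is a hypothetical name, and
the -1 return convention is my own choice:

```c
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Decode the first multibyte character of S (at most N bytes) in the
   current locale.  Return 1 if it is alphanumeric or an underscore,
   0 if it is some other character, and -1 if the bytes are an invalid
   or incomplete sequence.  */
static int
first_char_is_word (const char *s, size_t n)
{
  mbstate_t state;
  wchar_t wc;

  memset (&state, 0, sizeof state);   /* start in the initial shift state */
  size_t len = mbrtowc (&wc, s, n, &state);
  if (len == (size_t) -1 || len == (size_t) -2)
    return -1;
  return iswalnum ((wint_t) wc) != 0 || wc == L'_';
}
```

In a working UTF-8 locale, "\303\251" would be decoded as a single
character here rather than as two separate bytes.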

If I change the definition of IS_WORD_CONSTITUENT in src/dfa.c to

/* Return non-zero if C is a `word-constituent' byte; zero otherwise.  */
#define IS_WORD_CONSTITUENT(C) (btowc (C) != WEOF && (isalnum(C) || (C) == '_'))

then the test behaves as expected:

  XFAIL: word-delim-multibyte

But of course, on glibc systems, you don't want the performance penalty from
the btowc call. Possibly you will want to cache the btowc results in an
array of length 256.
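
Such a cache could be as simple as the following sketch. The names
cached_btowc, btowc_cache and btowc_cache_ready are hypothetical, and
the code assumes the locale does not change after the first call
(otherwise btowc_cache_ready would have to be reset to 0):

```c
#include <wchar.h>

static wint_t btowc_cache[256];
static int btowc_cache_ready;

/* Return btowc (C), computing all 256 results once on first use.  */
static wint_t
cached_btowc (unsigned char c)
{
  if (!btowc_cache_ready)
    {
      for (int i = 0; i < 256; ++i)
        btowc_cache[i] = btowc (i);
      btowc_cache_ready = 1;
    }
  return btowc_cache[c];
}
```

IS_WORD_CONSTITUENT could then test cached_btowc (C) != WEOF instead of
calling btowc each time.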

Bruno
-- 
In memoriam Kerem Yılmazer <http://en.wikipedia.org/wiki/Kerem_Yılmazer>


