bug-grep

Re: grep dfa bug


From: Charles Levert
Subject: Re: grep dfa bug
Date: Mon, 1 Aug 2005 02:50:53 -0400
User-agent: Mutt/1.4.1i

* On Monday 2005-08-01 at 09:12:03 +0900, KIMURA Koichi wrote:
> 
> I think I found bug of dfa of gawk.

You mean grep?  (Both use a dfa.)


> Situation:
> In Japanese ShiftJIS locale, half-width katakana in character class
> does not match appropriately.
> 
> Reproduce:
> set LANG=ja_JP.SJIS
> export LANG
> echo ABCDE | grep '/[A-E]\+/p'
> 
> Actually, A B C D E is half-width katakana character.
> (data to reproduce is appended at end of this mail (uuencoded SJIS data))
> 
> Result:
> nothing printed.

 
> begin 644 testkana.sh
> M<V5T($Q!3D<]:F%?2E`N4TI)4PIE>'!O<address@hidden;F]T('!R:6YT"F5C!
> <:&address@hidden;address@hidden"!G<F5P("<O6[$MM5U<*R\G"@``(
> ``
> end

$ hexdump -C testkana.sh
00000000  73 65 74 20 4c 41 4e 47  3d 6a 61 5f 4a 50 2e 53  |set LANG=ja_JP.S|
00000010  4a 49 53 0a 65 78 70 6f  72 74 20 4c 41 4e 47 0a  |JIS.export LANG.|
00000020  23 6e 6f 74 20 70 72 69  6e 74 0a 65 63 68 6f 20  |#not print.echo |
00000030  b1 b2 b3 b4 b5 20 7c 20  67 72 65 70 20 27 2f 5b  |..... | grep '/[|
00000040  b1 2d b5 5d 5c 2b 2f 27  0a                       |.-.]\+/'.|

This shell script has several problems:

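   -- it shouldn't be "set LANG=ja_JP.SJIS"
      but just "LANG=ja_JP.SJIS", since in a
      Bourne-style shell "set" only replaces the
      positional parameters and leaves LANG
      untouched (better yet, use LC_ALL instead
      to be sure to override any other locale
      environment variable);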

   -- there shouldn't be slashes around the
      regular expression (that being awk or
      sed syntax).
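
To illustrate the first point in the list above, here is a minimal sketch of what "set NAME=value" does in a POSIX shell. DEMO_LOCALE is a hypothetical variable standing in for LANG, so that the locale of the running shell is left alone.

```shell
#!/bin/sh
# In a POSIX shell, `set NAME=value` replaces the positional parameters
# ($1, $2, ...) and does not assign NAME.
unset DEMO_LOCALE                 # hypothetical stand-in for LANG
set DEMO_LOCALE=ja_JP.SJIS
after_set=${DEMO_LOCALE-unset}    # the variable is still unset here
first_arg=$1                      # the whole string ended up in $1

# A plain assignment is what actually sets the variable.
DEMO_LOCALE=ja_JP.SJIS
after_assign=$DEMO_LOCALE

printf '%s\n' "after set:        $after_set (\$1 is $first_arg)"
printf '%s\n' "after assignment: $after_assign"
```

So with the original script, grep would have run under whatever locale the shell already had, not under ja_JP.SJIS.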

Fixing those two problems, I do get a match
using current CVS grep.
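
For reference, a sketch of the corrected command. It rebuilds the bytes b1..b5 from the hexdump above with printf octal escapes, so it does not depend on the editor's encoding. The variable names are only illustrative, and the grep step assumes a ja_JP.SJIS locale is installed (many systems do not ship one), so it is skipped otherwise.

```shell
#!/bin/sh
# Half-width katakana from the hexdump: bytes 0xB1..0xB5 (octal 261..265).
kana=$(printf '\261\262\263\264\265')
range=$(printf '[\261-\265]')

# Sanity check: dump the bytes that were just built.
kana_hex=$(printf '%s' "$kana" | od -An -tx1 | tr -d ' \n')
echo "bytes: $kana_hex"

# Corrected command: LC_ALL set for grep only, no slashes around the regex.
if locale -a 2>/dev/null | grep -qi '^ja_JP\.sjis$'; then
    status=0
    printf '%s\n' "$kana" | LC_ALL=ja_JP.SJIS grep "${range}\\+" || status=$?
    echo "grep exit status: $status"
else
    echo "ja_JP.SJIS locale not installed; skipping grep run"
fi
```

An exit status of 0 means a match, 1 means no match, and 2 means grep reported an error.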

However, using a more recent version of
regex.c et al. (as recently discussed on the
mailing list), I get a "grep: Invalid collation
character" error with an exit code of 2.

Repeating an equivalent experiment with UTF-8, it
works fine no matter what version of grep I use:

   $ echo 'アイウエオ' | LC_ALL=ja_JP.utf8 grep '[ア-オ]\+'
   アイウエオ

Strangely, this

   $ echo 'アイウエオ' | LC_ALL=en_US.utf8 grep '[ア-オ]\+'

only works with the recent regex.c and produces
the same error as above without it
(i.e., just the opposite of the ja_JP.SJIS case).

Is any UTF-8 locale supposed to know about the
collation order of languages other than its
main one (here en_US about ja_JP)?
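
To compare the UTF-8 cases side by side, something like the following could be used. It is only a sketch: try_locale is a hypothetical helper, the locale names are the ones from the commands above, and a locale that is not installed will typically make grep fall back to the C locale, so the results are only meaningful where both locales exist.

```shell
#!/bin/sh
# try_locale (hypothetical helper): run the same bracket range under the
# given locale and print grep's exit status (0 match, 1 no match, 2 error).
try_locale() {
    status=0
    printf '%s\n' 'アイウエオ' |
        LC_ALL=$1 grep '[ア-オ]\+' >/dev/null 2>&1 || status=$?
    echo "$1: exit $status"
}

report=$(for loc in ja_JP.utf8 en_US.utf8; do try_locale "$loc"; done)
printf '%s\n' "$report"
```

Running this once per grep build (with and without the recent regex.c) would show the difference described above as a change in exit status.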



