Re: character ranges in regular expressions

bug-grep

From:	Bruno Haible
Subject:	Re: character ranges in regular expressions
Date:	Fri, 24 Sep 2010 13:27:43 +0200
User-agent:	KMail/1.9.9

Paolo,

> Yes, this is what I'm curious about.  Why does the table have the
> order A..Za..z for en_US.UTF-8 and aAbB...yYzZ for cs_CZ.UTF-8, even
> though strcoll uses the latter in both locales?

I don't know.

But what is the "correct" result in the first place?

On my glibc-2.8 system I have a number of locales installed, as well as grep
versions 2.4.2 through 2.7.
When I create a file that has every printable ASCII character, one per line,
and do a "grep '[A-Z]'" of this file, which ASCII characters should I get?
      26   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
or    51   AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ   ?
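To make the question reproducible, here is a minimal sketch of how such an input file could be generated and counted. The helper names make_ascii1 and count_range are hypothetical, and the generated file is only assumed to be equivalent to the attached ascii1 (95 lines, character codes 32 through 126). The C locale serves as a baseline because there '[A-Z]' covers exactly the byte values of A through Z.

```shell
# Hypothetical helper: write every printable ASCII character
# (codes 32..126), one per line, to the file named by $1.
make_ascii1() {
  for i in $(seq 32 126); do
    # %b expands the \0NNN octal escape into the character itself
    printf '%b\n' "\\0$(printf '%o' "$i")"
  done > "$1"
}

# Hypothetical helper: count the lines of file $2 that match [A-Z]
# when grep runs in locale $1.
count_range() {
  LC_ALL=$1 grep -c '[A-Z]' "$2"
}

f=$(mktemp)            # stands in for the attached ascii1 file
make_ascii1 "$f"
count_range C "$f"     # prints 26 in the C locale
rm -f "$f"
```

Calling count_range with another installed locale name in place of C shows whether that locale yields 26 or 51.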

Find attached the input files and the results of the command

  for l in `locale -a`; do
    echo -n "$l "; LC_ALL=$l grep '[A-Z]' ascii1 | wc -l;
  done | expand -t 20

- In grep 2.4.2 the result was 51 for nearly all locales.
- In grep 2.5.3 the result was 51 for most UTF-8 locales
  but 26 for most unibyte locales.
- In grep 2.6.3 the result was again like in 2.4.2.
- In grep 2.7 the result is mixed; I cannot see a pattern.
  For en_US the result is 51, for en_US.utf8 it's 26 -
  this definitely is a bug, since the locale definition for
  en_US and en_US.utf8 is the same.
  For both cs_CZ and cs_CZ.utf8 it's 51.
  For zh_CN and zh_CN.UTF-8 it's 26.
- An additional bug is that in the vi_VN.tcvn locale,
  grep 2.7 gives an error 'unbalanced ['.
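
A count alone does not show which characters were matched. The following sketch prints the matched characters on one line, so that a 26 result and a 51 result can be told apart at a glance. The helper name matched_chars and the six-line sample file are assumptions made for illustration; they are not part of the attached results.

```shell
# Hypothetical helper: print, on a single line, the characters of
# file $2 that match [A-Z] when grep runs in locale $1.
matched_chars() {
  LC_ALL=$1 grep '[A-Z]' "$2" | tr -d '\n'
  echo
}

f=$(mktemp)
printf '%s\n' A a B b Z z > "$f"   # small sample, one character per line
matched_chars C "$f"               # prints ABZ in the C locale
rm -f "$f"
```

In a locale where the range follows an aAbB...zZ collation order, the lowercase letters a and b from the sample would be expected to appear in the output as well; whether they do depends on the grep version and the locale, which is exactly the inconsistency listed above.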

What is the correct result for 'grep' and for regex? (I assume it's the
same for both, since both are specified by POSIX.)

Bruno

Attachments (text documents):
- ascii1
- result-2.4.2
- result-2.5.3
- result-2.6.3
- result-2.7
