bug-grep

Re: character ranges in regular expressions


From: Bruno Haible
Subject: Re: character ranges in regular expressions
Date: Fri, 24 Sep 2010 13:27:43 +0200
User-agent: KMail/1.9.9

Paolo,

> Yes, this is what I'm curious about.  Why does the table have the
> order A..Za..z for en_US.UTF-8 and aAbB...yYzZ for cs_CZ.UTF-8, even
> though strcoll uses the latter in both locales?

I don't know.

But what is the "correct" result in the first place?

On my glibc-2.8 system I have a number of locales installed, as well as grep
versions 2.4.2 through 2.7.
When I create a file that has every printable ASCII character, one per line,
and do a "grep '[A-Z]'" of this file, which ASCII characters should I get?
      26   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
or    51   AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ   ?
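
A minimal sketch of how such an input file could be produced. It assumes
that "printable ASCII" means the codes 32 (space) through 126 (~); the
attached ascii1 may have been built differently.

```shell
# Sketch: build a file like the attached "ascii1".
# Assumption: "printable ASCII" means codes 32 (space) through 126 (~).
LC_ALL=C awk 'BEGIN { for (i = 32; i <= 126; i++) printf "%c\n", i }' > ascii1

# Count the lines that match the range expression in the current locale.
grep -c '[A-Z]' ascii1
```

The count printed by the last command is the number in question: 26 if the
range covers only the uppercase letters, 51 if it follows the aAbB...zZ
collation order.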

Find attached the input file and the results of the command

  for l in `locale -a`; do
    echo -n "$l "; LC_ALL=$l grep '[A-Z]' ascii1 | wc -l;
  done | expand -t 20

- In grep 2.4.2 the result was 51 for nearly all locales.
- In grep 2.5.3 the result was 51 for most UTF-8 locales
  but 26 for most unibyte locales.
- In grep 2.6.3 the result was again as in 2.4.2.
- In grep 2.7 the result is mixed; I cannot see a pattern.
  For en_US the result is 51, but for en_US.utf8 it is 26.
  This is definitely a bug, since the locale definitions of
  en_US and en_US.utf8 are the same.
  For both cs_CZ and cs_CZ.utf8 it is 51.
  For both zh_CN and zh_CN.UTF-8 it is 26.
- An additional bug is that in the vi_VN.tcvn locale,
  grep 2.7 gives an error 'unbalanced ['.

What is the correct result for 'grep' and for regex? (I assume it's the
same for both, since both are specified by POSIX.)

Bruno

Attachment: ascii1
Description: Text document

Attachment: result-2.4.2
Description: Text document

Attachment: result-2.5.3
Description: Text document

Attachment: result-2.7
Description: Text document

Attachment: result-2.6.3
Description: Text document

