From: Bruno Haible
Subject: Re: character ranges in regular expressions
Date: Fri, 24 Sep 2010 13:27:43 +0200
User-agent: KMail/1.9.9
Paolo,

> Yes, this is what I'm curious about. Why does the table have the
> order A..Za..z for en_US.UTF-8 and aAbB...yYzZ for cs_CZ.UTF-8, even
> though strcoll uses the latter in both locales?

I don't know. But what is the "correct" result in the first place?

On my glibc-2.8 system I have a number of locales installed, and grep from
versions 2.4.2 to 2.7. When I create a file that has every printable ASCII
character, one per line, and do a "grep '[A-Z]'" of this file, which ASCII
characters should I get?
  26  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
or
  51  AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ
?

Find attached the input files and the results of the command

  for l in `locale -a`; do echo -n "$l "; LC_ALL=$l grep '[A-Z]' ascii1 | wc -l; done | expand -t 20

- In grep 2.4.2 the result was 51 for nearly all locales.
- In grep 2.5.3 the result was 51 for most UTF-8 locales but 26 for most
  unibyte locales.
- In grep 2.6.3 the result was again like in 2.4.2.
- In grep 2.7 the result is mixed, I cannot see a pattern.
  For en_US the result is 51, for en_US.utf8 it's 26 - this definitely
  is a bug, since the locale definition for en_US and en_US.utf8 is the same.
  For cs_CZ and cs_CZ.utf8 both it's 51.
  For zh_CN and zh_CN.UTF-8 it's 26.
- An additional bug is that in the vi_VN.tcvn locale, grep 2.7 gives an
  error 'unbalanced ['.

What is the correct result for 'grep' and for regex? (I assume it's the
same for both, since both are specified by POSIX.)

Bruno
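
For readers without the attachment, an input file like ascii1 could be regenerated roughly as follows. This is only a sketch, not the script that produced the attached file: the range 0x21..0x7E (printable ASCII without the space character) is an assumption, so the line count may differ from the original attachment.

```shell
#!/bin/sh
# Write each printable ASCII character (codes 33..126; assumed range,
# space excluded) on its own line into the file "ascii1".
i=33
while [ "$i" -le 126 ]; do
  # %b expands \0ddd octal escapes, so e.g. \0101 becomes "A".
  printf '%b\n' "\\0$(printf '%03o' "$i")"
  i=$((i + 1))
done > ascii1

# Baseline: in the C locale, ranges follow byte values, so only the
# 26 uppercase letters match.
LC_ALL=C grep -c '[A-Z]' ascii1
```

The C-locale count gives a fixed reference point against which the per-locale counts from the loop above can be compared.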
Attachments:
  ascii1 (Text document)
  result-2.4.2 (Text document)
  result-2.5.3 (Text document)
  result-2.7 (Text document)
  result-2.6.3 (Text document)