Re: character ranges in regular expressions
From: Eric Blake
Subject: Re: character ranges in regular expressions
Date: Fri, 24 Sep 2010 16:27:53 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Mnenhy/0.8.3 Thunderbird/3.1.3

On 09/24/2010 03:52 PM, Bruno Haible wrote:
> 1) Is there an agreement of what the result should be? Jim seems to prefer to
> extrapolate the result of the "C" locale, i.e. 26.

As do I.

> For other people, the locale
> dependent behaviour is useful, that is, 51 is desired.

Which is why my proposal is that glibc consider:

[A-Z] => match the C locale; 26 letters, regardless of locale

[[.A.]-[.Z.]] => use collation rules, since we explicitly spelled things
with collating symbols (26 letters in the POSIX locale, 51 or even more in
other locales, since accented characters might be included in the
collation range), so that we aren't completely losing CEO (collation
element order) behavior, if someone seriously has a reason to use it

[[:upper:]] => per POSIX rules in all locales

as well as:

clean up all the locale tables to make CEO consistent with strcoll,
rather than having some bizarre locales like cs_CZ. The locale
definition file determines both the strcoll and the CEO ordering, but
you can rearrange lines within a locale definition in a way that leaves
strcoll unchanged while changing CEO. So the bug in screwy locales like
cs_CZ is that they didn't follow a common layout pattern in the locale
definition file.
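
To make the 26-versus-51 figures concrete, here is a minimal sketch in Python.
It does not call glibc or any regex engine; it only models the two ways a
range endpoint pair can be interpreted. The interleaved ordering
(a A b B ... z Z) and the names INTERLEAVED_ORDER, range_by_codepoint and
range_by_collation are assumptions made up for illustration, not taken from
any real locale definition file.

```python
import string

# Hypothetical collation element order: lowercase and uppercase interleaved
# (a A b B ... z Z). Illustrative assumption only, not a real locale table.
INTERLEAVED_ORDER = [
    ch
    for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
    for ch in pair
]


def range_by_codepoint(lo, hi):
    """Native ordering: every character whose code point lies in [lo, hi]."""
    return [chr(cp) for cp in range(ord(lo), ord(hi) + 1)]


def range_by_collation(lo, hi, order):
    """Collation ordering: every element positioned from lo to hi in `order`."""
    start, end = order.index(lo), order.index(hi)
    return order[start:end + 1]


native = range_by_codepoint("A", "Z")
collated = range_by_collation("A", "Z", INTERLEAVED_ORDER)

print(len(native), len(collated))
```

Under the assumed ordering, the collation-based range from A to Z takes in
every lowercase letter except "a", which is how a range that looks like
"uppercase only" can end up covering 51 characters. A real locale may add
accented characters on top of that.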

> From around 2000, I
> remember a mail from Ulrich Drepper where he essentially said "you have to
> learn that in other locales range expressions work differently, use [[:alpha:]]
> instead".

But 2000 was in the timeframe when the POSIX rules on CEO were still
current; that POSIX rule was relaxed in 2001, such that POSIX itself
admits that CEO has a number of shortcomings, and mentions that native
ordering (i.e., matching the C locale) is a valid implementation option.

> 2) Is Ulrich aware that the subtle differences in the localedata/locales/*
> files lead to bizarre behaviour of regexec() in the cs_CZ, pl_PL, etc. locales?

If he still actively reads glibc bugs, yes:
http://sourceware.org/bugzilla/show_bug.cgi?id=12045
http://sourceware.org/bugzilla/show_bug.cgi?id=12051
--
Eric Blake address@hidden +1-801-349-2682
Libvirt virtualization library http://libvirt.org