bug-sed
bug#31526: Range [a-z] does not follow collate order from locale.


From: Bize Ma
Subject: bug#31526: Range [a-z] does not follow collate order from locale.
Date: Fri, 25 May 2018 00:48:33 -0400

I believe that these lines carry the essence of the answer:

        > It is outside the scope of 'sed' to define the collation order.

        > Yes, there is a locale collation order.
        > It is defined in libc not in sed, and it is not well documented.
        > GNU sed has no way to change/determine it, or document what it is.

        >>    - That the order in any other locale is secret?
        >
        > Not "secret" as in someone actively trying to hide it,
        > but unknown/undocumented because the developers of GLIBC have not
        > documented it.

        >> But none explain in clear simple words what order the characters
        >> in a bracket range will follow in a locale that is NOT C. (see
        >> some simple examples above).
        >
        > Correct - that is not documented anywhere at the moment.

So:

    - This is not a bug that sed developers could or would resolve.
    - The sort order needs to be documented by glibc.

In fact, sed developers do not support bracket ranges in a locale that is
not C:

        > Any other locale than C is unspecified: do not use them.

Best Regards

Bize Ma



-----------------------------------------------------------------------------
Some general clarifications follow:


>> In range definitions I believe that there are two goals in conflict:
>>
>>      - A stable, simple range description for programmers.
>>      - A clear description (even if long) for multilanguage users.
>>
> Why are they in conflict? …

Because if a long description is required, then it is "not simple".


> Exactly because regex ranges in multibyte locales are not well-defined,
> the recommendation is not to use them in portable sed scripts.

Portable? That is a new word. It did not appear in previous e-mails.
Why do you assume that I want/need to have only "portable" ranges?



>> **********************************************************************
>> 1.- About ASCII character numeric ranges:
[...]
> In "C/POSIX" locale, regex range [a-d] matches a,b,c,d.
> In other locales, it is not well defined (and can match many variations,
> depending on your operating system/libc).

Yes, simple: sed defers to glibc (or another libc) the responsibility
to define and implement that order.
Thus: sed developers cannot support any specific range order.
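
A minimal sketch of the one case that is well defined, assuming a POSIX
shell and any sed; the variable name `c_range_result` is only for
illustration. Setting LC_ALL=C for the single command makes the range
follow byte order:

```shell
# Force the C locale for this one command so [a-d] means the bytes a, b, c, d.
c_range_result=$(printf 'abcdefg\n' | LC_ALL=C sed 's/[a-d]/x/g')
echo "$c_range_result"    # prints: xxxxefg
```

In any other locale the same range may match a different set of
characters, which is the point made in the quoted reply.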


[...]
>> The -E option is not (yet) defined in current POSIX (The Open Group
>> Base Specifications Issue 7, 2018 edition) for sed.
>> Yes, it is believed that it will be accepted for the next POSIX version.
>>
> Technically speaking, the "-E" option is not "unspecified".

I did not use the word "unspecified", I said: "not (yet) defined".
Please do not put words in my mouth.

> It is an extension beyond the current POSIX standard, and GNU programs
> have many such extensions.

And, as an extension, it is something that the POSIX standard has not (yet)
defined.
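
To make the distinction concrete, here is a small sketch, assuming a sed
that accepts -E (GNU sed does); the variable names are hypothetical. The
same substitution is written once with the extension and once with
nothing but basic regular expressions:

```shell
# Extended regular expression syntax, enabled by the -E extension.
ere_result=$(printf 'aaab\n' | sed -E 's/a+/X/')
# The same substitution using only POSIX basic regular expressions.
bre_result=$(printf 'aaab\n' | sed 's/aa*/X/')
echo "$ere_result $bre_result"    # prints: Xb Xb
```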



[...]
> But how do you treat range "[a-Z]" ?

If the collating order sorts `a` before `Z`, the range is valid and
should give a "reasonable" result.

    $ echo '0123456789:;<=>address@hidden' |
    >     LC_ALL=en_CA.utf8   sed 's/[^a-Z]//g'
    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

As you can see above, glibc (and thus sed) does not claim the range
to be invalid, and thus it returns a "reasonable" result
(whatever "reasonable" means here).

> This is range ASCII 97 to ASCII 90 ... is an implementation expected
> to swap the min/max values, and treat it as ASCII range 90-97 ?
> or somehow understand these are letters, and change it to ASCII 65
> to 122 ?

ASCII values only have an exact meaning in the C locale (and maybe in
C.UTF-8).
And that is only because they are the collating order of the C locale.

In other locales, the sort order is usually (very) different
from the ASCII numeric values.
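
A short sketch of the C-locale half of that statement, assuming a POSIX
shell with sort and tr; `c_sorted` is an illustrative name. In the C
locale, sort compares byte values, so both uppercase letters come before
both lowercase ones:

```shell
# Sort four letters by raw byte value: 'A' (65) and 'B' (66) precede 'a' (97) and 'b' (98).
c_sorted=$(printf '%s\n' b A a B | LC_ALL=C sort | tr -d '\n')
echo "$c_sorted"    # prints: ABab

# The same input under whatever locale the environment provides;
# the resulting order depends on that locale's collation rules.
printf '%s\n' b A a B | sort
```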



[...]
> 2. In multibyte locales, ranges of specific letters (e.g. "[A-D]")
> are not well specified and should be avoided in portable scripts.

That word again: portable. Only in portable scripts?
What should happen in all other scripts?

[...]
>> **********************************************************************
>> 3.- Correct exactly how.
[...]
>>    - That other ranges like [*-d] (valid in C) are a crazy idea? (No?)
>
> Instead of "crazy" let's call it "unspecified" …

Let's call it what it is: unsupported by sed.

>>    - References to collation order in the manuals must be stricken out? (No?)
> I'm not sure I understand this...

You said:
        I don't think it is documented to be so anywhere in GNU programs.

[...]
> The term "collation order" is defined in POSIX, e.g. here:
>
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_02

I have NOT asked what the term means in POSIX, but what it means to sed.

[...]
> Here's an example of glibc's strange behavior (or at least
> strange to me, as I found no explanation for it):
>
> In most multibyte UTF-8 locales the punctuation order
> differs from ASCII order,

Collation order is a language issue: each language has its own, often
conflicting, view of what "the correct order" should be. That is how
we humans think. Consider something as "simple" as everyday dates:
there are as many sets of month names as there are languages, and as
many sets of weekday names. That is what a member of any culture has
learnt to expect as the "natural order". All we can do, when
confronted with diverse expectations, is to accept that they exist
and adapt to them.

Please take a look at the Unicode Collation page:

    http://unicode.org/reports/tr10/

> … but is consistently the same (e.g. en_CA.UTF-8 and fr_FR.UTF-8).
> For some reason, ja_JP.UTF-8 order is more like ASCII.
>
> Compare the following:
>
>   $ printf "%s\n" a A b B "á" "あ" "ひ" . , : - = > in
>   $ LC_ALL=C           sort in > out-C
>   $ LC_ALL=en_CA.UTF-8 sort in > out-CA
>   $ LC_ALL=ja_JP.UTF-8 sort in > out-JA
>   $ paste out-C out-CA out-JA
> , = ,
> - - -
> . , .
> : : :
> = . =
> A あ A
> B ひ B
> a A a
> b a b
> á á あ
> あ B ひ
> ひ b á

What all the above reveals is one order, the order that sort follows.
But you are still failing to get it:

    That is entirely different from what glibc follows. Try:

    $ LC_ALL=C          sed 's/[A-B]/x/g' out-C  > out-C-sed
    $ LC_ALL=en_CA.utf8 sed 's/[A-B]/x/g' out-CA > out-CA-sed
    $ LC_ALL=ja_JP.utf8 sed 's/[A-B]/x/g' out-JA > out-JA-sed
    $ paste out-C-sed out-CA-sed out-JA-sed
, = ,
- - -
. , .
: : :
= . =
あ
ひ
a a
b a b
á á あ
あ ひ
ひ b á

    The `a` and the `á` were sorted between `A` and `B` in the
    en_CA.utf8 locale.
    But sed did NOT match them.
    Yes, this is just one particular example in the en_CA.utf8 locale.
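
    One way to sidestep the question, sketched here under the
    assumption that only the letters `A` and `B` are wanted, is to
    list the members explicitly instead of writing a range. The
    variable name `list_result` is only for illustration:

```shell
# An explicit bracket list names its members directly, so no range endpoints
# have to be interpreted through the locale's collation order.
list_result=$(printf '%s\n' A a B b | sed 's/[AB]/x/g' | tr -d '\n')
echo "$list_result"    # prints: xaxb
```

    This says nothing about how a range is interpreted in a non-C
    locale; it only avoids depending on it.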

[...]
>>> As such, I'm marking this as "not a bug" and closing the ticket,
>>> but discussion can continue by replying to this thread.
>>
>> I still remain in doubt, at the very minimum.
>
> I hope this helps clears things out, but I'm happy to continue
> this discussion if there are other questions.

I am clear now that this is unsupported by sed, thanks.

