
Re: [coreutils] tr: case mapping anomaly


From: Eric Blake
Subject: Re: [coreutils] tr: case mapping anomaly
Date: Fri, 24 Sep 2010 17:22:34 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Mnenhy/0.8.3 Thunderbird/3.1.3

On 09/24/2010 04:47 PM, Pádraig Brady wrote:
> I was just looking at a bug reported to Fedora there where this abort()s
>
>   $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]'

Behavior is already unspecified by POSIX when string1 is longer than string2. But given what POSIX does say:

"When both the -d and -s options are specified, any of the character class names shall be accepted in string2. Otherwise, only character class names lower or upper are valid in string2 and then only if the corresponding character class (upper and lower, respectively) is specified in the same relative position in string1. Such a specification shall be interpreted as a request for case conversion. When [:lower:] appears in string1 and [:upper:] appears in string2, the arrays shall contain the characters from the toupper mapping in the LC_CTYPE category of the current locale. When [:upper:] appears in string1 and [:lower:] appears in string2, the arrays shall contain the characters from the tolower mapping in the LC_CTYPE category of the current locale. The first character from each mapping pair shall be in the array for string1 and the second character from each mapping pair shall be in the array for string2 in the same relative position.

Except for case conversion, the characters specified by a character class expression shall be placed in the array in an unspecified order.
...

However, in a case conversion, as described previously, such as:

tr -s '[:upper:]' '[:lower:]'

the last operand's array shall contain only those characters defined as the second characters in each of the toupper or tolower character pairs, as appropriate."



I interpret this to mean that even though there are 59 lower-case and 56 upper-case characters in en_US, there are a fixed number of toupper case-mapping pairs, and there are likewise a fixed number of tolower case-mapping pairs. Therefore, [:upper:] and [:lower:] expand to the same number of array entries (whether that is 59 pairs or 56 pairs is irrelevant), and mappings like "tr '[:lower:] ' '[:upper:]_'" must unambiguously convert space to underscore and also guarantee that no lower-case letter becomes an underscore.
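
To illustrate that reading, here is a minimal sketch in the C locale, where [:lower:] and [:upper:] each hold exactly the 26 ASCII letters, so the pairing is unambiguous. The sample input and the variable name `out` are my own choices for illustration, not part of the bug report.

```shell
# Assumption: C locale, so [:lower:] and [:upper:] are the 26 ASCII letters.
# string1 and string2 have the same length (26 case pairs plus one extra
# character), so the space maps to the underscore and no letter does.
out=$(printf 'hello world\n' | LC_ALL=C tr '[:lower:] ' '[:upper:]_')
printf '%s\n' "$out"
```

Here every lower-case letter is replaced by its upper-case partner and only the space turns into an underscore. In a locale such as en_US the class sizes differ, which is exactly where the pair-based interpretation matters.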

Your question is basically what we should do about the unspecified behavior of '[:lower:] ' '[:upper:]', where string1 is longer than string2, since that falls outside the bounds of POSIX.

> I.E. 0xDE (the last upper char) is output from:
>
>   $ echo "_ _" | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]'

That matches the behavior we choose in all other instances where string1 is longer than string2, where GNU tr follows BSD behavior of padding the last character of string2 to meet the length of string1.
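
For anyone unfamiliar with the padding, a small sketch with plain characters rather than classes follows. It assumes GNU tr (BSD pads the same way, but POSIX leaves this case unspecified), and the variable names `padded` and `truncated` are only for illustration.

```shell
# Assumption: GNU tr. string1 has 5 characters and string2 only 2, so the
# last character of string2 ('y') is repeated, making string2 act as 'xyyyy'.
padded=$(printf 'abcde\n' | LC_ALL=C tr 'abcde' 'xy')
printf '%s\n' "$padded"

# GNU tr's -t (--truncate-set1) instead shortens string1 to 'ab',
# so c, d and e pass through unchanged.
truncated=$(printf 'abcde\n' | LC_ALL=C tr -t 'abcde' 'xy')
printf '%s\n' "$truncated"
```

With literal characters the padded byte is predictable because the user wrote string2 in a definite order. With a class at the end of string2 that is no longer true.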

But since POSIX is clear that the order of [:upper:] mappings is unspecified, I agree that we cannot give the user any useful guarantee about which byte gets duplicated to fill out the conversion, and we are better off rejecting that attempted usage.


> That seems quite inconsistent given that other classes
> are not allowed in string 2 when translating:
>
>   $ echo "ab ." | LANG=en_US tr '[:digit:]' '[:alpha:]'
>   tr: when translating, the only character classes that may appear in
>   string2 are `upper' and `lower'
>
> For consistency I think it better to keep the classes
> in string 2 just for case mapping, and do something like:
>
>   $ tr '[:upper:] ' '[:lower:]'
>   tr: when not truncating set1, a character class can't be
>   the last entity in string2

I'd rather see it phrased:

When string2 is shorter than string1, a character class can't be the last entity in string2.


> Note BSD allows extending the above, but that's at least
> consistent with any class being allowed in string2.
> I.E. this is disallowed by coreutils but Ok on BSD:
>
>   $ echo "1 2" | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]'
>   1A2

The BSD behavior violates an explicit POSIX wording; we can't do an extension like that without either turning on a POSIXLY_CORRECT check or adding a command line option, neither of which I think is necessary. So I see no reason to copy the BSD behavior of allowing any character class.


> Is it OK to change tr like this?
> I can't see anything depending on that.

Seems reasonable to me, once we decide on the error message wording.

--
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


