Re: [coreutils] tr: case mapping anomaly


From: Pádraig Brady
Subject: Re: [coreutils] tr: case mapping anomaly
Date: Sat, 25 Sep 2010 07:52:39 +0100

On 25/09/10 00:22, Eric Blake wrote:
> On 09/24/2010 04:47 PM, Pádraig Brady wrote:
>> I was just looking at a bug reported to fedora there where this abort()s
>>
>>   $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]'
> 
> Behavior is already unspecified by POSIX when string1 is longer than
> string2.  But given what POSIX does say:
> 
> "When both the -d and -s options are specified, any of the character
> class names shall be accepted in string2. Otherwise, only character
> class names lower or upper are valid in string2 and then only if the
> corresponding character class (upper and lower, respectively) is
> specified in the same relative position in string1. Such a specification
> shall be interpreted as a request for case conversion. When [:lower:]
> appears in string1 and [:upper:] appears in string2, the arrays shall
> contain the characters from the toupper mapping in the LC_CTYPE category
> of the current locale. When [:upper:] appears in string1 and [:lower:]
> appears in string2, the arrays shall contain the characters from the
> tolower mapping in the LC_CTYPE category of the current locale. The
> first character from each mapping pair shall be in the array for string1
> and the second character from each mapping pair shall be in the array
> for string2 in the same relative position.
> 
> Except for case conversion, the characters specified by a character
> class expression shall be placed in the array in an unspecified order.
> ...
> 
> However, in a case conversion, as described previously, such as:
> 
> tr -s '[:upper:]' '[:lower:]'
> 
> the last operand's array shall contain only those characters defined as
> the second characters in each of the toupper or tolower character pairs,
> as appropriate."
> 
> 
> 
> I interpret this to mean that even though there are 59 lower and 56
> upper in en_US, there are a fixed number of toupper case-mapping pairs,
> and there are likewise a fixed number of tolower case-mapping pairs.
> Therefore, [:upper:] and [:lower:] expand to the same number of array
> entries (whether that is 59 pairs or 56 pairs is irrelevant), and
> mappings like "tr '[:lower:] ' '[:upper:]_'" must unambiguously convert
> space to underscore and also guarantee that no lower-case letter becomes
> an underscore.

Thanks for digging up the relevant POSIX bits.
Yes, I agree that '[:lower:]' '[:upper:]' should
be treated as a unit and not leak into adjacent elements.
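
A minimal sketch of that expectation, assuming the C locale (where the
two classes are just the ASCII letters) and a helper name that is purely
illustrative: the paired classes consume each other, so the trailing
space maps to the underscore and no letter does.

```shell
# Hypothetical helper: upper-case the input and turn spaces into
# underscores. string1 and string2 have the same length here, so this
# stays within what POSIX specifies.
case_unit_demo() {
  printf '%s\n' "$1" | LC_ALL=C tr '[:lower:] ' '[:upper:]_'
}

case_unit_demo 'a b'   # prints A_B
```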

> 
> Your question is basically what should we do on the unspecified behavior
> of '[:lower:] ' '[:upper:]', where string1 is longer than string2, since
> that falls outside the bounds of POSIX.
> 
>> I.E. 0xDE (the last upper char) is output from:
>>
>>   $ echo "_ _" | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]'
> 
> That matches the behavior we choose in all other instances where string1
> is longer than string2, where GNU tr follows BSD behavior of padding the
> last character of string2 to meet the length of string1.
> 
> But, since POSIX is clear that the order of [:upper:] mappings is
> unspecified, I agree that it is not a good guarantee to the user of
> which byte gets duplicated to fill out the conversion, and we are better
> off rejecting that attempted usage.
> 
>>
>> That seems quite inconsistent given that other classes
>> are not allowed in string 2 when translating:
>>
>>   $ echo "ab ." | LANG=en_US tr '[:digit:]' '[:alpha:]'
>>   tr: when translating, the only character classes that may appear in
>>   string2 are `upper' and `lower'
>>
>> For consistency I think it better to keep the classes
>> in string 2 just for case mapping, and do something like:
>>
>>   $ tr '[:upper:] ' '[:lower:]'
>>   tr: when not truncating set1, a character class can't be
>>   the last entity in string2
> 
> I'd rather see it phrased:
> 
> When string2 is shorter than string1, a character class can't be the
> last entity in string2.

OK. That is a bit clearer.
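
For plain characters, the padding rule behind that message is easy to
see. A rough sketch, assuming GNU tr (the -t option is a GNU extension,
and the function names are only illustrative):

```shell
# string2 is shorter than string1, so its last character ('y') is
# reused for the unmatched 'c'.
pad_demo() {
  printf '%s\n' "$1" | tr 'abc' 'xy'
}

# With -t (--truncate-set1), string1 is first cut to the length of
# string2, so 'c' is left untranslated.
truncate_demo() {
  printf '%s\n' "$1" | tr -t 'abc' 'xy'
}

pad_demo 'abc'        # prints xyy
truncate_demo 'abc'   # prints xyc
```

When a class is the last entity in string2 there is no well-defined
"last character" to reuse, which is the case the diagnostic rejects.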

>> Note BSD allows extending the above, but that's at least
>> consistent with any class being allowed in string2.
>> I.E. this is disallowed by coreutils but Ok on BSD:
>>
>>   $ echo "1 2" | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]'
>>   1A2
> 
> The BSD behavior violates an explicit POSIX wording; we can't do an
> extension like that without either turning on a POSIXLY_CORRECT check or
> adding a command line option, neither of which I think is necessary.  So
> I see no reason to copy the BSD behavior of allowing any character class.

Yes, I agree. I was just pointing out what BSD does here.

>> Is it OK to change tr like this?
>> I can't see anything depending on that.
> 
> Seems reasonable to me, once we decide on the error message wording.

Great, I'll change it as above.

cheers,
Pádraig.


