
Re: tr(1) with multibyte character support


From: Pádraig Brady
Subject: Re: tr(1) with multibyte character support
Date: Sat, 16 Sep 2017 14:04:23 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 15/09/17 21:31, Pádraig Brady wrote:
> On 15/09/17 00:15, Assaf Gordon wrote:
>> Hello,
>>
>> I'm looking into adding multibyte support to tr(1), and interested in
>> some feedback.
>>
>>
>> 1. "-C" vs "-c"
>> ---------------
>>
>> The POSIX tr(1) page says:
>> "-c  Complement the set of values specified by string1.
>>  -C  Complement the set of characters specified by string1."
>> ( http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html )
>>
>> This I take to mean:
>>     "-c" is single-bytes (=values) regardless of locale,
>>     "-C" is multibyte characters, depending on locale.
>>
>> First,
>> Is the above correct?
> 
> The standard is a bit confusing here but I think the above is correct.
> I find it strange that the byte/char distinction is only made for 
> --complement.
> Also, which one does --complement imply? It should probably be -C, I
> suppose, since I'm guessing -c is an older option left specifying
> bytes for backwards-compat reasons.
> 
>> Second,
>> Assuming it is correct, is the following expected output correct?
>>
>> The UTF-8 sequence '\316\243' is U+03A3 GREEK CAPITAL LETTER SIGMA 'Σ'.
>> The UTF-8 sequence '\316\250' is U+03A8 GREEK CAPITAL LETTER PSI 'Ψ'.
>>
>> POSIX unibyte locale and lower-case "-c":
>>
>>   printf '\316\243\316\250' | LC_ALL=C tr -dc '\316\250'
>>   => '\316\316\250'
>>
> 
> ack
> 
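> E.g. looking at it at the byte level, the input is the four bytes
> 316 243 316 250, and -dc deletes every byte not in the set {316, 250}:
> 
> $ printf '\316\243\316\250' | LC_ALL=C tr -dc '\316\250' | od -An -to1
>  316 316 250
> 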
>>
>> UTF-8 locale but lower-case "-c", input set should be treated
>> as two separate single-byte octets:
>>
>>   printf '\316\243\316\250' | LC_ALL=en_US.UTF-8 tr -dc '\316\250'
>>   => '\316\316\250'
>>
> 
> ack
> 
>> POSIX unibyte locale and upper-case "-C", input set should be treated
>> as two separate single-byte octets:
>>
>>   printf '\316\243\316\250' | LC_ALL=C tr -dC '\316\250'
>>   => '\316\316\250'
> 
> Right, if hard_locale() == false,
> which might not be the case on some setups that assume UTF8
> 
>> UTF-8 locale with upper-case "-C", input set is one multibyte character:
>>
>>   printf '\316\243\316\250' | LC_ALL=en_US.UTF-8 tr -dC '\316\250'
>>   => '\316\250'
> 
> ack
> 
>> 2. Invalid multibyte sequences in SET1/SET2 parameters
>> ------------------------------------------------------
>>
>> I assume that invalid multibyte sequences in the *input* file
>> must be output as-is (in accordance with other coreutils programs).
> 
> Right. Well we talked about that previously
> (and the separate program for preprocessing data)
> 
>> However, what about invalid sequences in SET1/SET2 parameters?
>> Can we reject them (and fail/refuse to run) ?
>>
>> That is, in POSIX locale, both of these are valid and mean the same
>> thing (delete two octet values):
>>
>>      LC_ALL=C tr -d '\316\250'
>>      LC_ALL=C tr -d '\250\316'
> 
> ack
> 
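> Indeed, both orderings delete the same two byte values:
> 
> $ printf '\316\243\316\250' | LC_ALL=C tr -d '\316\250' | od -An -to1
>  243
> $ printf '\316\243\316\250' | LC_ALL=C tr -d '\250\316' | od -An -to1
>  243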
> 
>> But in a UTF-8 locale, should we accept the invalid sequence:
>>
>>      LC_ALL=en_US.UTF8 tr -d '\250\316'
>>
>> and treat it (silently) as two separate octets, or should we exit with
>> an error message (e.g. "SET1 is not valid in this locale") ?
> 
> It would be nice to error out, providing feedback for invalid chars, but...
> 
>> 3. backward incompatibility
>> ---------------------------
>>
>> Also related to the previous item,
>> I think tr(1) might be a case where adding multibyte support breaks
>> existing scripts, and is seen as a regression by users.
>> If someone used commands like
>>    tr -d '\200-\377'
>>    tr -d '\316\250'
>> And these have worked for many years regardless of locale, adding
>> multibyte support might disrupt this.
>>
>> What do you think? Perhaps this usage is not so common, and it won't be
>> too big of a disruption?
> 
> Well, it's not silent corruption, which is better.
> This gets back to my question as to why -C was introduced to
> seemingly cater for this ambiguity, while the non-complemented case
> is left with backwards compat issues like this.
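>
> E.g. a long-standing idiom like stripping all non-ASCII bytes:
>
> $ printf 'a\316\243b' | tr -d '\200-\377'
> ab
>
> could change behavior if '\200-\377' were instead interpreted per the
> locale charset.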
> 
> I guess the question boils down to:
> is it better to provide backwards compat by falling back to byte mode
> for invalid chars, or is it better to provide feedback for invalid
> chars specified in the SET?
> 
> Let's look at FreeBSD for comparison:
> 
> $ export LC_ALL=en_US.UTF-8
> $ printf '\316\243\316\250\n' | tr -d '\316\250'
> ΣΨ
> $ printf '\316\243\316\250\n' | tr -d '\250\316'
> ΣΨ
> $ printf '\316\243\316\250\n' | tr -d $'\316\250'
> Σ
> $ printf '\316\243\316\250\n' | tr -d $'\250\316'
> tr: Illegal byte sequence
> 
> So you can see that there, tr does not concatenate the octal escapes
> into multi-byte chars. Also it doesn't warn about these ineffective
> specifications, which can thus never be characters in the input.
> That's also in opposition to the POSIX standard you linked, which
> states:
> 
> "\octal
> ...
> Multi-byte characters require multiple, concatenated escape sequences
> of this type, including the leading <backslash> for each byte."
> 
> Maybe FreeBSD just ignored this part of the standard due to the
> backwards incompat issue and the ease with which one can specify
> multi-byte chars directly. I.e. never treat \octal escapes as part of
> a multi-byte char, only treat them as byte values?
> 
> Here's another related part of the standard:
> 
> "The earlier version also said that octal sequences referred to collating 
> elements
> and could be placed adjacent to each other to specify multi-byte characters.
> However, it was noted that this caused ambiguities because tr would not be 
> able
> to tell whether adjacent octal sequences were intending to specify multi-byte 
> characters
> or multiple single byte characters. POSIX.1-2008 specifies that octal 
> sequences always
> refer to single byte binary values when used to specify an endpoint of a 
> range of collating elements."
> 
> Right, so I'm leaning towards the FreeBSD behavior and having octal sequences
> always refer to single byte characters.

Thinking a bit more about this: always interpreting '\316\250' as byte
values (like BSD) doesn't avoid all backwards compat concerns if you're
going to be processing the input as multibyte chars. I.e. currently we
process in unibyte, resulting in:

  $ octd() { od -to1 --address-radix=none; }
  $ printf '\316\243\316\250' | LC_ALL=en_US.UTF-8 tr -dc '\316\250' | octd
  316 316 250

That would change to a noop if processing the input as UTF-8 (like
BSD), and changing silently to a noop is problematic.
So maybe we should switch to unibyte processing if any \oct is
specified? What if a script is processing a file like:

  tr '\123' '\321' < file.bin > file.bin2

It would be nice not to break that. Also, up until now, specifying \oct
would have meant the user wanted to process the input as unibyte, as
it's not practical to process multi-byte streams considering bytes
individually.
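For example, deleting an individual byte of a multi-byte char corrupts it:

  $ printf '\316\243\316\250' | LC_ALL=C tr -d '\243' | octd
  316 316 250

i.e. Σ loses its second byte, leaving an invalid sequence in the output.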
For edge cases like this it might be useful to have a --debug option
that would output a warning if we were ignoring the user's locale charset.
In any case I like the BSD behavior of not treating \oct\oct as multibyte,
especially considering the ambiguity in the last referenced paragraph
of the POSIX standard above.

Related to this is the case where it is valid and optimal to specify
certain bytes when processing multi-byte data. It would be a useful
optimization to consider these unibyte ranges and operate in unibyte
mode. I.e. for speed it would be worth processing input as unibyte for
`tr -d '\n'` etc. You can process input in unibyte for all locales if
the characters being transformed are < 0x30 (limited by GB18030), or
for UTF-8 if they are < 0x7F.
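To illustrate why that's safe: in UTF-8 every byte of a multi-byte char
is >= 0x80, so deleting an ASCII byte like '\n' can never split a
multi-byte sequence:

  $ printf '\316\243\n\316\250\n' | LC_ALL=C tr -d '\n' | octd
  316 243 316 250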

cheers,
Pádraig


