Re: tr is handling bytes not characters
From: Nick Demou
Subject: Re: tr is handling bytes not characters
Date: Tue, 10 Feb 2009 18:06:00 +0200
On Tue, Feb 10, 2009 at 12:59 PM, Jim Meyering <address@hidden> wrote:
> Nick Demou <address@hidden> wrote:
>> [...]
>> Thanks for the info Eric. I was almost sure this would be the case. In
>> fact I don't consider this as the main topic of my bug report. The
>> main topic for me is the documentation. The man and info page don't
>> make it clear that utf-8 is not supported. I believe that others after
>> me will spend a lot of time just to realize that "it's just a missing
>> feature". Do you have any thoughts regarding my suggestions on the
>> documentation?
>
> The "real" documentation is in coreutils.texi (generated to
> coreutils.info and available via "info coreutils"). There,
> under "tr invocation", it already has this caveat:
Oops, mea culpa.
I did read the man page carefully, and I did search the coreutils info
pages before submitting this bug report. However, I searched only for
"utf" and "unicode", so I missed the warning, which contains neither
of those strings.
> and since "man tr" does point to the authoritative source [the info pages]:
> [...]
> that may be enough.
I think it is enough for English-speaking users, but not for
non-English-speaking ones, who often have to deal with actual[1] UTF-8
text. I would suggest the following small corrections:
A. for the info page
====================
add a direct reference to UTF-8 and Unicode like this:
from:
# Currently `tr' fully supports only single-byte characters.
# Eventually it will support multibyte characters;
to:
# Currently `tr' fully supports only single-byte characters.
# Eventually it will support multibyte characters (e.g. UTF-8
# or UTF-16 encoded Unicode characters);
B. for the man page
===================
add a reference like this:
# Currently `tr' fully supports only single-byte characters.
# (Notable examples of multibyte characters that are not
# supported are UTF-8 and UTF-16 encoded Unicode characters.)
C. for the coreutils FAQ
========================
add a Question like this one:
# Q: What's the status of Unicode support?
(for which I cannot suggest a thorough answer, although I could try to
dig something out of the current documentation if no one else is able
to help at the moment)
or
# Q: I get funny/no/wrong results when dealing with
# UTF-8/Unicode input
# A: The UTF-8 and UTF-16 encodings of Unicode text are made up
# of multibyte characters, which are not well supported
# by some coreutils programs.
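
To illustrate the kind of "funny" results such a FAQ entry would be
about, here is a minimal Python sketch (not the actual tr code) that
contrasts byte-wise and character-wise translation of UTF-8 text. The
function names tr_bytes and tr_chars are made up for this example, and
the sketch assumes a byte-oriented tr that pads a shorter SET2 with its
last byte.

```python
def tr_bytes(data: bytes, set1: bytes, set2: bytes) -> bytes:
    """Translate byte by byte, roughly as a byte-oriented tr would."""
    # Assumption: a shorter SET2 is padded by repeating its last byte.
    set2 = set2 + set2[-1:] * (len(set1) - len(set2))
    return data.translate(bytes.maketrans(set1, set2[:len(set1)]))


def tr_chars(text: str, set1: str, set2: str) -> str:
    """Translate character by character, which is what users expect."""
    return text.translate(str.maketrans(set1, set2))


text = "αβγ"  # three Greek letters, two bytes each in UTF-8

# Character-wise: only the alpha is replaced.
by_char = tr_chars(text, "α", "A")

# Byte-wise: "α" is the two bytes 0xCE 0xB1, so both bytes are mapped
# to "A". 0xCE is also the lead byte of beta and gamma, so those
# characters are damaged as well.
by_byte = tr_bytes(text.encode("utf-8"), "α".encode("utf-8"), b"A")

print(by_char.encode("utf-8"))  # b'A\xce\xb2\xce\xb3'
print(by_byte)                  # b'AAA\xb2A\xb3'
```

The second result is no longer valid UTF-8, which is why the output
looks wrong or empty in a UTF-8 terminal.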
___________________
[1] i.e. UTF-8 text containing characters outside the ASCII range
--
"The software is licensed, not sold" -- MICROSOFT LICENSE TERMS