Re: tr is handling bytes not characters

bug-coreutils

From:	Nick Demou
Subject:	Re: tr is handling bytes not characters
Date:	Tue, 10 Feb 2009 18:06:00 +0200

On Tue, Feb 10, 2009 at 12:59 PM, Jim Meyering <address@hidden> wrote:
> Nick Demou <address@hidden> wrote:
>> [...]
>> Thanks for the info Eric. I was almost sure this would be the case. In
>> fact I don't consider this as the main topic of my bug report. The
>> main topic for me is the documentation. The man and info page don't
>> make it clear that utf-8 is not supported. I believe that others after
>> me will spend a lot of time just to realize that "it's just a missing
>> feature".  Do you have any thoughts regarding my suggestions on the
>> documentation?
>
> The "real" documentation is in coreutils.texi (generated to
> coreutils.info and available via "info coreutils").  There,
> under "tr invocation", it already has this caveat:

Oops, mea culpa.
I did read the man page carefully, and I did search the coreutils info
manual before submitting this bug report. However, I only searched for
"utf" and "unicode", so I missed the warning, which contains neither of
those strings.

> and since "man tr" does point to the authoritative source [the info pages]:
> [...]
> that may be enough.

I think it is for English-speaking users, but not for non-English-speaking
ones who often have to deal with actual[1] UTF-8 text. I would
suggest the following small corrections:

A. for the info page
====================

add a direct reference to UTF-8 and Unicode like this:

from:
#   Currently `tr' fully supports only single-byte characters.
# Eventually it will support multibyte characters;

to:
#   Currently `tr' fully supports only single-byte characters.
# Eventually it will support multibyte characters (e.g. UTF-8
# or UTF-16 encoded Unicode characters);
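
To illustrate why UTF-8 and UTF-16 count as multibyte here, the following is a minimal Python sketch (not taken from coreutils) that prints how many bytes a few characters occupy in each encoding. The helper name `encoded_lengths` and the sample characters are my own choices for illustration.

```python
# Sketch: how many bytes does one character need in UTF-8 and UTF-16?
# The helper name and the samples are illustrative assumptions.

def encoded_lengths(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-le" is used so that no byte-order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

samples = [
    "a",       # ASCII letter
    "\u00e9",  # Latin small letter e with acute
    "\u03b1",  # Greek small letter alpha
    "\u20ac",  # euro sign
]

for ch in samples:
    u8, u16 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: {u8} byte(s) in UTF-8, {u16} byte(s) in UTF-16")
```

Only the ASCII letter fits in a single byte, and only in UTF-8; every other sample needs two or three bytes, which is exactly the case a byte-oriented tool does not handle.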

B. for the man page
===================

add a reference like this:

#  Currently `tr' fully supports only single-byte characters.
# (Notable examples of multibyte characters that are not
# supported are UTF-8 and UTF-16 encoded Unicode characters.)

C. for the coreutils FAQ
========================

add a Question like this one:

# Q: What's the status of Unicode support?

(for which I cannot suggest a thorough answer, although I could try to
dig something out of the current documentation if no one else is able
to help at the moment)

or

# Q: I get funny/no/wrong results when dealing with
#    UTF-8/Unicode input

# A: The UTF-8 and UTF-16 encodings of Unicode text are made up
#    of multibyte characters, which are not well supported
#    by some coreutils programs.
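
As a rough illustration of the "funny/wrong results", here is a small Python simulation of what a byte-for-byte mapping does to UTF-8 text. It is only a sketch of the general effect, not tr's actual implementation; the function names `translate_bytes` and `translate_chars` are hypothetical.

```python
# Sketch: byte-wise versus character-wise translation of UTF-8 text.
# This simulates a byte-oriented mapping; it is not coreutils code.

def translate_bytes(data, src, dst):
    """Map each byte of src to the byte at the same position in dst."""
    return data.translate(bytes.maketrans(src, dst))

def translate_chars(text, src, dst):
    """Map each character of src to the character at the same position in dst."""
    return text.translate(str.maketrans(src, dst))

text = "\u03b1\u03b2"  # Greek alpha followed by beta
src = "\u03b1"         # replace alpha ...
dst = "\u00e9"         # ... with e-acute

# Character-wise: only alpha changes, beta is left alone.
by_chars = translate_chars(text, src, dst)

# Byte-wise: alpha is the bytes CE B1 and e-acute is C3 A9, so the
# mapping becomes CE -> C3 and B1 -> A9. Beta is CE B2, so its first
# byte gets rewritten too, and it turns into a different character.
by_bytes = translate_bytes(
    text.encode("utf-8"), src.encode("utf-8"), dst.encode("utf-8")
).decode("utf-8")

print(ascii(by_chars))  # '\xe9\u03b2'
print(ascii(by_bytes))  # '\xe9\xf2'
```

The beta was never mentioned in the request, yet the byte-wise version changes it into U+00F2. With other inputs the result may not even be valid UTF-8.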

___________________
[1] UTF-8 text containing characters outside the ASCII set

--
"The software is licensed, not sold" -- MICROSOFT LICENSE TERMS

Current Thread

tr is handling bytes not characters, Nick Demou, 2009/02/05
- Re: tr is handling bytes not characters, Eric Blake, 2009/02/05
  - Re: tr is handling bytes not characters, Nick Demou, 2009/02/06
    - Re: tr is handling bytes not characters, Jim Meyering, 2009/02/10
    - Re: tr is handling bytes not characters, Nick Demou <=
    - Re: tr is handling bytes not characters, Jim Meyering, 2009/02/11
