bug#26362: tr -cd -- Problem with UTF-8?

bug-coreutils

From:	Assaf Gordon
Subject:	bug#26362: tr -cd -- Problem with UTF-8?
Date:	Tue, 4 Apr 2017 22:19:15 -0400

tags 26362 notabug wishlist
stop 26362

Hello,

> On Apr 4, 2017, at 10:01, Ronald Schaten <address@hidden> wrote:
> 
> I'm not sure if this is bug or if I'm using it wrong.

Neither: GNU tr simply does not yet support multibyte characters.

> The simplest way to reproduce this looks like this (sorry, umlaut
> ahead):
> 
> $ echo -ne "\xc3\x82" | tr -cd "ä" | xxd
> % 00000000: c3                                       .
> 
> The echo prints a capital A with a circumflex (Â), and I expect the tr
> command to delete everything except the small umlaut ä. It looks as if
> tr just deletes the second byte.

What happened here is this:
'tr' currently reads its string argument (SET1) as a sequence of single bytes,
so it treats it as if you had given two octets: \xC3 \xA4 (the UTF-8 encoding
of small a with umlaut).
With -c and -d it then reads the input octet by octet and deletes every octet
that is not in that set: \xC3 is kept and \x82 is deleted.
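
To make the byte-wise behaviour visible, here is a small sketch. The helper
name 'hexbytes' is made up for this illustration, and the octets are written
as octal escapes (\303\244 is \xC3 \xA4, \303\202 is \xC3 \x82) so that the
commands do not depend on the terminal encoding:

```shell
# Hypothetical helper: print stdin as a run of hex octets, e.g. "c3a4".
hexbytes() {
    od -An -tx1 | tr -d ' \n'
}

# The two octets that tr sees when SET1 is a small a with umlaut.
printf '\303\244' | hexbytes; echo

# Capital A with circumflex (c3 82) filtered through the complement of {c3, a4}:
# c3 is in the set and is kept, 82 is not and is deleted.
printf '\303\202' | LC_ALL=C tr -cd '\303\244' | hexbytes; echo
```

The second command should print only "c3", matching the xxd output in the
report.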

> When I try without the umlaut it gives me the empty result, as expected:
> 
> $ echo -ne "\xc3\x82" | tr -cd "a" | xxd

Indeed, because here you're asking to
keep only octets whose value is \x61 (the ASCII value of 'a') -
neither "\xC3" nor "\x82" matches, and so both are deleted.
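
The same thing can be checked by counting what survives. This is only a
sketch; 'surviving_octets' is a name invented for this example:

```shell
# Hypothetical helper: count the octets left after `tr -cd SET`.
surviving_octets() {
    LC_ALL=C tr -cd "$1" | wc -c | tr -d ' '
}

# Neither c3 nor 82 equals 0x61, so nothing survives.
printf '\303\202' | surviving_octets 'a'

# With two literal 'a' octets around it, only those two survive.
printf 'a\303\202a' | surviving_octets 'a'
```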

> For the moment, I'll try to solve my problem differently, but... is this
> a bug? Thanks in advance!

Not a bug - but a yet-missing feature.
For relevant discussion see here:
   https://debbugs.gnu.org/cgi/bugreport.cgi?bug=24924#8

As a temporary work-around, you can use GNU sed, which is multibyte-aware:

  $ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^ä]//g'
  ä
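
Note that this relies on sed running in a UTF-8 locale. A hedged sketch that
makes the locale explicit follows; it assumes 'locale -a' lists at least one
UTF-8 locale (the exact name differs between systems), and the names
'utf8_locale', 'a_umlaut' and 'keep_a_umlaut' are invented for this example:

```shell
# Assumption: some UTF-8 locale is installed; pick the first one listed.
utf8_locale=$(locale -a 2>/dev/null | grep -i 'utf-\{0,1\}8$' | head -n 1)

# Build the UTF-8 encoding of U+00E4 from octal escapes (script stays ASCII-only).
a_umlaut=$(printf '\303\244')

# Hypothetical helper: delete every character except U+00E4 on each line.
keep_a_umlaut() {
    LC_ALL=$utf8_locale sed "s/[^$a_umlaut]//g"
}

# Only run the demonstration when a UTF-8 locale was found.
if [ -n "$utf8_locale" ]; then
    printf 'abc \303\244\303\202 def\n' | keep_a_umlaut | od -An -tx1
fi
```

With GNU sed the od output should show c3 a4 followed by the newline (0a):
the two-octet character \xC3 \x82 is removed as a whole instead of being
split.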

'sed' also supports "character equivalence classes".
In the following example, all characters except those that are equivalent to
'a' will be deleted:

  $ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^[=a=]]//g'
  aäÂ

Character equivalence classes will work with a future 'tr' as well,
once multibyte support is added.

Lastly,
"echo -en" is not portable. It is recommended to use "printf" instead.
"printf" (as provided by GNU coreutils and by bash) has the added advantage
that it supports Unicode code points directly, so you do not have to know the
UTF-8 encoding of a Unicode character, e.g.:
     printf "\u00c2\n"
will print capital A with circumflex (and will work in other locales if they
support this character, not just UTF-8).

I'm thus marking this item as "wishlist" and "notabug",
but I'll keep it open until it is implemented.
Discussion can continue by replying to this thread.

regards,
 - assaf
