bug-coreutils

Cut not working with multi-byte UTF-8 characters


From: Patrik Hirvinen
Subject: Cut not working with multi-byte UTF-8 characters
Date: Sun, 09 Jul 2006 17:48:00 +0300
User-agent: Mozilla Thunderbird 1.0.8 (X11/20060502)

Hi,

This bug was found on an Ubuntu 5.10 GNU/Linux x86 system using cut version 5.2.1. The locale in use was en_US.UTF-8.

When fed text that includes multi-byte characters, cut assumes that one byte corresponds to one character, even though the locale clearly indicates otherwise.

Attached is an example file containing, in UTF-8, the character U+00E4 (ä) followed by a newline; in hexadecimal, the bytes are 0xc3 0xa4 0x0a. "cut -c 1 example.bin" should thus produce 'ä', yet its output is identical to that of "cut -b 1 example.bin" (the single byte 0xc3), not to that of "cut -b 1-2 example.bin" (both bytes of the character) as it should be.
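To make the difference between byte selection and character selection concrete, here is a minimal Python sketch. It is not cut's implementation: the helper names cut_bytes and cut_chars are hypothetical, and the sketch simply assumes the input line is valid UTF-8, as in the example file described above.

```python
# Bytes of the example file as described in the report:
# U+00E4 encoded in UTF-8 (0xc3 0xa4), followed by a newline (0x0a).
EXAMPLE = b"\xc3\xa4\n"


def cut_bytes(line: bytes, first: int, last: int) -> bytes:
    """Select bytes first..last (1-based, inclusive), in the spirit of `cut -b`.

    Hypothetical helper for illustration only.
    """
    return line[first - 1:last]


def cut_chars(line: bytes, first: int, last: int, encoding: str = "utf-8") -> bytes:
    """Select characters first..last (1-based, inclusive) after decoding.

    Hypothetical helper showing what a locale-aware `cut -c` would be
    expected to return; assumes the line is valid in the given encoding.
    """
    text = line.decode(encoding)
    return text[first - 1:last].encode(encoding)


# Work on the line without its trailing newline, as cut does per line.
line = EXAMPLE.rstrip(b"\n")

print(cut_bytes(line, 1, 1))  # b'\xc3'      (first byte only, an incomplete sequence)
print(cut_chars(line, 1, 1))  # b'\xc3\xa4'  (the whole character U+00E4)
```

In this sketch, selecting character 1 returns both bytes of the character, whereas selecting byte 1 returns only the lead byte 0xc3, which is the output the report describes for "cut -c 1".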

Thanks

Patrik Hirvinen
address@hidden
+358-(0)40-7186320

Attachment: example.bin
Description: Binary data

