bug#9252: a bug in cut
Sat, 6 Aug 2011 11:19:06 -0600
forcemerge 9252 9253
retitle 9252 cut does not yet support unicode characters
tags 9252 + notabug
Danilo Moraes wrote:
> I have found a little bug (i guess). See that:
Thank you for the report. You have discovered that coreutils does not
yet have localization support for wide characters.
> echo $a | cut -c -5 # shows danil
$ echo "danilo" | od -tx1 -c
0000000  64  61  6e  69  6c  6f  0a
          d   a   n   i   l   o  \n
> echo $a | cut -c 5 # shows dáni
I think you meant "cut -c-5" there.
$ echo "dánilo" | od -tx1 -c
0000000  64  c3  a1  6e  69  6c  6f  0a
          d 303 241   n   i   l   o  \n
As you can see, accented characters are not simple single-byte
characters. The od output shows their byte values. In UTF-8 the
accented 'a' occupies two bytes (c3 a1). Because -c currently behaves
the same as -b, cut counts it as two positions.
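To make the byte counting concrete, here is a small sketch. It writes the
word with octal escapes so it does not depend on the encoding of this
message, and it uses -b so the result is the same on any version of cut.
The variable names are only for illustration.

```shell
# "dánilo" in UTF-8: the accented 'a' is the two bytes 303 241 (octal).
# wc -c always counts bytes; tr strips the padding some wc versions add.
byte_count=$(printf 'd\303\241nilo' | wc -c | tr -d ' ')

# Bytes 1-5 are d, c3, a1, n, i: only four visible characters.
first_five_bytes=$(printf 'd\303\241nilo\n' | cut -b -5)

echo "bytes in word: $byte_count"        # bytes in word: 7
echo "cut -b -5:     $first_five_bytes"  # cut -b -5:     dáni
```

Six characters, seven bytes, and the first five bytes hold only four
characters. That matches the "dáni" result in the report.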
> The option -b equal works. The cut is ignoring the letters with acentuation.
Sorry but that code has not yet been written.
> I read in infopages this:
Thank you for consulting the documentation! And I say that
seriously. So many people ignore it. It is pleasant to hear that you
read it.
> `-c CHARACTER-LIST'
> Select for printing only the characters in positions listed in
> CHARACTER-LIST. The same as `-b' for now, but
> internationalization will change that. Tabs and backspaces are
> treated like any other character; they take up 1 character. If an
> output delimiter is specified, (see the description of
> `--output-delimiter'), then output that string between ranges of
> selected bytes.
> "The same as `-b' for now, but
> internationalization will change that." this solves my problem? How it
Note that it says "internationalization /will/ change that". That is a
statement about the future; it has not yet happened. Once the code is
written and put into coreutils, -c will select characters instead of
bytes.
Note that some software distributions carry patches that add Unicode
support to coreutils. But so far none of those patches have been
deemed appropriate to install in the upstream source, because of
maintainability problems such as code duplication.
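In the meantime, one possible workaround (a sketch only, not something
coreutils provides) is to convert the text to an encoding where every
character is a single byte, cut by bytes, and convert back. The helper
name first_chars is hypothetical. It assumes UTF-8 input that contains
only characters available in ISO-8859-1, and that iconv is installed;
iconv will report an error for anything outside that set.

```shell
# Hypothetical helper: print the first N characters of each input line.
# Assumes UTF-8 input limited to characters that exist in ISO-8859-1,
# where each character is exactly one byte, so cut -b counts characters.
first_chars() {
    iconv -f UTF-8 -t ISO-8859-1 | cut -b "-$1" | iconv -f ISO-8859-1 -t UTF-8
}

result=$(printf 'd\303\241nilo\n' | first_chars 5)
echo "$result"   # dánil
```

This only helps for Western European text. For arbitrary Unicode input a
tool that is itself multibyte-aware would be needed.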
Because this is not a bug in cut, and is also a well-known issue, I am
going to go ahead and close the report. That does not mean no further
discussion is possible. Please feel free to respond.