bug#9252: a bug in cut

From: Bob Proulx
Subject: bug#9252: a bug in cut
Date: Sat, 6 Aug 2011 11:19:06 -0600
User-agent: Mutt/1.5.21 (2010-09-15)

forcemerge 9252 9253
retitle 9252 cut does not yet support unicode characters
tags 9252 + notabug
close 9252

Danilo Moraes wrote:
> I have found a little bug (i guess). See that:

Thank you for the report.  You have discovered that coreutils does not
yet have localization support for wide characters.

> a=danilo
> echo $a | cut -c -5 # shows danil

  $ echo "danilo" | od -tx1 -c
  0000000  64  61  6e  69  6c  6f  0a
            d   a   n   i   l   o  \n

> a=dánilo
> echo $a | cut -c 5 # shows dáni

I think you meant "cut -c-5" there.

  $ echo "dánilo" | od -tx1 -c
  0000000  64  c3  a1  6e  69  6c  6f  0a
            d 303 241   n   i   l   o  \n

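As you can see, accented characters are not simple single-byte
characters.  The od output shows their byte values.  In UTF-8 the
accented 'a' is encoded as two bytes (c3 a1).  Because cut -c currently
counts bytes, it counts that one character as two, so -c-5 selects the
five bytes that make up "dáni".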
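
The byte counts in the od output can also be checked with a few lines
of Python.  This is only a sketch: it assumes the input is UTF-8, and
the helper name first_bytes is made up for the illustration.

```python
# Illustration only: byte count versus character count for the two
# strings from the report, assuming UTF-8 encoding.

def first_bytes(line: bytes, n: int) -> bytes:
    """Return the first n bytes of line (a byte-wise selection)."""
    return line[:n]

# The same byte sequences that od printed (without the trailing newline).
samples = (b"danilo", b"d\xc3\xa1nilo")

for raw in samples:
    text = raw.decode("utf-8")
    # hex(" ") needs Python 3.8 or newer.
    print(raw.hex(" "), "bytes:", len(raw), "chars:", len(text))

# Taking five bytes of the accented string stops after "i", because the
# accented letter already used up two of the five.
print(first_bytes(samples[1], 5))
```

Both strings have six characters, but the accented one is seven bytes
long, which is why a five-byte selection ends one letter early.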

> The option -b equal works. The cut is ignoring the letters with acentuation.

Sorry but that code has not yet been written.

> I read in infopages this:

Thank you for consulting the documentation!  And I say that
seriously.  So many people ignore it.  It is pleasant to hear that you
read it.

> `--characters=CHARACTER-LIST'
>      Select for printing only the characters in positions listed in
>      CHARACTER-LIST.  The same as `-b' for now, but
>      internationalization will change that.  Tabs and backspaces are
>      treated like any other character; they take up 1 character.  If an
>      output delimiter is specified, (see the description of
>      `--output-delimiter'), then output that string between ranges of
>      selected bytes.
> "The same as `-b' for now, but
>      internationalization will change that." this solves my problem? How it
> works?

Note that it says "internationalization /will/ change that", which is
a statement about the future.  It has not happened yet.  Once the code
is written and merged into coreutils, -c will select characters
rather than bytes.
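
In the meantime a character-aware selection has to be done with some
other tool.  As a rough sketch of the difference, here is a small Python
function that decodes the line first and then counts characters.  The
name first_chars is hypothetical and not part of coreutils, the input
is assumed to be UTF-8, and it counts code points only, so combining
sequences are not handled.

```python
# Sketch only: select the first n characters (code points) of an
# encoded line, instead of the first n bytes.

def first_chars(line: bytes, n: int, encoding: str = "utf-8") -> bytes:
    """Decode line, keep the first n characters, and re-encode."""
    text = line.decode(encoding)
    return text[:n].encode(encoding)

line = b"d\xc3\xa1nilo"        # the bytes of the accented example

print(first_chars(line, 5))    # character-wise: six bytes, five characters
print(line[:5])                # byte-wise: five bytes, four characters
```

The character-wise result keeps the "l" that the byte-wise selection
drops.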

Note that some software distributions carry patches that add Unicode
support to coreutils.  But so far none of those patches has been
deemed appropriate for the upstream source because of maintainability
concerns such as code duplication.

Because this is not a bug in cut and is also a well-known issue, I am
going to go ahead and close the report.  That does not mean no
further discussion is possible.  Please feel free to respond.

