bug#20751: wc -m doesn't count UTF-8 characters properly
From: Glenn Morris
Subject: bug#20751: wc -m doesn't count UTF-8 characters properly
Date: Sat, 06 Jun 2015 14:10:23 -0400
User-agent: Gnus (www.gnus.org), GNU Emacs (www.gnu.org/software/emacs/)
You mailed address@hidden without specifying a Package:, so your bug
report ended up on the help-debbugs list. I have reassigned it to
coreutils. (Please note there is no "wc" package.)
(My mailer is messing up the UTF-8 characters in your report.
Interested parties can see the original at http://debbugs.gnu.org/20751#5 .)
Valdis V toli wrote:
> Version: wc (GNU coreutils) 8.21
>
> When 'wc -m' is invoked, it should print the character count, but it
> miscounts UTF-8 encoded characters. The attached files contain 3, 4 and 6
> bytes, but each has only two UTF-8 encoded characters, which you
> can see with any modern text editor.
>
> wc -c shows the correct number of bytes:
> wc -c *
> 3 3bytes.txt
> 4 4bytes.txt
> 6 6bytes.txt
> 13 total
>
> But wc -m shows an incorrect number of characters:
> wc -m *
> 3 3bytes.txt
> 3 4bytes.txt
> 3 6bytes.txt
> 9 total
>
> But it should be:
> wc -m *
> 2 3bytes.txt
> 2 4bytes.txt
> 2 6bytes.txt
> 6 total
>
> I am using Lubuntu 14.04.2 LTS (lsb_release reports Ubuntu), x86_64
> GNU/Linux 3.13.0-53-generic kernel
>
> P.S.
> If the attachments do not make it through the system, you can test it by
> creating files with the following content:
>
> 3bytes.txt: aa
> 4bytes.txt: aÄ
> 6bytes.txt: að
Attachments at http://debbugs.gnu.org/20751#5
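
For readers who want to check the byte/character distinction independently of wc, here is a minimal Python sketch. It is not part of the original report: the function name byte_and_char_counts and the sample strings are hypothetical, chosen only because the file contents quoted above are garbled in this message. It assumes the input is valid UTF-8 and counts Unicode code points as "characters".

```python
def byte_and_char_counts(data: bytes) -> tuple[int, int]:
    """Return (byte count, code point count) for UTF-8 encoded data."""
    # decode() raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = data.decode("utf-8")
    return len(data), len(text)


# Hypothetical samples (NOT the reporter's attachments): an ASCII letter
# followed by a character that needs 1, 2, or 4 bytes in UTF-8.
samples = {
    "ascii + 1-byte char": "ab".encode("utf-8"),
    "ascii + 2-byte char": "a\u00c4".encode("utf-8"),
    "ascii + 4-byte char": "a\U0001F600".encode("utf-8"),
}

for name, data in samples.items():
    n_bytes, n_chars = byte_and_char_counts(data)
    print(f"{name}: {n_bytes} bytes, {n_chars} characters")
```

Note that the sketch counts every code point in the data, so a trailing newline, if one is present in a file, adds one to both the byte count and the character count.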