Re: horrible utf-8 performace in wc
From: Pádraig Brady
Subject: Re: horrible utf-8 performace in wc
Date: Thu, 8 May 2008 14:41:06 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)
Bruno Haible wrote:
> As a consequence:
> - The number of characters is the same as the number of wide characters.
> - "wc -m" must output the number of characters.
> - In a Unicode locale, <U00E9> is one character, and <U0065><U0301> is
> two characters,
Fair enough.
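To make the quoted distinction concrete, here is a minimal Python sketch. It is an illustration only, not wc's actual implementation, and the helper name count_code_points is hypothetical. It counts Unicode code points, which is what the quoted text says "wc -m" must report in a Unicode locale.

```python
import unicodedata


def count_code_points(text: str) -> int:
    """Count Unicode code points (hypothetical helper, not part of wc)."""
    # A Python str is a sequence of code points, so len() counts them.
    return len(text)


precomposed = "\u00e9"    # <U00E9>: e with acute as a single code point
decomposed = "e\u0301"    # <U0065><U0301>: e followed by a combining acute

print(count_code_points(precomposed))   # 1
print(count_code_points(decomposed))    # 2

# Canonical composition (NFC) folds the pair into the single code point.
print(count_code_points(unicodedata.normalize("NFC", decomposed)))   # 1
```

Both strings render the same, but only the canonicalized form counts as one character.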
> If you want wc to count characters after canonicalization, then you can
> invent a new wc command-line option for it.
I guess one could possibly have --chars={unicode,glyph,grapheme,column}
with unicode being the default, matching how it currently works.
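As a rough sketch of how two of those modes might differ, assuming Python and its standard unicodedata module: the function count_chars and its mode names are hypothetical, and no such wc option exists. The "column" mode here is only an approximation (combining marks take no column, East Asian wide and fullwidth characters take two). The "glyph" and "grapheme" modes are left out because they would need font data or grapheme cluster rules that the standard library does not provide.

```python
import unicodedata


def count_chars(text: str, mode: str = "unicode") -> int:
    """Count characters in a hypothetical --chars=MODE style (sketch only)."""
    if mode == "unicode":
        # One per code point, as in the current behaviour described above.
        return len(text)
    if mode == "column":
        width = 0
        for ch in text:
            if unicodedata.combining(ch):
                continue  # combining marks occupy no column of their own
            # Wide and fullwidth characters are assumed to take two columns.
            width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        return width
    raise ValueError(f"mode not covered by this sketch: {mode}")


print(count_chars("e\u0301", "unicode"))   # 2
print(count_chars("e\u0301", "column"))    # 1
```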
> But I would find it more useful
> to have a filter program that reads from standard input and writes the
> canonicalized output to standard output; that would be applicable in many
> more situations.
That would be _very_ useful, yes.
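A minimal sketch of such a filter, assuming Python and NFC as the canonical form; the thread does not say which form was intended, and the function name canonicalize_stream is hypothetical. It works line by line, which is safe because a newline never combines with neighbouring characters.

```python
import io
import unicodedata


def canonicalize_stream(src, dst, form: str = "NFC") -> None:
    """Copy text from src to dst, normalizing each line to the given form."""
    for line in src:
        dst.write(unicodedata.normalize(form, line))


# Demonstration with in-memory streams standing in for stdin and stdout.
source = io.StringIO("caf" + "e\u0301" + "\n")
sink = io.StringIO()
canonicalize_stream(source, sink)
print(len(sink.getvalue()))   # 5: c, a, f, U+00E9, newline
```

To use it as a real filter one would call canonicalize_stream(sys.stdin, sys.stdout) after importing sys, so that its output could be piped straight into "wc -m".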
thanks for all the great info in this thread,
Pádraig.
Re: horrible utf-8 performace in wc, Jan Engelhardt, 2008/05/07
Re: horrible utf-8 performace in wc, Jim Meyering, 2008/05/07
Re: horrible utf-8 performace in wc, Bruno Haible, 2008/05/08
Re: horrible utf-8 performace in wc, Bruno Haible, 2008/05/08