bug-coreutils

Re: horrible utf-8 performace in wc


From: Pádraig Brady
Subject: Re: horrible utf-8 performace in wc
Date: Wed, 7 May 2008 16:33:00 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Bo Borgerson wrote:
> Jim Meyering wrote:
>> Bo Borgerson <address@hidden> wrote:
>>> I may be misinterpreting your patch, but it seems to me that
>>> decrementing count for zero-width characters could potentially lead to
>>> confusion.  Not all zero-width characters are combining characters, right?
>> It looks ok to me, since there's an unconditional increment
>>
>>                chars++;
>>
>> about 25 lines above, so the decrement would just undo that.
> 
> 
> Right, I guess my question is more about the semantics of `wc -m'.
> Should stand-alone zero-width characters such as the zero-width space be
> counted?
> 
> The attached (UTF-8) file contains 3 characters according to HEAD, but
> only two with the patch.

Interesting. I had thought of that myself,
but assumed iswspace(u"zero-width space") == 1.
Actually, there are no chars where:
  wcwidth(char)==0 && iswspace(char)==1
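
A rough cross-check of the zero-width space from the Python side might look like the sketch below. It consults Python's bundled Unicode database rather than the glibc locale data, so it only approximates what wcwidth() and iswspace() report; the helper name describe() is made up for illustration.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def describe(ch):
    """Return (name, general category, is_combining, is_space) for one character.

    is_combining is based on the canonical combining class, the same
    property that unicodedata.combining() exposes.
    """
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch) != 0,
        ch.isspace(),
    )


print(describe(ZWSP))
print(describe("\u0301"))  # COMBINING ACUTE ACCENT, for comparison
```

Here the zero-width space is reported as a format character (category Cf) that is neither combining nor whitespace, which matches the observation above.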

In the first 65536 code points there are also 404 chars which are
not classed as combining in the Unicode database, but are classed
as zero-width in the glibc locale data at least (the zero-width space
being one of them, as you mentioned). I determined this with the
attached programs:

./zw | python unidata.py | grep " 0 " | wc -l

So I suggest that we don't merge my tweak as is. What we could do is:
1. Find a method to distinguish the above 404 characters at least.
2. Define -m to mean "individual displayable characters" if this is
   what people usually want.
3. Add a new option for this.
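
One possible starting point for option 1 is to group the zero-width, non-combining code points by Unicode general category, so that enclosing marks (Me), format characters (Cf) and the rest can be told apart. This is only a sketch: classify() is a hypothetical helper, the sample list is hand-picked rather than real ./zw output, and unlike the grep pipeline above it does not skip unnamed code points.

```python
import unicodedata
from collections import Counter


def classify(hex_lines):
    """Count general categories for code points given as hex strings.

    Only code points with canonical combining class 0 are counted,
    roughly mirroring the `grep " 0 "` filter.
    """
    counts = Counter()
    for line in hex_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        ch = chr(int(line, 16))
        if unicodedata.combining(ch) == 0:
            counts[unicodedata.category(ch)] += 1
    return counts


# Hand-picked sample (an assumption, not actual ./zw output):
# ZERO WIDTH SPACE, COMBINING ACUTE ACCENT,
# COMBINING CYRILLIC HUNDRED THOUSANDS SIGN, ZERO WIDTH JOINER.
sample = ["200B", "0301", "0488", "200D"]
for category, count in sorted(classify(sample).items()):
    print(category, count)
```

To run it over the real data, the sample list could be replaced with sys.stdin and the script placed at the end of the ./zw pipeline.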

Pádraig.
/* zw.c: print every code point in U+0000..U+FFFF for which wcwidth() is 0. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdio_ext.h>
#include <wchar.h>
#include <locale.h>

int main(void) {

    /* This is a single threaded app, so mark as such for performance. */
    __fsetlocking(stdin, FSETLOCKING_BYCALLER);
    __fsetlocking(stdout, FSETLOCKING_BYCALLER);

    /* wcwidth() depends on LC_CTYPE, so pick up the user's locale. */
    if (!setlocale(LC_CTYPE, "")) { //TODO: What about LC_COLLATE?
        fprintf(stderr, "Warning: locale not supported by glibc, using 'C' locale\n");
    }

    /* Assumes a 32-bit wchar_t, as on glibc, so the loop bound is reachable. */
    wchar_t wc;
    for (wc = 0; wc <= 0xFFFF; wc++) {
        if (wcwidth(wc) == 0) {
            printf("%04X\n", (unsigned int) wc);
        }
    }
    return 0;
}
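# unidata.py: read hex code points (one per line) on stdin and print, for each,
# the code point, 1 or 0 for combining or not, and the Unicode name.
import sys
import unicodedata

for line in sys.stdin:
    code = line.strip()
    c = chr(int(code, 16))
    try:
        print(code, int(unicodedata.combining(c) != 0), unicodedata.name(c))
    except ValueError:
        # unicodedata.name() raises ValueError for unnamed code points.
        print()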
