bug-coreutils

From: Paul Eggert
Subject: bug#34524: wc: word count incorrect when words separated only by no-break space
Date: Sun, 24 Feb 2019 09:47:02 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0

Bruno Haible wrote:
> I would find it best to introduce an option '--unicode'
> to 'wc', that would produce Unicode compliant results, at the cost of
>    - not following POSIX to the letter,

It'd make sense to have an option. How about a more general option, --words, that would let the user define what a word is? Its operand could use ERE (POSIX extended regular expression) syntax, or a shorthand beginning with '+' for common combinations. For example, the command:

wc --words='[[:alnum:]]+'

would say that a word consists of the longest contiguous sequence of alphanumeric characters. And

wc --words='+unicode'

would use the Unicode definition of word, whatever it is.
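
To make the ERE case concrete, here is a rough sketch of how the proposed behavior could be approximated with tools that exist today. The helper name count_words is hypothetical, and --words itself is only a proposal in this thread, not an implemented wc option. The sketch assumes a grep that supports -o (print each match on its own line) and -a (treat input as text), as GNU and BSD grep do.

```shell
#!/bin/sh
# Hypothetical helper, not part of wc: approximates the proposed
# `wc --words=ERE` by printing every longest, non-overlapping match of
# the ERE on its own line (grep -o) and then counting those lines.
count_words() {
    # -a: treat input as text, -o: one match per line, -E: ERE syntax.
    # tr strips the padding that some wc implementations add.
    grep -aoE -- "$1" | wc -l | tr -d ' '
}

# "foo" and "bar" separated only by a UTF-8 no-break space
# (bytes 0xC2 0xA0, written in octal for portability).
printf 'foo\302\240bar\n' | count_words '[[:alnum:]]+'
```

Since the no-break space is not an alphanumeric character, this should print 2 for the sample input. What '[[:alnum:]]' matches beyond ASCII depends on the current locale, so counts for text containing non-ASCII letters may differ between locales.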




