bug-coreutils

Re: terrible Unicode shattering fold(1) command


From: James Youngman
Subject: Re: terrible Unicode shattering fold(1) command
Date: Tue, 26 Aug 2008 19:52:34 +0100

On Mon, Aug 25, 2008 at 11:34 PM,  <address@hidden> wrote:
> Problem 1: Here we see fold -s busy busting apart UTF-8 characters
> again still.

Unless somebody beats me to it, I will try to look at this problem
(though I'm not familiar with how fold is implemented).
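
For illustration only, here is a minimal sketch of the failure and of one way
around it. It is written in Python rather than fold's own C and is not a
description of how fold is implemented. The names fold_bytes, fold_utf8 and
is_valid_utf8 are made up for this example. It counts bytes rather than
screen columns and assumes the input is valid UTF-8.

```python
def fold_bytes(data: bytes, width: int) -> list[bytes]:
    """Cut every `width` bytes, with no regard for character boundaries."""
    return [data[i:i + width] for i in range(0, len(data), width)]


def fold_utf8(data: bytes, width: int) -> list[bytes]:
    """Cut at most every `width` bytes, preferring character boundaries.

    If not even one whole character fits in `width` bytes, this sketch
    falls back to a raw cut so that it always makes progress.
    """
    if width < 1:
        raise ValueError("width must be positive")
    pieces = []
    start = 0
    while start < len(data):
        end = min(start + width, len(data))
        if end < len(data):
            cut = end
            # UTF-8 continuation bytes have the form 0b10xxxxxx. Back up
            # until the cut lands on the first byte of a character.
            while cut > start and (data[cut] & 0xC0) == 0x80:
                cut -= 1
            if cut > start:
                end = cut
        pieces.append(data[start:end])
        start = end
    return pieces


def is_valid_utf8(chunk: bytes) -> bool:
    """Report whether `chunk` decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "中文字符测试".encode("utf-8")  # six characters, three bytes each
naive = fold_bytes(text, 4)  # cuts at bytes 4, 8, 12, 16
safe = fold_utf8(text, 4)    # cuts only between characters
```

With three-byte characters and a width of 4, the naive cuts land at bytes 4,
8, 12 and 16, and only the cut at 12 falls between two characters, which
matches the "every third chop" observation quoted below.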

> Every third chop falls on a boundary, so is lucky.
>
> No, I did not use `--bytes'. The result is the same as if I did anyway!
>
> Problem 2: Also, when the critical chunk moves past the chopper blade,
> and we now start chopping ASCII, the UTF-8 chopping subsides, but it
> still chops NOT at a blank, yes, that agrees with
>
> `--spaces'
>     Break at word boundaries: the line is broken after the last blank
>     before the maximum line length.  If the line contains no such
>     blanks, the line is broken at the maximum line length as usual.
>
> but don't you see **you leave no option open to respect people's words
> and not bust them apart**. "So don't use the program" you might say.
> Well, the program works nicely on 99% of the lines. Just please add an
> option to also not bust apart people's words no matter what!

FWIW, that is hard in languages like Thai, which is written without
spaces between words, so it is difficult to tell where the words are
and where the reasonable breaks fall.

See for example
http://cpan.uwinnipeg.ca/htdocs/String-Thai-Segmentation/String/Thai/Segmentation.pm.html
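
For text that does separate words with blanks, the requested behaviour (never
break inside a word, even when a word is longer than the line) can be
sketched roughly as follows. This is a hypothetical illustration in Python,
not an existing fold option. The name fold_keep_words is made up here. It
counts characters rather than screen columns, and unlike fold it collapses
runs of whitespace. It would not help for Thai, since it relies entirely on
blanks to find the words.

```python
def fold_keep_words(line: str, width: int) -> list[str]:
    """Break `line` only at whitespace, never inside a word.

    A word longer than `width` is placed on a line of its own and is
    allowed to overflow rather than being split.
    """
    lines = []
    current = ""
    for word in line.split():  # split() drops runs of whitespace
        if not current:
            current = word
        elif len(current) + 1 + len(word) <= width:
            current += " " + word
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


folded = fold_keep_words("aa bbbbbbbbbbbb cc dd", 5)
# The twelve-character word stays whole even though the width is 5.
```

The trade-off is that output lines may exceed the requested width, which is
presumably why it would have to be an option rather than the default.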

James.



