[debbugs-tracker] bug#29606: closed (Command 'fold' dangerous with utf-8

emacs-bug-tracker

From:	GNU bug Tracking System
Subject:	[debbugs-tracker] bug#29606: closed (Command 'fold' dangerous with utf-8 input)
Date:	Sat, 09 Dec 2017 03:16:02 +0000

Your message dated Fri, 8 Dec 2017 20:15:12 -0700
with message-id <address@hidden>
and subject line Re: bug#29606: Command 'fold' dangerous with utf-8 input
has caused the debbugs.gnu.org bug report #29606,
regarding Command 'fold' dangerous with utf-8 input
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
29606: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=29606
GNU Bug Tracking System
Contact address@hidden with problems

--- Begin Message ---
Subject:	Command 'fold' dangerous with utf-8 input
Date:	Thu, 7 Dec 2017 11:10:02 +0100 (CET)
User-agent:	Alpine 2.02 (DEB 1266 2009-07-14)

Dear maintainers,

I am using fold version 8.13 on a Debian 3.2.93-1

cat filename | fold

If 'filename' contains utf8 characters consisting of more than one byte,
fold will consider breaking the line inside such a character. There is no
option to stop it doing that.

Except, of course "-s": break at spaces. But that may not be what the user
wants.

According to man-page, it counts columns by default, not bytes. This seems
not to be true. The switch "-b": count bytes, has no influence on the
output in my test case.

How to fix this?

I presume that either (1) the default behavior (counting columns) is not
what I expect, namely to count characters instead of bytes. This would
have to be clarified in man-page.

or (2) that the default isn't what the man-page says it is: possibly the
default set in the code is to count bytes. This would be an error.

or (3) that 'fold' fails to read my "LANG" environment variable which
clearly states a UTF-8 locale. This, in 2017, is an error.

Please write back to address@hidden if you need example
data or clarifications.

Thank you,
Mark Roberts
--- End Message ---

--- Begin Message ---
Subject:	Re: bug#29606: Command 'fold' dangerous with utf-8 input
Date:	Fri, 8 Dec 2017 20:15:12 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0

Hello Mark,

First,
thank you for taking the time and effort
to test our development snapshot, and reporting results back.
This kind of feedback is critical in getting multibyte support ready.


Second,
I can confirm the behavior you are observing, reproduced here
with 'od' for easier output:

## POSIX single-byte locale:

$ echo "ß" | LC_ALL=C src/fold --bytes --width 1 | od -tc -An
 303  \n 237  \n
$ echo "ß" | LC_ALL=C src/fold         --width 1 | od -tc -An
 303  \n 237  \n

## UTF8 locale:

$ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --bytes --width 1 | od -tc -An
 303 237  \n

$ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold         --width 1 | od -tc -An
 303 237  \n
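
[Editorial illustration, not part of the original message.] The difference
between the two locales above can be sketched in a few lines of Python. This
is only a rough model under stated assumptions: the helper names fold_bytes
and fold_chars are hypothetical, the code counts characters rather than
display columns, and it is not how coreutils implements fold.

```python
# Illustrative sketch only; not the coreutils implementation.

def fold_bytes(data, width):
    """Split raw bytes every `width` bytes, ignoring character boundaries."""
    return [data[i:i + width] for i in range(0, len(data), width)]


def fold_chars(text, width):
    """Split decoded text every `width` characters, never inside a character."""
    return [text[i:i + width] for i in range(0, len(text), width)]


line = "ß"                   # U+00DF, encoded in UTF-8 as 0xC3 0x9F
raw = line.encode("utf-8")

# Byte-wise folding, as in the single-byte C locale run.
byte_pieces = fold_bytes(raw, 1)

# Character-wise folding, as in the UTF-8 locale run.
char_pieces = [piece.encode("utf-8") for piece in fold_chars(line, 1)]

# Octal byte values, for comparison with the od output.
print([[oct(b) for b in piece] for piece in byte_pieces])  # [['0o303'], ['0o237']]
print([[oct(b) for b in piece] for piece in char_pieces])  # [['0o303', '0o237']]
```

In the byte-wise case neither piece is valid UTF-8 on its own, which is the
hazard the original report describes.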


On 2017-12-08 05:04 AM, Mark Roberts wrote:
> When --bytes is not specified, the program treats '\b', '\r' and '\t'
> specially. It assumes a tab width of eight (compile-time #define) and
> attempts to keep track of what the output will look like.
>
> This is absolutely not what I expected.

That is correct, and I share your sentiment: it also took me some time
to try and track down why it behaves this way, and whether it's by
design or a bug.

> But of course, when the program was first written, the words byte and
> character meant the same thing for printable characters. Printable bytes.

The reasoning for this behavior is explained in the OpenGroup's POSIX
standard page for fold, in the "RATIONALE" section:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/fold.html#tag_20_48_18

There, it is made clear:
  "Historical versions of the fold utility assumed 1 byte was one
  character and occupied one column position when written out. This is
  no longer always true.
  [....]
  Note that although the width for the -b option is in bytes, a line is
  never split in the middle of a character."

> Therefore, the current implementation (of the development version) is
> correct.
>
> I will attempt to suggest an improved text for the man-page so that
> others will not be surprised.

I agree that once multibyte support is added to fold(1), the man pages,
the help screen and texi manual must be updated to clearly
indicate the "-b/--bytes" only applies to \b \t \r and never to
multibyte characters.
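
[Editorial illustration, not part of the original message.] The column
tracking discussed in this thread (special handling of '\b', '\r' and '\t',
with a tab width of eight unless --bytes is given) might be modelled roughly
as below. The names adjust_column and line_width are chosen for this sketch,
and iterating over Python characters instead of raw bytes is a
simplification; treat it as an approximation of the described behavior, not
as the actual source.

```python
# Illustrative sketch only; a simplified model of the behavior described.

TAB_WIDTH = 8  # the thread mentions a compile-time tab width of eight


def adjust_column(column, ch, count_bytes):
    """Return the output column after writing `ch` at `column`."""
    if count_bytes:
        return column + 1              # --bytes: every unit advances by one
    if ch == "\b":
        return max(column - 1, 0)      # backspace moves left, never below 0
    if ch == "\r":
        return 0                       # carriage return restarts the line
    if ch == "\t":
        return column + TAB_WIDTH - column % TAB_WIDTH  # next tab stop
    return column + 1


def line_width(text, count_bytes=False):
    """Accumulate the column position over a whole line."""
    column = 0
    for ch in text:
        column = adjust_column(column, ch, count_bytes)
    return column


print(line_width("ab\tc"))         # 9: the tab jumps from column 2 to 8
print(line_width("ab\tc", True))   # 4: with --bytes the tab counts as one
print(line_width("abc\b\b"))       # 1
print(line_width("abc\rxy"))       # 2
```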

If you find the time to send such a patch - great!
If not, I will add it sooner or later (hopefully sooner).

As such I'm closing this bug report, but further discussion (and
patches) are welcomed by replying to this thread.

regards,
 - assaf
--- End Message ---
