bug#23086: 25.1.50; Emacs ignores Unicode line and paragraph separator c

From: Eli Zaretskii
Subject: bug#23086: 25.1.50; Emacs ignores Unicode line and paragraph separator characters
Date: Mon, 17 Jul 2017 18:09:46 +0300

Date: Tue, 22 Mar 2016 18:13:15 +0200
From: Eli Zaretskii
> Cc: address@hidden
From: Philipp Stephani
Date: Tue, 22 Mar 2016 11:42:46 +0100
> > 
> > Type some characters
> > Type some more characters
> > M-q
> > 
> > Expected behavior: Emacs treats these characters as line and paragraph
> > separators: they are displayed as line breaks, M-q doesn't remove them,
> > and forward-paragraph etc. treat the paragraph separator as paragraph
> > end.
> > 
> > Actual behavior: These characters are displayed as one-pixel horizontal
> > whitespace and otherwise ignored.
> > 
> > Also discussed in
> > https://lists.gnu.org/archive/html/emacs-devel/2015-08/msg01043.html.
> > https://www.emacswiki.org/emacs/unicode-whitespace.el supposedly adds
> > support for these characters, but I think proper treatment of Unicode
> > separators should be part of Emacs.
> It is not clear to me what exactly is the requested feature.  Can you
> propose a detailed list of requirements?
> I'm asking because these characters come in Unicode with a non-trivial
> baggage, that is a far cry from just breaking the line; see
>   http://unicode.org/reports/tr14/
>   http://unicode.org/reports/tr29/
> There are also implications on the bidirectional display (it is
> sensitive to where the line and the paragraph begin and end).
> If we want to support these two characters, we should think about
> which parts of the relevant functionality we want to see in Emacs,
> because users will expect that.  In addition, there are other
> white-space characters defined by Unicode, and it would make sense to
> treat them all alike.  I'm not sure it makes sense to support just the
> line-breaking and paragraph-separator parts of only these two
> characters.
> Then there are Emacs-specific issues, for example:
>  . do we treat u+2028 and u+2029 as literal characters, or as a form
>    of EOL encoding?
>  . if the former, how do we distinguish them from newlines on display?
>  . should Isearch find these when looking for "\n"? how about regexp
>    search for "$"?
> There are probably more implications; these are just the ones that popped
> into my mind in 5 sec.  IOW, I think Someone™ should think this over and
> present a detailed proposal.

So I've dusted off this year-old bug report and decided to improve
Emacs in this area.  Here's what I propose:

 . u+2028 and u+2029 (and also perhaps u+0085) will be treated as a form
   of EOL encoding, which means they will not appear on display, and
   will cause the next character to be displayed on the next screen line
 . M-q will remove u+2028, as it removes newlines, and put newlines
   at all EOLs as part of filling
 . M-q will NOT remove u+2029, unless the user wants to refill several
   paragraphs as a single paragraph, and there happens to be a u+2029
   between some of the paragraphs
 . forward-paragraph etc. will treat u+2029 as paragraph end
 . bidi reordering will treat u+2029 as paragraph end
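
To make the filling rules above concrete, here is a rough sketch in
Python (not Emacs code, and not how the implementation would actually
look).  The function name fill_text and the use of textwrap are my own
assumptions for illustration: u+2028 is folded into ordinary
whitespace and refilled like a newline, while u+2029 survives as the
paragraph boundary.

```python
import textwrap

LINE_SEP = "\u2028"  # Unicode LINE SEPARATOR
PARA_SEP = "\u2029"  # Unicode PARAGRAPH SEPARATOR


def fill_text(text, width=70):
    """Refill each paragraph of TEXT to WIDTH columns.

    Hypothetical helper: u+2029 is kept as the paragraph boundary,
    u+2028 is removed the same way a newline would be.
    """
    filled = []
    for para in text.split(PARA_SEP):
        # Treat u+2028 like a newline: turn it into plain whitespace
        # so the words on both sides can be refilled together.
        words = para.replace(LINE_SEP, " ").split()
        # Put ordinary newlines at the new line ends.
        filled.append(textwrap.fill(" ".join(words), width=width))
    # The paragraph separators themselves are never removed.
    return PARA_SEP.join(filled)
```

Filling "aaa", u+2028, "bbb", u+2029, "ccc ddd" to a width of 5 puts a
newline between the words of each paragraph and leaves the single
u+2029 between the two paragraphs untouched.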

There are some compromises in these decisions, but they make the job
much easier and less intrusive, and I think they will advance the
level of our Unicode support quite a bit.


I think we should also make $ match these two characters, in addition
to the newline, but that could be more difficult.  Would someone who
knows their way around regex.c want to work on this part?
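
As an illustration of the intended $ semantics only, here is a small
sketch using Python's re module (it says nothing about how regex.c
would do it).  The EOL pattern and the helper words_at_eol are
hypothetical names of mine: the lookahead succeeds before a newline,
u+2028, u+2029, or at the end of the text, without consuming anything.

```python
import re

# Hypothetical stand-in for the proposed "$": a zero-width match
# before newline, u+2028, u+2029, or at the very end of the text.
EOL = r"(?=[\n\u2028\u2029]|\Z)"


def words_at_eol(text):
    """Return the words that end a line under the proposed semantics."""
    return re.findall(r"\w+" + EOL, text)
```

For comparison, Python's own "$" with re.MULTILINE stops only before
a newline or at the end of the text, so it would miss words that are
followed by u+2028 or u+2029.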

