bug#23086: 25.1.50; Emacs ignores Unicode line and paragraph separator characters
Mon, 17 Jul 2017 18:09:46 +0300
> Date: Tue, 22 Mar 2016 18:13:15 +0200
> From: Eli Zaretskii <address@hidden>
> Cc: address@hidden
> > From: Philipp Stephani <address@hidden>
> > Date: Tue, 22 Mar 2016 11:42:46 +0100
> > Type some characters
> > C-x 8 RET LINE SEPARATOR (or PARAGRAPH SEPARATOR)
> > Type some more characters
> > M-q
> > Expected behavior: Emacs treats these characters as line and paragraph
> > separators: they are displayed as line breaks, M-q doesn't remove them,
> > and forward-paragraph etc. treat the paragraph separator as paragraph
> > end.
> > Actual behavior: These characters are displayed as one-pixel horizontal
> > whitespace and otherwise ignored.
> > Also discussed in
> > https://lists.gnu.org/archive/html/emacs-devel/2015-08/msg01043.html.
> > https://www.emacswiki.org/emacs/unicode-whitespace.el supposedly adds
> > support for these characters, but I think proper treatment of Unicode
> > separators should be part of Emacs.
> It is not clear to me what exactly is the requested feature. Can you
> propose a detailed list of requirements?
> I'm asking because these characters come in Unicode with a non-trivial
> baggage, that is a far cry from just breaking the line; see
> There are also implications on the bidirectional display (it is
> sensitive to where the line and the paragraph begin and end).
> If we want to support these two characters, we should think about
> which parts of the relevant functionality we want to see in Emacs,
> because users will expect that. In addition, there are other
> white-space characters defined by Unicode, and it would make sense to
> treat them all alike. I'm not sure it makes sense to support just the
> line-breaking and paragraph-separator parts of only these two
> characters.
> Then there are Emacs-specific issues, for example:
> . do we treat u+2028 and u+2029 as literal characters, or as a form
> of EOL encoding?
> . if the former, how do we distinguish them from newlines on display?
> . should Isearch find these when looking for "\n"? how about regexp
> search for "$"?
> There are probably more implications; these are just the ones that
> popped into my mind in 5 sec. IOW, I think Someone™ should think this
> over and present a detailed proposal.
So I've dusted off this year-old bug report and decided to improve
Emacs in this area. Here's what I propose:
. u+2028 and u+2029 (and also perhaps u+0085) will be treated as a
  form of EOL encoding, which means they will not appear on display,
  and will cause the next character to be displayed on the next screen
  line
. M-q will remove u+2028, as it removes newlines, and put newlines
  at all EOLs as part of filling
. M-q will NOT remove u+2029, unless the user wants to refill several
  paragraphs as a single paragraph, and there happens to be a u+2029
  between some of the paragraphs
. forward-paragraph etc. will treat u+2029 as paragraph end
. bidi reordering will treat u+2029 as paragraph end
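To make the intended filling behavior concrete, here is a rough sketch
in Python (not Emacs code) of the M-q rules listed above. The names
refill, fill_paragraph and split_paragraphs are hypothetical and exist
only for this illustration; the sketch assumes that u+2028, u+0085 and
newline are all soft line ends that filling may remove, while u+2029
is a paragraph boundary that filling keeps.

```python
import re
import textwrap

LINE_SEP = "\u2028"   # LINE SEPARATOR
PARA_SEP = "\u2029"   # PARAGRAPH SEPARATOR
NEXT_LINE = "\u0085"  # NEL, the "perhaps" case in the proposal

# Characters that filling may remove and replace with a single space.
# This set is an assumption made for the sketch.
SOFT_BREAKS = re.compile("[ \t\n" + LINE_SEP + NEXT_LINE + "]+")


def split_paragraphs(text):
    """Split on u+2029 only; filling must not cross this boundary."""
    return text.split(PARA_SEP)


def fill_paragraph(paragraph, width=70):
    """Drop soft line ends, then re-wrap with plain newlines."""
    words = SOFT_BREAKS.split(paragraph.strip())
    return textwrap.fill(" ".join(words), width=width)


def refill(text, width=70):
    """Fill each paragraph on its own and keep every u+2029 in place."""
    return PARA_SEP.join(fill_paragraph(p, width) for p in split_paragraphs(text))


if __name__ == "__main__":
    sample = "aaa" + LINE_SEP + "bbb" + PARA_SEP + "ccc\nddd"
    print(repr(refill(sample)))
```

Refilling several paragraphs as one, the exception mentioned for
u+2029, would correspond to calling fill_paragraph on the whole text
after replacing u+2029 with a space; the sketch leaves that out.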
There are some compromises in these decisions, but they make the job
much easier and less intrusive, and I think they will advance the
level of our Unicode support quite a bit.
I think we should also make $ match before these two characters, in
addition to the newline, but that could be more difficult. Would
someone who knows their way around regex.c want to work on this part?
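As an illustration of the semantics only, here is a small Python
sketch; it says nothing about how regex.c would have to be changed.
Python's own multiline $ matches only before "\n", so the sketch
spells out the wider end-of-line test as an explicit lookahead. The
names EOL and find_at_eol are hypothetical.

```python
import re

# End of line in the proposed sense: before "\n", u+2028 or u+2029,
# or at the very end of the text.  A lookahead consumes nothing,
# which mirrors how $ behaves.
EOL = r"(?=[\n\u2028\u2029]|\Z)"


def find_at_eol(pattern, text):
    """Return all matches of PATTERN that end at a line end as defined by EOL."""
    return re.findall(pattern + EOL, text)


if __name__ == "__main__":
    text = "foo\u2028bar\u2029baz\nqux"
    # Built-in multiline $ sees only the newline and the end of text.
    print(re.findall(r"\w+$", text, re.MULTILINE))
    # The explicit lookahead also stops before u+2028 and u+2029.
    print(find_at_eol(r"\w+", text))
```

With the sample text, the built-in $ finds only "baz" and "qux",
while the lookahead version finds all four words.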
- bug#23086: 25.1.50; Emacs ignores Unicode line and paragraph separator characters,
Eli Zaretskii <=