Re: Scan of regexps in Emacs (March 17)
From: Mattias Engdegård
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Tue, 2 Apr 2019 16:15:13 +0200
On 2 Apr 2019, at 09:33, Paul Eggert <address@hidden> wrote:
>
>> don't we also need a precise description of exactly how they are interpreted
>> by the engine?
>
> In other parts of Emacs, we are typically OK with specs that don't completely
> specify behavior. This gives us more freedom to make changes in the
> undocumented behavior later. I think it makes sense to do that here too, for
> regular expressions like "[z-a-m]" that most readers would find confusing.
Then where does a user go to understand extant regexps? (Do we have any
latitude at all for changing even obscure corners of regexp syntax and
semantics today?) That's why I favour expounding on the details in a separate
section.
>> The terminology is a bit confusing. Is 'raw 8-bit byte' included in
>> 'unibyte'? Is \x7f ever a raw 8-bit byte?
>> I agree that [å-\xff], say, should be invalid but I've never seen such
>> constructs.
>
> After looking into it I realized that I don't really know the semantics here
> (the text I recently added there seems to be wrong, in some cases), and I
> have my doubts that anyone else knows the semantics either. The attached
> patch simply gets rid of that section, leaving the area undocumented. User
> beware!
Apparently I don't really know it either -- I just discovered that:
(string-match "\xff" "\xff") => 0
(string-match "[\xff]" "\xff") => 0
(string-match "\xffé?" "\xff") => nil
(string-match "[\xff]é?" "\xff") => 0
(string-match "\xff" "\xffé") => 0
(string-match "[\xff]" "\xffé") => nil
(string-match "\xffé?" "\xffé") => 0
(string-match "[\xff]é?" "\xffé") => nil
> OK, then we should document z-a as the preferred syntax (best go with the
> flow...). Done in the attached patch.
Actually, the only place where I saw z-a was in auctex (in negated form,
[^z-a]).
>> As an experiment, I added detection of 'chained' ranges like [a-m-z] to xr
>> and found a handful in both Emacs and GNU ELPA, but none of them carried a
>> freeload of bugs. Keeping that check didn't seem worthwhile; the regexps may
>> be a bit odd-looking, but aren't wrong.
>
> It depends on what one means by "wrong". If one wants to use the ranges in
> both Emacs and grep they are "wrong", so it's reasonable for the manual to
> recommend against them.
Definitely agree that it should be discouraged. I've attached the ones found by
a modified relint/xr, in case you are interested.
> It might also help for the trawler to warn about [X-Z] where Z = X+2. [XYZ]
> is clearer and less error-prone than [X-Z]. I shoehorned that into the
> attached patch too.
These seem to be rare; I found exactly one occurrence
(lisp/gnus/message.el:1291):
"[ \t]\\|[][!\"#$%&'()*+,-./0-9;<=>address@hidden|}~]+:"
which uses the punny range ,-. (possibly by benign accident).
Similarly, singleton ranges, X-X, are non-existent save for --- which I presume
is an XEmacs workaround.
The latest xr version warns about 2-character ranges, except within digits,
because [0-1] and the like were found to be common and harmless.
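
To make the rule concrete, here is a rough sketch of such a check. The function name is hypothetical and this is not xr's actual code; it only illustrates "flag a range that spans exactly two characters unless both ends are ASCII digits":

```elisp
;; Hypothetical predicate for illustration only; xr's real
;; implementation may differ.
(defun my-two-char-range-suspect-p (from to)
  "Return non-nil if the range FROM-TO looks like a suspect 2-character range.
FROM and TO are the character codes of the range endpoints.  Ranges
lying entirely within the ASCII digits are considered harmless."
  (and (= to (1+ from))                        ; exactly two characters
       (not (and (<= ?0 from) (<= to ?9)))))   ; digits are exempt

(my-two-char-range-suspect-p ?0 ?1)     ; => nil (digit range, exempt)
(my-two-char-range-suspect-p ?a ?b)     ; => t
(my-two-char-range-suspect-p ?\, ?\.)   ; => nil (,-. spans three characters)
```

A check for the three-character case (Z = X+2) would compare against `(+ from 2)` instead.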
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 748ab586af..72ee9233a3 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
...
+A character alternative can include duplicates. For example,
address@hidden is less clear than @samp{[XYa-z]}.
Certainly, but does this need to be mentioned? Overlapping ranges are rarely
written on purpose. Besides, duplication isn't confined to ranges.
More useful, I think, would be to recommend that ranges stay within natural
sequences (letters, digits, etc.) so that a reader needn't consult a character
table to see what is included. Thus [0-9.:/] is good and [.-:] is bad, even
though they denote the same set.
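
For anyone who wants to check that claim without a character table, a small sketch follows. The helper name is made up for this example and it only probes the ASCII range:

```elisp
;; Hypothetical helper, not part of Emacs or xr: list the ASCII
;; characters that a bracket expression matches.
(defun my-ascii-members (bracket-regexp)
  "Return a string of the ASCII characters matched by BRACKET-REGEXP."
  (let ((case-fold-search nil)          ; compare case-sensitively
        (chars nil))
    (dotimes (c 128)
      (when (string-match-p bracket-regexp (string c))
        (push c chars)))
    (apply #'string (nreverse chars))))

(my-ascii-members "[.-:]")      ; => "./0123456789:"
(my-ascii-members "[0-9.:/]")   ; => "./0123456789:"
```

Both forms yield the same thirteen characters, but only the second makes that obvious to a reader.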
address@hidden
+A @samp{-} also appear at the beginning of a character alternative, or
'appears'
chained-ranges.log
Description: Binary data