Re: Scan of regexps in Emacs (March 17)
From: Mattias Engdegård
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Thu, 21 Mar 2019 12:15:57 +0100
On 20 March 2019 at 23:01, Paul Eggert <address@hidden> wrote:
>
> On 3/19/19 7:20 PM, Stefan Monnier wrote:
>> I wonder why the doc doesn't just say that `-` should be the last
>> character and not mention the other possibilities which just make the
>> rule unnecessarily complex.
Agreed, that is what the 'how to write regexps' part of the docs should say.
But don't we also need a precise description of exactly how they are
interpreted by the engine? Otherwise, a user cannot read and understand
existing code. (Unless he or she uses xr!) Perhaps there needs to be a separate
'gritty details' section.
> * The doc already says that regular expressions like "*foo" and "+foo"
> are problematic (they're confusing, and POSIX says the behavior is
> undefined) and should be avoided. REs like "[a-m-z]" and "[!-[:alpha:]]"
> and "[[:alpha:]-~]" are problematic in the same way and also should be
> avoided.
I'm with Stefan here; `-' should go last. Anything else is a gritty detail.
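To make the 'hyphen last' rule concrete, here is a minimal sketch in Emacs Lisp. The constant name is made up for this illustration; only `string-match-p' is a real Emacs function.

```elisp
;; Hypothetical name, used only for this illustration.
(defconst demo-ident-chars "[a-z_-]"
  "Character alternative with a literal `-' placed last.")

;; A trailing `-' cannot start a range, so it stands for itself.
(string-match-p demo-ident-chars "-")   ; => 0, the hyphen itself matches
(string-match-p demo-ident-chars "q")   ; => 0, via the range a-z
(string-match-p demo-ident-chars "5")   ; => nil, digits are not in the set
```

Written this way, a reader does not need to know any of the other placement rules to see what the set contains.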
> * The doc doesn't clearly say when the Emacs range behavior is an
> extension to POSIX; saying this will help people know better when they
> can export Emacs regular expressions to other programs.
Documenting differences from POSIX regexps is useful. Would you prefer to have
those differences spread out, or all concentrated into one section?
These days, a user may be more familiar with the various PCRE dialects than
traditional or extended POSIX. Should that be taken into account?
> * The doc is confused (and there's a comment about this) about what
> happens when one end of a range is unibyte and the other is multibyte. I
> added something saying that if one bound is a raw 8-bit byte then the
> other should be a unibyte character (either ASCII, or a raw 8-bit byte).
> I don't see any good way to specify the behavior when one bound is a raw
> 8-bit byte and the other bound is a multibyte character, in such a way
> that it's a natural extension of the documented behavior, so the
> documentation now recommends against that.
The terminology is a bit confusing. Is 'raw 8-bit byte' included in 'unibyte'?
Is \x7f ever a raw 8-bit byte?
I agree that [å-\xff], say, should be invalid but I've never seen such
constructs.
> * We might as well go ahead and say that [b-a] matches nothing, as
> enough code (ab)uses regexps in that way, and there is value in having a
> simple regular expression that always fails to match. However, I expect
> that we should say that users should avoid wilder examples like [~-!] so
> that the trawler can catch them as typos.
It already does, and some bugs were found that way. As a special case, it no
longer complains about z-a because that is unlikely to be an accident and
occurs in some code on purpose.
I'm not sure it's a good idea to document reversed ranges as a recommended way
to match any or no character (although the description of the semantics would
belong in a 'gritty details' section), or to say that only [Y-X] where Y=X+1
should be used for that purpose. More about that in a separate post.
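For readers who have not seen the idiom, a small sketch of the reversed-range semantics under discussion. The constant names are invented here, and this only illustrates current behaviour; it is not a recommendation.

```elisp
;; Hypothetical names, used only for this illustration.
(defconst demo-never-match "[b-a]"
  "Reversed range: an empty set, so it matches no character.")
(defconst demo-any-char "[^b-a]"
  "Complement of the empty set: matches any character.")

(string-match-p demo-never-match "abc")  ; => nil
(string-match-p demo-any-char "\n")      ; => 0, newline included
```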
> These new recommendations ("should"s in the attached patch) will give
> the trawler license to diagnose questionable REs like "[a-m-z]",
> "[!-[:alpha:]]", "[~-!]", and (my favorite) "[\u00FF-\xFF]". There is no
> change to actual Emacs behavior.
As an experiment, I added detection of 'chained' ranges like [a-m-z] to xr and
found a handful in both Emacs and GNU ELPA, but none of them turned out to be
bugs. Keeping that check didn't seem worthwhile; the regexps may be a bit
odd-looking, but aren't wrong.
[!-[:alpha:]] is already detected since xr parses it correctly and will
complain about the duplication of ':'. The reverse, [[:digit:]-z], is seen
occasionally but again does not seem to be a serious bug proxy.
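To spell out how these two odd-looking forms are read, here is a small sketch using plain `string-match-p' (not xr). The function name is made up for this illustration, and the comments describe my reading of the engine: after a complete range a following `-' is literal, and a `[' used as a range end does not open a character class.

```elisp
;; Hypothetical helper, used only for this illustration.
(defun demo-bracket-parse-results ()
  "Return match positions showing how two odd bracket forms are read."
  (let ((case-fold-search nil))
    (list
     ;; [a-m-z] is the range a-m, then a literal `-', then a literal `z'.
     (string-match-p "[a-m-z]" "-")          ; => 0
     (string-match-p "[a-m-z]" "q")          ; => nil, q is outside a-m
     ;; [!-[:alpha:]] is the set { !..[ , `:', a, l, p, h } followed by
     ;; a literal `]'; it is not a range ending in the class [:alpha:].
     (string-match-p "[!-[:alpha:]]" "A]")   ; => 0
     (string-match-p "[!-[:alpha:]]" "z]")))) ; => nil, z is not in the set

(demo-bracket-parse-results)  ; => (0 nil 0 nil)
```

The repeated `:' inside the second set is what gives the mistake away.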
Much as I would like to outlaw ranges where a typical programmer has to consult
an ASCII table to understand what's included, they just seem too common, with
too many false positives, to merit inclusion in xr.
Nevertheless I had a quick look and extracted a few that might merit attention;
see attachment.
Similarly, a rule finding [X-Y] where Y=X+1 found one or two questionable cases
in a sea of false positives (also in the attachment).
Attachment: possibly-broken-regexps.log (binary data)