
bug#37659: rx additions: anychar, unmatchable, unordered-or


From: Mattias Engdegård
Subject: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Wed, 23 Oct 2019 11:15:47 +0200

On 22 Oct 2019, at 19:33, Paul Eggert <eggert@cs.ucla.edu> wrote:

>> Thus, instead of 'unordered-or', define the operator in terms of long 
>> matches: 'or-max' (working name) would work like 'or' but guarantee a 
>> longest match, and only permit strings and 'or-max' forms as arguments.
> 
> That's an odd restriction. I'm not sure it's a good idea to add an operator 
> with such a restriction. That is, I know why the restriction is there (it's 
> because of limitations in the Emacs regexp matcher), but it's not clear that 
> users should have to know and understand these details.

The restriction is simple and easy to document. It is not necessary to know the 
underlying reason for it in order to use the construct effectively.

> Moreover, if greed is the longstanding tradition for regexp-opt, shouldn't 
> plain "or" be greedy, to be consistent with other operators?

Yes, I very much favour switching to a DFA engine; is there another way? Even 
then, a backtracking engine would still be needed for backrefs and other messy 
cases. However, that is a far larger amount of work. (Meanwhile, we have 
'posix-string-match' etc. for those who want greed at any cost.)
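
To make the difference concrete, here is a small sketch comparing ordered 
alternation with longest-match behaviour. The helper name 'my-match-end' is 
made up for this illustration; the matchers and 'regexp-opt' are the standard 
Emacs functions.

```elisp
;; Hypothetical helper: return the end position of the match of REGEXP
;; in STRING when searched with MATCHER, or nil if there is no match.
(defun my-match-end (matcher regexp string)
  (save-match-data
    (and (funcall matcher regexp string)
         (match-end 0))))

;; Ordered alternation: the first alternative that succeeds wins,
;; so only "foo" is matched even though "foobar" would also fit.
(my-match-end #'string-match "foo\\|foobar" "foobar")        ; => 3

;; POSIX matching backtracks to find the longest overall match.
(my-match-end #'posix-string-match "foo\\|foobar" "foobar")  ; => 6

;; regexp-opt builds a regexp that matches the longest candidate
;; even with the ordinary (non-POSIX) matcher.
(my-match-end #'string-match
              (regexp-opt '("foo" "foobar"))
              "foobar")                                      ; => 6
```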

The problem that I'm trying to solve here is: how do we make it easy to match 
one of multiple strings --- keywords, say --- in rx? Currently, the answer is 
something like (regexp (regexp-opt my-keywords)), which doesn't integrate well 
with rx user definitions. In addition, the output of one regexp-opt cannot be 
used as input to another.
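
For reference, a minimal sketch of the current workaround. The variable names 
('my-veggies', 'my-meats', 'my-food-regexp') are invented for this example, 
and 'rx-to-string' with a backquoted form is only one way to splice in the 
result. Note that the two keyword lists must be concatenated before the single 
'regexp-opt' call, since the optimised output of one call cannot be fed into 
another.

```elisp
(require 'rx)

;; Hypothetical keyword lists, kept as plain Lisp lists of strings.
(defvar my-veggies '("carrot" "tomato" "cucumber"))
(defvar my-meats '("beef" "chicken" "pork"))

;; Combine the lists first, then optimise once and splice the resulting
;; string regexp into an rx form via the `regexp' construct.
(defvar my-food-regexp
  (rx-to-string
   `(seq word-start
         (regexp ,(regexp-opt (append my-veggies my-meats)))
         word-end)
   t))

;; Usage: find a keyword inside a longer string.
(let ((text "roast chicken"))
  (when (string-match my-food-regexp text)
    (match-string 0 text)))   ; => "chicken"
```

This works, but the keyword sets live outside rx as ordinary variables rather 
than as rx definitions, which is the integration gap described above.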

'or-max' would allow a user to say

(rx-define veggies (or-max "carrot" "tomato" "cucumber"))
(rx-define meats (or-max "beef" "chicken" "pork"))
... (rx (or-max veggies meats)) ...

and get a regexp that is guaranteed to be greedy, well-optimised as if all 
strings were passed to 'regexp-opt' at once, and robust: a small change won't 
change the behaviour radically, and the user won't have to game or second-guess 
the engine in order to produce the desired result.

If, in the future, 'or' becomes greedy, then 'or-max' will just be a synonym.

> If it's too much trouble to make plain "or" greedy, I suggest just 
> documenting it as possibly being greedy and possibly not (that is, document 
> it as being unordered, even if it happens to be ordered now). This will give 
> us more opportunity for optimization later.

That would make rx strictly less useful than string regexps. That is why 
'unordered-or' was a mistake: the unpredictability made it useless in many 
cases, and everyone would just have used regexp-opt (or skipped rx altogether).

It is desirable to have the same semantics for 'or' in rx and \| in string 
regexps; otherwise, translating and understanding become unnecessarily difficult.

We could say that 'or' and \| either match greedily or in left-to-right order. 
However, I'm not sure this solves any problem right now.





