bug#192: regexp does not work as documented

bug-gnu-emacs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#192: regexp does not work as documented

From:	Thomas Lord
Subject:	bug#192: regexp does not work as documented
Date:	Sun, 11 May 2008 20:30:00 -0700
User-agent:	Thunderbird 1.5.0.5 (X11/20060808)

Stefan Monnier wrote:

I have most of the DFA construction code written, but I may take you up
on that anyway.  BTW, regarding the "very expensive in the worst case",
how common is this worst case in real life?


Talking about general purpose use of an engine, as opposed to something
more narrow like "likely font-lock expressions":

It's common enough in "real life" (in my experience) to be exactly the
minimal amount required to make it annoying.

Let me give you some rules of thumb to describe what I mean but I'll
refrain from a long treatise explaining the theory that supports these.
If you have questions I can "unpack" this on list or off.

First, be careful to make a clear distinction between (let's dub the
distinction) "regexps vs. regular expressions".   "Regular expressions"
correspond to formally regular languages.  Regular expressions can
always be converted to a DFA.   "regexps" are what most people use
in most situations.   regexps include things like features to extract the
positions that match a sub-pattern.   regexps do *not* correspond to
regular languages and *can not* always be converted to DFA.   DFA
conversion can help with regexp matching, but it can't solve it completely.
Moreover, you can make pathological regexps (very slow to match) for
every regexp I know of.   Henry Spencer (I've heard) asserts that regexps
are NP complete or something around there, though I haven't seen the
proof.  (Rx is a hybrid regular expression and regexp engine based on an
on-line (caching, incremental) DFA conversion suggested in the "Dragon"
compiler book and originating from (as I recall) an experiment by
Thompson and one of the other Bell Labs guys.  One of whoever it was
privately mentioned that they abandoned it themselves because it was taking
too much code to implement, or something like that.)

That aside, regular expressions are probably plenty for font-lockfunctionality,so let's just talk about those. On-line DFA conversion for those islikely

practical by multiple means.

The size of a DFA is, worst-case, exponential in the size of the regularexpression.This is the source of pathological cases that can thwart any DFAengine. (In exponentialcases, Rx keeps running but more like an NFA, taking a correspondingperformance hit.)

For a time, I was pushing hard on Rx, trying to use it for absolutelyeverything.To make up new problems to solve as in "Ok, let's suppose I can runX-Kbyte longregular expressions with lots of | operators, stars, etc. What can Iapply that to?"

One example is lexing for real-world computer languages.

[Aside: you can also use the same engine as the core of a shift-reduceparser, of course -- and I

did that a long time ago with Guile, to useful effect.]

Almost always, the expressions were such that Rx would have no problemand givevery pleasing results. However, there were definitely times (for hugeregular expressionsand smaller ones) when things would just absolutely crawl. They happen"just often

enough" to be an annoyance.

Now, every case of that annoyance that I found (in real-lifeapplications) had a solution.I could think for a while on *why* the regular expression I wrote wasblowing up andthen think of a different approach that eliminated the problem. Mostoften thatmeant not a different but equivalent regular expression -- it meantgoing one level up andchanging the way the caller used regular expressions. The caller wouldretain the samefunctionality but different demands would be made on the regularexpression (or regexp) engine.

And that makes it tricky to drop *blindly* into Emacs. To solve theunavoidable annoying casesyou really had to know how regular expressions worked at a deep level.To diagnose

and work around the pathological cases took some expertise.

As a general rule, for something "general purpose" like a libcimplementation or thedefault regexp functions in Emacs lisp, there is something to be saidfor using NFA matchers.

It's perverse why:

Vast swaths of things that a DFA matcher can do very quickly an NFAmatcher can not.It's reliably slow. Therefore, people not prepared to think hard aboutregexps tend to usean NFA matcher in pretty limited ways. A powerful regular expressionengine couldsimplify their code, at the cost of risking finding a "hard case" -- butthere's no issue since

people quickly give up and don't try to push the match engine that hard.

More concisely:

If you use a DFA matcher You Better Know What You're Doing but also
Most of the Time You Won't Need To So It's Very Convenient.

If you use an NFA matcher You Can Get Away With Not Really Knowing What

You're Doing but also You Won't Be Trying Anything Fancy, Perhaps LosingOut.

One approach to toss into the mix, this one suggested by Per Bothneryears ago,is to consider *offline* DFA conversion (a la 'lex(1)'). Theadvantage of offline(batch) conversion is that you can burn a lot of cycles on DFAminimization and,if your offline converter terminates, you've got a reliably linearmatcher. Thedisadvantages for *many* uses of regular expressions in Emacs should bepretty obvious.For something like font-lock, where the regular expressions don't changethat often,that might be a good approach -- precompile a minimal DFA and then addsupport

for "regular expression continuations" when using those tables.

-t

[Prev in Thread]

Current Thread

[Next in Thread]

bug#192: regexp does not work as documented, (continued)

Prev by Date: bug#223: 23.0.60; `talk' does not work
Next by Date: bug#225: 23.0.60; tool bar icons missing or executing wrong commands
Previous by thread: bug#192: regexp does not work as documented
Next by thread: bug#192: regexp does not work as documented
Index(es):
- Date
- Thread