bug-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#192: regexp does not work as documented


From: Thomas Lord
Subject: bug#192: regexp does not work as documented
Date: Sun, 11 May 2008 20:30:00 -0700
User-agent: Thunderbird 1.5.0.5 (X11/20060808)

Stefan Monnier wrote:
I have most of the DFA construction code written, but I may take you up
on that anyway.  BTW, regarding the "very expensive in the worst case",
how common is this worst case in real life?

Talking about general purpose use of an engine, as opposed to something
more narrow like "likely font-lock expressions":

It's common enough in "real life" (in my experience) to be exactly the
minimal amount required to make it annoying.

Let me give you some rules of thumb to describe what I mean but I'll
refrain from a long treatise explaining the theory that supports these.
If you have questions I can "unpack" this on list or off.

First, be careful to make a clear distinction between (let's dub the
distinction) "regexps vs. regular expressions".   "Regular expressions"
correspond to formally regular languages.  Regular expressions can
always be converted to a DFA.   "regexps" are what most people use
in most situations.   regexps include things like features to extract the
positions that match a sub-pattern.   regexps do *not* correspond to
regular languages and *can not* always be converted to DFA.   DFA
conversion can help with regexp matching, but it can't solve it completely.
Moreover, you can make pathological regexps (very slow to match) for
every regexp I know of.   Henry Spencer (I've heard) asserts that regexps
are NP complete or something around there, though I haven't seen the
proof.  (Rx is a hybrid regular expression and regexp engine based on an
on-line (caching, incremental) DFA conversion suggested in the "Dragon"
compiler book and originating from (as I recall) an experiment by
Thompson and one of the other Bell Labs guys.  One of whoever it was
privately mentioned that they abandoned it themselves because it was taking
too much code to implement, or something like that.)

That aside, regular expressions are probably plenty for font-lock functionality, so let's just talk about those. On-line DFA conversion for those is likely
practical by multiple means.

The size of a DFA is, worst-case, exponential in the size of the regular expression. This is the source of pathological cases that can thwart any DFA engine. (In exponential cases, Rx keeps running but more like an NFA, taking a corresponding performance hit.)

For a time, I was pushing hard on Rx, trying to use it for absolutely everything. To make up new problems to solve as in "Ok, let's suppose I can run X-Kbyte long regular expressions with lots of | operators, stars, etc. What can I apply that to?"
One example is lexing for real-world computer languages.

[Aside: you can also use the same engine as the core of a shift-reduce parser, of course -- and I
did that a long time ago with Guile, to useful effect.]

Almost always, the expressions were such that Rx would have no problem and give very pleasing results. However, there were definitely times (for huge regular expressions and smaller ones) when things would just absolutely crawl. They happen "just often
enough" to be an annoyance.

Now, every case of that annoyance that I found (in real-life applications) had a solution. I could think for a while on *why* the regular expression I wrote was blowing up and then think of a different approach that eliminated the problem. Most often that meant not a different but equivalent regular expression -- it meant going one level up and changing the way the caller used regular expressions. The caller would retain the same functionality but different demands would be made on the regular expression (or regexp) engine.

And that makes it tricky to drop *blindly* into Emacs. To solve the unavoidable annoying cases you really had to know how regular expressions worked at a deep level. To diagnose
and work around the pathological cases took some expertise.

As a general rule, for something "general purpose" like a libc implementation or the default regexp functions in Emacs lisp, there is something to be said for using NFA matchers.
It's perverse why:

Vast swaths of things that a DFA matcher can do very quickly an NFA matcher can not. It's reliably slow. Therefore, people not prepared to think hard about regexps tend to use an NFA matcher in pretty limited ways. A powerful regular expression engine could simplify their code, at the cost of risking finding a "hard case" -- but there's no issue since
people quickly give up and don't try to push the match engine that hard.

More concisely:

If you use a DFA matcher You Better Know What You're Doing but also
Most of the Time You Won't Need To So It's Very Convenient.

If you use an NFA matcher You Can Get Away With Not Really Knowing What
You're Doing but also You Won't Be Trying Anything Fancy, Perhaps Losing Out.

One approach to toss into the mix, this one suggested by Per Bothner years ago, is to consider *offline* DFA conversion (a la 'lex(1)'). The advantage of offline (batch) conversion is that you can burn a lot of cycles on DFA minimization and, if your offline converter terminates, you've got a reliably linear matcher. The disadvantages for *many* uses of regular expressions in Emacs should be pretty obvious. For something like font-lock, where the regular expressions don't change that often, that might be a good approach -- precompile a minimal DFA and then add support
for "regular expression continuations" when using those tables.

-t







reply via email to

[Prev in Thread] Current Thread [Next in Thread]