chicken-hackers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Chicken-hackers] irregex speed notes


From: Alex Shinn
Subject: Re: [Chicken-hackers] irregex speed notes
Date: Fri, 10 Apr 2009 12:21:49 +0900
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.3 (darwin)

Jim Ursetto <address@hidden> writes:

> It's been pointed out on IRC, but I wanted to mention here (in case it
> is not general knowledge) that the transition from PCRE to irregex
> imposed a severe performance penalty.
>
> This is on a 400MB Apache log file, run like this:
>
> $ csc -O2 -o a a.scm; ./a -:h4m
> $ irb a.rb
>
> Chicken 3.4.0: 11s
> Chicken 3.5.2: 2m 20s
> Chicken 4.0.0: 2m 15s
> Ruby 0.9: 4s
>
> If anyone is interested in optimization, please direct your efforts here. :)

Thanks for this sample case, I will keep it (and log parsing
in general) in mind when I do benchmarks and optimizing.

For another comparison, per http://swtch.com/~rsc/regexp/regexp1.html

  (define (make-pat n)
    (let ((s (make-string (* 3 n) #\a)))
      (do ((i 1 (+ i 2)))
          ((>= i (* n 2)) s)
        (string-set! s i #\?))))

  (string-match (make-pat n) (make-string n #\a))

PCRE will take exponentially long in n, and IrRegex will
return immediately.  (Actually, I think depending on the
version of PCRE it just crashed or returns incorrect
results.)

So, one of the reasons I haven't done any (nothing, nada)
optimizations for IrRegex yet is because I want the big
picture.  I want to know what all the corner cases and
common cases are, so that when I do speed things up I can
have a nice side-by-side comparison and see that my efforts
were worthwhile.

Also, any fine tuning now will just make it harder to
maintain in the interim (especially the more Chicken
diverges from upstream), and harder to switch to
algorithmically faster alternatives (such as vector lookups
instead of alists for state transitions).

There are also a number of different things to optimize:

  * regex compilation (can be worked around by pre-compiling)
  * regex representation (use less memory)
  * handling more patterns as DFA (important!)
  * bumming the matcher inner loops (easy)
  * algorithmic improvements like table lookups (memory trade-off)
  * bumming and special-casing the closure-compiled matcher
  * optimizing match-data extraction
  * making regexen serializable

and some of these conflict with each other.

Now, handling more patterns as DFA is important, but
difficult.  For example, in Jim's script one of the patterns
used ^, which translates to the `bol' SRE (beginning of
line).  That's actually a zero-width look-behind assertion,
and results in using the closure-compiled matcher, not using
the DFA at all (I checked with Jim on IRC and since that
only affects a small percent of the lines it wasn't the
issue, but is something to be aware of).

If you want to match the beginning of the string use \A (or
the `bos' SRE).  There are special case optimizations that
let us represent that as a DFA.  It may be possible to add
similar optimizations to compile ^ to a DFA, but I don't
want to do so at the expense of complicating the DFA
representation (which, again, would make overall algorithmic
optimizations harder to add).

As another example, {n,m} range patterns are not currently
converted to DFAs, though they trivially could be.  They
will tend to generate fairly large DFAs, though, and if it
gets large enough IrRegex will give up on the DFA and switch
to closure compilation anyway, which is why I didn't bother
initially.  But since IrRegex _is_ smart enough to give up
with DFAs start to get too large, it's probably better to at
least try by default.

But, ultimately, I really just want to write up some
comprehensive benchmarks to see what's going on.

-- 
Alex




reply via email to

[Prev in Thread] Current Thread [Next in Thread]