[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Chicken-hackers] [PATCH] Fix #878

From: Peter Bex
Subject: [Chicken-hackers] [PATCH] Fix #878
Date: Sun, 22 Jun 2014 18:23:11 +0200
User-agent: Mutt/

Hi all,

I had a look at #878, and though this is mind-bending code (at least,
for my feeble mind) it's pretty clear what's gone wrong if you compare
the code to the reference implementation.

Of course, the question then becomes which is right, ours or Olin's.
To verify that more easily because these vectors are a little hard
to grok directly, I made a test case which is easier to step through
in your head.

The idea behind kmp-search is explained in the SRFI document, but it
was pretty unclear to me all this time, until I was "forced" to
understand for this bug, so let me try and explain it a bit here.

The idea is that you look at a search string character-by-character
without any backtracking.  The pattern string you're matching against
then remembers some state so that it knows where in the pattern you are.

For example, if we match "abcabe" against the pattern "abe", we get this:

start: _abe
read a => a_be
read b => ab_e
read c => _abe  (reset)
read a => a_be
read b => ab_e
read e => abe_ (done)

We know we're done because we arrived at the pattern's final index.
If there's still data waiting on the stream, we need to decide how
the search is anchored.  This is essentially the same as the DFA regex
matching performed in irregex; it's a simple state machine.

Where it gets trickier is when the pattern contains repetitions.  For
example, the pattern "abac" matches against search string "ababac" like

start: _abac
read a => a_bac
read b => ab_ac
read a => aba_c
read b => ab_ac (reset, but to position 2!)
read a => aba_c
read c => abac_ (done)

As you can see, the "magic" here is in how it determines to jump back to
position 2, it somehow "knows" that the string "ab" is a suffix of the
complete string read so far, so it won't have to re-check that part of
the pattern.  The restart-vector is simply a representation of the state
transition diagram for this state machine.

The code which constructs this restart vector is mind-bending, and I
still don't fully understand it.  It somehow seems to loop in parallel,
simultaneously stepping through three indices.  The only really simple
bit is the "k" variable, which is the "main" loop, through the pattern
like k = start..end.  At the same time it loops through the vector using
i = 0..len and using j, which uses the positions in the restart vector
constructed so far (between 0..i, or at least that's what I *think*).

How and what exactly doesn't matter so much right now, because the tests
I added clearly show that the current CHICKEN code is simply wrong.  The
reference implementation passes the tests, and really differs only in two
places, where the ck variable is used (which does not exist in the
reference implementation).  ck represents the character at position k in
the pattern, but the problem is that in the reference implementation it
indexes the string in one place at k+1 and in the other place at k.  The
CHICKEN version indexes the string in both places at k.  The patch
simply reverts this attempted optimisation to behave like the reference
implementation again.


Attachment: 0001-Fix-878-which-was-indeed-a-bug-caused-by-an-incorrec.patch
Description: Text document

reply via email to

[Prev in Thread] Current Thread [Next in Thread]