From b552052f4085e84d662f70bb76cb4abf41ab25bc Mon Sep 17 00:00:00 2001
From: Peter Bex
Date: Mon, 5 Jul 2021 11:38:43 +0200
Subject: [PATCH] Update irregex to upstream 960fa22b, fixing a group matching issue

When a kleene star is used around an alternative containing
submatches, in some circumstances the DFA compilation would emit
reordering commands which would cause the regex capturing to go wrong,
returning faulty matches.

This would go wrong because the ordering commands would read from a
memory slot and write to a target memory slot. For example, the
following set of reordering commands has no "correct" order in which
they can be executed:

p[0] <- p[1]
p[1] <- p[0]

After executing both of them in either order, both of the slots will
contain the same value, instead of swapping them as was the intention.

This is fixed by executing the ordering commands after first fetching
the old memory slot locations into a closure.

Fixes upstream issue #27
---
 NEWS               |  4 +++-
 irregex-core.scm   | 18 ++++++++++++------
 tests/re-tests.txt |  1 +
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/NEWS b/NEWS
index 46af9bd1..53a40f0f 100644
--- a/NEWS
+++ b/NEWS
@@ -10,9 +10,11 @@
     of irregex-replace/all with positive lookbehind so all matches are
     replaced instead of only the first (reported by Kay Rhodes), and
     a regression regarding replacing empty matches which was introduced
-    by the fixes in 0.9.7 (reported by Sandra Snan). Finally, the
+    by the fixes in 0.9.7 (reported by Sandra Snan). Also, the
     http-url shorthand now allows any top-level domain and the old
     "top-level-domain" now also supports "edu" (fixed by Sandra Snan).
+    Finally, a problem was fixed with capturing groups inside a kleene
+    star, which could sometimes return incorrect parts of the match.
   - current-milliseconds has been deprecated in favor of the name
     current-process-milliseconds, to avoid confusion due to naming of
     current-milliseconds versus current-seconds, which do something
diff --git a/irregex-core.scm b/irregex-core.scm
index 8f672333..a8e7c97f 100644
--- a/irregex-core.scm
+++ b/irregex-core.scm
@@ -2235,12 +2235,18 @@
                                   (chunk&position (cons src (+ i 1))))
                              (vector-set! slot (car s) chunk&position)))
                          (cdr cmds))
-                        (for-each (lambda (c)
-                                    (let* ((tag (vector-ref c 0))
-                                           (ss (vector-ref memory (vector-ref c 1)))
-                                           (ds (vector-ref memory (vector-ref c 2))))
-                                      (vector-set! ds tag (vector-ref ss tag))))
-                                  (car cmds)))))
+                        ;; Reassigning commands may be in an order which
+                        ;; causes memory cells to be clobbered before
+                        ;; they're read out. Make 2 passes to maintain
+                        ;; old values by copying them into a closure.
+                        (for-each (lambda (execute!) (execute!))
+                                  (map (lambda (c)
+                                         (let* ((tag (vector-ref c 0))
+                                                (ss (vector-ref memory (vector-ref c 1)))
+                                                (ds (vector-ref memory (vector-ref c 2)))
+                                                (value-from (vector-ref ss tag)))
+                                           (lambda () (vector-set! ds tag value-from))))
+                                       (car cmds))))))
                   (if new-finalizer
                       (lp2 (+ i 1) next src (+ i 1) new-finalizer)
                       (lp2 (+ i 1) next res-src res-index #f))))
diff --git a/tests/re-tests.txt b/tests/re-tests.txt
index 7a56edb7..39a747e6 100644
--- a/tests/re-tests.txt
+++ b/tests/re-tests.txt
@@ -171,3 +171,4 @@ multiple words multiple words, yeah y & multiple words
 (a([^a])*)* abcaBC y &-\1-\2 abcaBC-aBC-C
 ([Aa]b).*\1 abxyzab y &-\1 abxyzab-ab
 a([\/\\]*)b a//\\b y &-\1 a//\\b-//\\
+(?:[[:alnum:]]|(@[[:alnum:]]))* oeh@2tu@2n342 y \1 @2
-- 
2.20.1
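
Note on the fix: the clobbering described in the commit message can be reproduced outside irregex with a small stand-alone sketch. It is an illustration only, not code from irregex-core.scm. The procedure names `apply-commands-naive!` and `apply-commands-two-pass!` are hypothetical, and each command is simplified to a `(dst . src)` pair over one flat vector, whereas the patch uses three-element vectors (tag, source index, destination index) over per-state memory slots.

```scheme
;; Hypothetical illustration, not taken from irregex-core.scm.
;; A command (dst . src) means "slot dst <- slot src".

;; Naive version: run each command immediately. A later command may
;; read a slot that an earlier command has already overwritten.
(define (apply-commands-naive! slots cmds)
  (for-each (lambda (c)
              (vector-set! slots (car c) (vector-ref slots (cdr c))))
            cmds))

;; Two-pass version, mirroring the approach in the patch: the first
;; pass (map) reads every source value and captures it in a closure;
;; the second pass (for-each) runs the closures to do the writes.
(define (apply-commands-two-pass! slots cmds)
  (for-each (lambda (execute!) (execute!))
            (map (lambda (c)
                   (let ((value-from (vector-ref slots (cdr c))))
                     (lambda () (vector-set! slots (car c) value-from))))
                 cmds)))

;; The swap from the commit message: p[0] <- p[1], p[1] <- p[0].
(define swap '((0 . 1) (1 . 0)))

(define naive (vector 'a 'b))
(apply-commands-naive! naive swap)
;; naive now holds b in both slots: slot 0 was clobbered before being read.

(define fixed (vector 'a 'b))
(apply-commands-two-pass! fixed swap)
;; fixed now holds b in slot 0 and a in slot 1: the values were swapped.
```

Because every read happens before any write, the order of the commands no longer matters, which is why the fix needs no dependency analysis of the command list.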