From: Evan Hanson
Subject: Re: [Chicken-hackers] [PATCH][5] Problem with utf8 character classes in irregex
Date: Sun, 12 Nov 2017 09:52:35 +1300
Applied.
Thanks to all involved.
Evan
On 2017-11-10 20:55, Peter Bex wrote:
> On Fri, Nov 10, 2017 at 10:59:26AM +0100, Peter Bex wrote:
> > Just as a heads-up, I'd like to wait applying this until it has hit
> > upstream, in case Alex decides to fix it differently. I do think
> > it should go into 4.13.0.
>
> OK, this has now landed in the upstream code. Attached are signed-off
> copies for master and chicken-5. I've also updated NEWS.
>
> Cheers,
> Peter
From 9aa8bbc642d1b4bb9870327dd3dff6e200f0bd27 Mon Sep 17 00:00:00 2001
From: LemonBoy <address@hidden>
Date: Thu, 9 Nov 2017 13:29:08 +0100
Subject: [PATCH] Fix an error in unicode-range->utf8-pattern
The sequence generated for a utf8 character class contained an
unintended trailing '(), causing the code to fail when
`sre-length-ranges' is called.
Reported by Chunyang Xu at CHICKEN-users.
Signed-off-by: Peter Bex <address@hidden>
---
NEWS | 2 ++
irregex-core.scm | 7 +++----
tests/test-irregex.scm | 2 ++
3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/NEWS b/NEWS
index 397fd532..46d2c0eb 100644
--- a/NEWS
+++ b/NEWS
@@ -19,6 +19,8 @@
on s8vectors (thanks to Kristian Lein-Mathisen).
- Large literals no longer crash with "invalid encoded numeric literal"
on mingw-64 (#1344, thanks to Lemonboy).
+ - Unit irregex: Fix bug that prevented multibyte UTF-8 character sets
+ from being matched correctly (Thanks to Lemonboy and Chunyang Xu).
- Runtime system:
- The profiler no longer uses malloc from a signal handler which may
diff --git a/irregex-core.scm b/irregex-core.scm
index 7ac043d3..ba6d1f72 100644
--- a/irregex-core.scm
+++ b/irregex-core.scm
@@ -1407,12 +1407,11 @@
(unicode-range-up-to hi-ls)))
(let lp ((lo-ls lo-ls) (hi-ls hi-ls))
(cond
- ((null? lo-ls)
- '())
((= (car lo-ls) (car hi-ls))
(sre-sequence
- (list (integer->char (car lo-ls))
- (lp (cdr lo-ls) (cdr hi-ls)))))
+ (cons (integer->char (car lo-ls))
+ (if (null? (cdr lo-ls)) '()
+ (cons (lp (cdr lo-ls) (cdr hi-ls)) '())))))
((= (+ (car lo-ls) 1) (car hi-ls))
(sre-alternate (list (unicode-range-up-from lo-ls)
(unicode-range-up-to hi-ls))))
diff --git a/tests/test-irregex.scm b/tests/test-irregex.scm
index 1a460549..9a5402c4 100644
--- a/tests/test-irregex.scm
+++ b/tests/test-irregex.scm
@@ -538,5 +538,7 @@
(test-assert (not (irregex-search "(?u:<[^あ-ん語]*>)" "<ひらがな>")))
(test-assert (not (irregex-search "(?u:<[^あ-ん語]*>)" "<語>")))
+(test-assert (not (irregex-search (irregex "[一二]" 'utf8 #t) "三四")))
+
(test-end)
--
2.11.0
From d210eaac4762a7b5d95405b8a6b990329f941760 Mon Sep 17 00:00:00 2001
From: LemonBoy <address@hidden>
Date: Thu, 9 Nov 2017 13:29:08 +0100
Subject: [PATCH] Fix an error in unicode-range->utf8-pattern
The sequence generated for a utf8 character class contained an
unintended trailing '(), causing the code to fail when
`sre-length-ranges' is called.
Reported by Chunyang Xu at CHICKEN-users.
Signed-off-by: Peter Bex <address@hidden>
---
NEWS | 2 ++
irregex-core.scm | 7 +++----
tests/test-irregex.scm | 2 ++
3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/NEWS b/NEWS
index c849aede..eff9865b 100644
--- a/NEWS
+++ b/NEWS
@@ -138,6 +138,8 @@
on s8vectors (thanks to Kristian Lein-Mathisen).
- Large literals no longer crash with "invalid encoded numeric literal"
on mingw-64 (#1344, thanks to Lemonboy).
+ - Unit irregex: Fix bug that prevented multibyte UTF-8 character sets
+ from being matched correctly (Thanks to Lemonboy and Chunyang Xu).
- Runtime system:
- The profiler no longer uses malloc from a signal handler which may
diff --git a/irregex-core.scm b/irregex-core.scm
index c83aff9b..bef8336e 100644
--- a/irregex-core.scm
+++ b/irregex-core.scm
@@ -1402,12 +1402,11 @@
(unicode-range-up-to hi-ls)))
(let lp ((lo-ls lo-ls) (hi-ls hi-ls))
(cond
- ((null? lo-ls)
- '())
((= (car lo-ls) (car hi-ls))
(sre-sequence
- (list (integer->char (car lo-ls))
- (lp (cdr lo-ls) (cdr hi-ls)))))
+ (cons (integer->char (car lo-ls))
+ (if (null? (cdr lo-ls)) '()
+ (cons (lp (cdr lo-ls) (cdr hi-ls)) '())))))
((= (+ (car lo-ls) 1) (car hi-ls))
(sre-alternate (list (unicode-range-up-from lo-ls)
(unicode-range-up-to hi-ls))))
diff --git a/tests/test-irregex.scm b/tests/test-irregex.scm
index 19218bd8..d7bfaf59 100644
--- a/tests/test-irregex.scm
+++ b/tests/test-irregex.scm
@@ -539,6 +539,8 @@
(test-assert (not (irregex-search "(?u:<[^あ-ん語]*>)" "<ひらがな>")))
(test-assert (not (irregex-search "(?u:<[^あ-ん語]*>)" "<語>")))
+(test-assert (not (irregex-search (irregex "[一二]" 'utf8 #t) "三四")))
+
(test-end)
(test-exit)
--
2.11.0
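
For readers following the commit message, here is a minimal sketch of the
shape change in the equal-bytes branch. It is not the irregex source: `seq`
is a simplified stand-in for `sre-sequence`, and `old-shape`, `new-shape` and
`has-empty-element?` are hypothetical names used only for illustration. The
byte list (228 184 128) is the UTF-8 encoding of 一 (U+4E00), taken as both
the low and high end of the range so that only that branch is exercised.

```scheme
;; Simplified stand-in for sre-sequence (an assumption, not the library
;; code): collapse trivial cases, otherwise tag the list with 'seq.
(define (seq ls)
  (cond ((null? ls) 'epsilon)
        ((null? (cdr ls)) (car ls))
        (else (cons 'seq ls))))

;; Pre-patch shape: recurse until the byte list is empty and return '(),
;; which then ends up as the last element of the innermost sequence.
(define (old-shape bytes)
  (if (null? bytes)
      '()
      (seq (list (integer->char (car bytes))
                 (old-shape (cdr bytes))))))

;; Post-patch shape: stop at the last byte, so no '() element is emitted.
(define (new-shape bytes)
  (seq (cons (integer->char (car bytes))
             (if (null? (cdr bytes))
                 '()
                 (cons (new-shape (cdr bytes)) '())))))

;; True if '() occurs as an element anywhere in the nested pattern.
(define (has-empty-element? x)
  (and (pair? x)
       (or (null? (car x))
           (has-empty-element? (car x))
           (has-empty-element? (cdr x)))))

(define bytes '(228 184 128)) ; UTF-8 bytes of U+4E00

(display (has-empty-element? (old-shape bytes))) (newline)
(display (has-empty-element? (new-shape bytes))) (newline)
```

Under these assumptions the old construction nests a stray '() inside the
innermost sequence, while the new one ends cleanly on the last byte. Code
that walks the pattern and expects every element to be a valid SRE, as
`sre-length-ranges' does according to the commit message, would trip over
the former.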