From: Alex Shinn
Subject: Re: [Chicken-users] UTF-8 support in Chicken core [Was: [Q] uri-common has problem with UTF-8 uri.]
Date: Sun, 27 Jan 2013 20:54:37 +0900

Hi Alex,

Yes, I would have thought that more people would be interested in having UTF-8 support in core Chicken (or at least wide-char compatible srfi-14). I have changed the title of this thread to reflect the subject more accurately :-)
Personally, I think that adding UTF-8 in core is much better than the hacks I had to do in mbox, and is a no-brainer considering the benchmark results you have below. But I am sure that opinions vary on this subject...
Can you post your bounds-check patches to srfi-14 on the mailing list, and/or create a ticket for it? Hopefully there will be more responses this time.
Ivan

On Sat, Jan 26, 2013 at 1:42 PM, Alex Shinn <address@hidden> wrote:
On Wed, Jan 23, 2013 at 5:09 PM, Alex Shinn <address@hidden> wrote:
On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov <address@hidden> wrote:
Yes, I ran into this when I was adding UTF-8 support to mbox... If you were to add wide char support in srfi-14, is there a way to quantify the performance penalty?
To add the bounds check so it doesn't error? Practically nothing.

To branch to a separate path for a wide-char table if the bounds check fails? Same cost if the input is ASCII.

For efficient handling in the case of Unicode input... how small/fast do you want it?

I've never met such stony silence in response to an offer to do work...

I ran the following simple char-set-contains? benchmark with a few variations:

    (time
     (do ((i 0 (+ i 1)))
         ((= i 10000))
       (do ((j 0 (+ j 1)))
           ((= j 256))
         (char-set-contains? char-set:letter (integer->char j)))))

This is what most people are concerned about for speed, as the boolean and construction operations are less common.

The results:

    ;; reference implementation
    ;; 0.312s CPU time, 1/2059 GCs (major/minor)

    ;; "fixed" reference implementation (no error but no support for non-latin-1)
    ;; 0.257s CPU time, 1/1706 GCs (major/minor)

    ;; utf8-srfi-14 with full Unicode char-set:letter
    ;; 0.243s CPU time, 0/1526 GCs (major/minor)

    ;; utf8-srfi-14 with ASCII-only char-set:letter
    ;; 0.242s CPU time, 0/1526 GCs (major/minor)

I was able to add the check and make the reference implementation faster because I fixed the common case - it was optimized for checking for 0 instead of 1.

Even with the enormous and complex definition of a Unicode "letter", utf8-srfi-14 is faster than srfi-14.

As for what we want in Chicken, the answer depends on what you're optimizing for. utf8-srfi-14 will always win for space, and generally for speed as well.

If the biggest concern is code-size, then you might want to borrow the char-set definition from irregex and use that as a "fallback" for non-latin-1 chars in the srfi-14 reference impl. This would have the same perf as srfi-14 for latin-1, yet still support full Unicode and not increase the size of the Chicken distribution.

--
Alex
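
For readers following the thread, the "bounds check plus fallback" idea can be sketched roughly as below. This is only an illustration under stated assumptions, not the srfi-14 reference implementation, utf8-srfi-14, or irregex code: the names make-sketch-set, sketch-set-contains?, in-ranges? and ascii-letter? are hypothetical, and the two fallback ranges (Greek small letters and basic Cyrillic) stand in for a real Unicode "letter" definition.

```scheme
;; Hypothetical sketch: a set is (table . ranges), where TABLE is a
;; 256-entry vector of booleans covering Latin-1 and RANGES is a list of
;; inclusive (lo . hi) code-point pairs used only as a fallback.

(define (ascii-letter? ch)
  (let ((n (char->integer ch)))
    (or (and (>= n 65) (<= n 90))      ; A-Z
        (and (>= n 97) (<= n 122)))))  ; a-z

(define (make-sketch-set latin1-pred ranges)
  (let ((table (make-vector 256 #f)))
    ;; Precompute membership for every code point below 256.
    (do ((i 0 (+ i 1)))
        ((= i 256))
      (vector-set! table i (latin1-pred (integer->char i))))
    (cons table ranges)))

(define (in-ranges? n ranges)
  ;; Linear scan; a real implementation would likely use binary search.
  (cond ((null? ranges) #f)
        ((and (>= n (caar ranges)) (<= n (cdar ranges))) #t)
        (else (in-ranges? n (cdr ranges)))))

(define (sketch-set-contains? cs ch)
  (let ((n (char->integer ch)))
    (if (< n 256)                      ; the bounds check
        (vector-ref (car cs) n)        ; fast path: table lookup
        (in-ranges? n (cdr cs)))))     ; fallback instead of an error

;; Illustrative ranges only: U+03B1..U+03C9 and U+0410..U+044F.
(define letters
  (make-sketch-set ascii-letter? '((#x3b1 . #x3c9) (#x410 . #x44f))))

(sketch-set-contains? letters #\a)                    ; => #t
(sketch-set-contains? letters (integer->char #x3bb))  ; => #t (fallback path)
```

For Latin-1 input the only extra work over a plain table lookup is the single comparison against 256, which is consistent with the "practically nothing" estimate above; the range list is consulted only when that comparison fails.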