


From: Alex Shinn
Subject: Re: [Chicken-users] UTF-8 support in Chicken core [Was: [Q] uri-common has problem with UTF-8 uri.]
Date: Sun, 27 Jan 2013 20:54:37 +0900

On Sun, Jan 27, 2013 at 10:43 AM, Ivan Raikov <address@hidden> wrote:

Hi Alex,

    Yes, I would have thought that more people would be interested in having UTF-8 support in core Chicken (or at least wide-char compatible srfi-14). I have changed the title of this thread to reflect the subject more accurately :-)

  Personally, I think that adding UTF-8 in core is much better than the hacks I had to do in mbox, and is a no-brainer considering the benchmark results you have below.  But I am sure that opinions vary on this subject...

   Can you post your bounds-check patches to srfi-14 on the mailing list, and/or create a ticket for it? Hopefully there will be more responses this time.

Well, I'm not necessarily proposing UTF-8 support in the core.
I understand that has pros and cons and opinions may differ.

I was just pointing out that we've already got 3 char-set
implementations, 2 of them in the core distribution, and
there are no real cons to simplifying this and replacing
srfi-14 with one of the Unicode-capable implementations.

The simplest change I made was replacing:

(define-inline (si=0? s i) (zero? (%char->latin1 (string-ref s i))))
(define-inline (si=1? s i) (not (si=0? s i)))

with:

(define-inline (si=0? s i) (if (>= i 256) #t (zero? (%char->latin1 (string-ref s i)))))
(define-inline (si=1? s i) (and (< i 256) (eq? 1 (%char->latin1 (string-ref s i)))))

which is actually faster and, while it doesn't support
wide char-sets, at least gives the correct answers when
passed wide chars.
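
As a rough illustration of the bounds check described above, here is a stand-alone sketch. It does not use the srfi-14 internals: `make-latin1-set`, `latin1-set-contains?` and `ascii-letters` are hypothetical names, and plain `char->integer` stands in for `%char->latin1`. It only assumes the 256-entry string bitmap representation implied by the snippets above.

```scheme
;; Hypothetical model of a latin-1 char-set: a 256-char string whose
;; entry at index i is (integer->char 1) when code point i is a member.
(define (make-latin1-set pred)
  (let ((s (make-string 256 (integer->char 0))))
    (do ((i 0 (+ i 1)))
        ((= i 256) s)
      (if (pred i)
          (string-set! s i (integer->char 1))))))

;; Bounds-checked membership test: any index past the bitmap is simply
;; "not a member" instead of an out-of-range error from string-ref.
(define (si=1? s i)
  (and (< i 256)
       (= 1 (char->integer (string-ref s i)))))

(define (latin1-set-contains? s c)
  (si=1? s (char->integer c)))

;; Toy set for illustration: ASCII letters only.
(define ascii-letters
  (make-latin1-set
   (lambda (n) (or (<= 65 n 90) (<= 97 n 122)))))

(latin1-set-contains? ascii-letters #\a)                 ; => #t
(latin1-set-contains? ascii-letters #\1)                 ; => #f
(latin1-set-contains? ascii-letters (integer->char 955)) ; => #f, no error
```

The last call is the case that matters here: a wide char is reported as a non-member rather than indexing past the end of the bitmap.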

-- 
Alex


    Ivan

On Sat, Jan 26, 2013 at 1:42 PM, Alex Shinn <address@hidden> wrote:
On Wed, Jan 23, 2013 at 5:09 PM, Alex Shinn <address@hidden> wrote:
On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov <address@hidden> wrote:
Yes, I ran into this when I was adding UTF-8 support to mbox... If you were to add wide char support in srfi-14, is there a way to quantify the performance penalty?

To add the bounds check so it doesn't error?  Practically
nothing.

To branch to a separate path for a wide-char table if
the bounds check fails?  Same cost if the input is ASCII.

For efficient handling in the case of Unicode input...
how small/fast do you want it?

I've never met such stony silence in response to an offer to do work...

I ran the following simple char-set-contains? benchmark with
a few variations:

  (time
   (do ((i 0 (+ i 1)))
       ((= i 10000))
       (do ((j 0 (+ j 1)))
           ((= j 256))
         (char-set-contains? char-set:letter (integer->char j)))))

This is what most people are concerned about for speed, as
the boolean and construction operations are less common.

The results:

;; reference implementation
;; 0.312s CPU time, 1/2059 GCs (major/minor)

;; "fixed" reference implementation (no error but no support for non-latin-1)
;; 0.257s CPU time, 1/1706 GCs (major/minor)

;; utf8-srfi-14 with full Unicode char-set:letter
;; 0.243s CPU time, 0/1526 GCs (major/minor)

;; utf8-srfi-14 with ASCII-only char-set:letter
;; 0.242s CPU time, 0/1526 GCs (major/minor)

I was able to add the check and make the reference
implementation faster because I fixed the common case:
it was optimized for checking for 0 instead of 1.

Even with the enormous and complex definition of a
Unicode "letter", utf8-srfi-14 is faster than srfi-14.

As for what we want in Chicken, the answer depends
on what you're optimizing for.  utf8-srfi-14 will always
win for space, and generally for speed as well.

If the biggest concern is code-size, then you might want
to borrow the char-set definition from irregex and use
that as a "fallback" for non-latin-1 chars in the srfi-14
reference impl.  This would have the same perf as
srfi-14 for latin-1, yet still support full Unicode and not
increase the size of the Chicken distribution.
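
A minimal sketch of that fallback idea, under stated assumptions: the latin-1 path stays a 256-entry bitmap lookup, and anything at or above 256 is looked up in a sorted vector of inclusive code-point ranges by binary search. All names here (`make-hybrid-set`, `wide-contains?`, `hybrid-set-contains?`, `letters`) are hypothetical, the range data is a toy (whole Greek and Cyrillic blocks, not a real definition of "letter"), and this is not how irregex or utf8-srfi-14 actually store their sets.

```scheme
;; Hypothetical hybrid char-set: (bitmap . ranges), where bitmap is a
;; 256-char string and ranges is a sorted vector of (lo . hi) pairs.
(define (make-hybrid-set latin1-pred wide-ranges)
  (let ((bitmap (make-string 256 (integer->char 0))))
    (do ((i 0 (+ i 1)))
        ((= i 256))
      (if (latin1-pred i)
          (string-set! bitmap i (integer->char 1))))
    (cons bitmap wide-ranges)))

;; Binary search over sorted, non-overlapping inclusive ranges.
(define (wide-contains? ranges n)
  (let loop ((lo 0) (hi (- (vector-length ranges) 1)))
    (and (<= lo hi)
         (let* ((mid (quotient (+ lo hi) 2))
                (r (vector-ref ranges mid)))
           (cond ((< n (car r)) (loop lo (- mid 1)))
                 ((> n (cdr r)) (loop (+ mid 1) hi))
                 (else #t))))))

;; Latin-1 input pays only for the bounds check; wide input takes the
;; fallback path.
(define (hybrid-set-contains? cs c)
  (let ((n (char->integer c)))
    (if (< n 256)
        (= 1 (char->integer (string-ref (car cs) n)))
        (wide-contains? (cdr cs) n))))

;; Toy data, for illustration only.
(define letters
  (make-hybrid-set
   (lambda (n) (or (<= 65 n 90) (<= 97 n 122)))
   (vector (cons #x370 #x3FF) (cons #x400 #x4FF))))

(hybrid-set-contains? letters #\a)                    ; => #t
(hybrid-set-contains? letters (integer->char #x3BB))  ; => #t
(hybrid-set-contains? letters (integer->char #x4E2D)) ; => #f
```

With this layout the cost for ASCII or latin-1 input is one comparison plus the existing table lookup, which matches the claim that the fallback would not slow down the common case.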

-- 
Alex



