From: .alyn.post.
Subject: Re: [Chicken-users] UTF-8 support in Chicken core [Was: [Q] uri-common has problem with UTF-8 uri.]
Date: Mon, 28 Jan 2013 15:26:23 -0700

I'll throw in my two bits here.

I'm not personally decided whether utf-8 in core would be an
improvement. I don't have enough background or knowledge of
the internals to contribute to that decision.

I can offer this, however:

I have found that I have to use utf-8 support in every project
I've written in Chicken. I do so, and have only had a problem
when the utf-8 egg did not map a procedure from core properly.

I'm getting by just fine with the current state of affairs, and
I do have a certain nostalgic love of ASCII. If I *could* get
away with only having ASCII, I would. This has not been true
in practice.

My experience with numbers is slightly different, where I do
find I need to do word-level calculation where I depend on the
underlying machine implementation of character- and pointer-sized
integers. I use the fx versions of these functions when I do
rely on this, but I mainly have found I must intentionally subvert
the numeric tower to get a specific behavior. This has never been
true when I've dealt with characters.
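
To make the fx point concrete, here is a minimal sketch of that kind of
word-level code. The procedure name checksum16 and the 16-bit mask are
hypothetical, chosen only for illustration. The fx operators assume
fixnum arguments and skip the generic numeric dispatch, so the result is
kept in range by masking explicitly.

```scheme
;; Chicken 4 exposes the fx operators by default; Chicken 5 keeps them
;; in the (chicken fixnum) module.
(cond-expand
  (chicken-5 (import (chicken fixnum)))
  (else))

;; Hypothetical helper: fold a list of byte values (0-255) into a
;; 16-bit checksum using only fixnum operations.
(define (checksum16 bytes)
  (let loop ((bs bytes) (acc 0))
    (if (null? bs)
        acc
        (loop (cdr bs)
              ;; shift, add the next byte, then mask back to 16 bits
              (fxand (fx+ (fxshl acc 1) (car bs)) #xffff)))))
```

Because the mask is applied on every step, the intermediate value never
leaves the fixnum range, and the generic tower is never consulted.
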
FWIW,
-Alan
On Sun, Jan 27, 2013 at 10:43:41AM +0900, Ivan Raikov wrote:
> Hi Alex,
>
> Yes, I would have thought that more people would be interested in
> having UTF-8 support in core Chicken (or at least wide-char compatible
> srfi-14). I have changed the title of this thread to reflect the subject
> more accurately :-)
>
> Personally, I think that adding UTF-8 in core is much better than the
> hacks I had to do in mbox, and is a no-brainer considering the benchmark
> results you have below. But I am sure that opinions vary on this
> subject...
>
> Can you post your bounds-check patches to srfi-14 on the mailing list,
> and/or create a ticket for it? Hopefully there will be more responses this
> time.
>
> Ivan
> On Sat, Jan 26, 2013 at 1:42 PM, Alex Shinn <address@hidden>
> wrote:
>
> On Wed, Jan 23, 2013 at 5:09 PM, Alex Shinn <address@hidden>
> wrote:
>
> On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov
> <address@hidden> wrote:
>
> Yes, I ran into this when I was adding UTF-8 support to mbox... If
> you were to add wide char support in srfi-14, is there a way to
> quantify the performance penalty?
>
> To add the bounds check so it doesn't error? Practically nothing.
> To branch to a separate path for a wide-char table if
> the bounds check fails? Same cost if the input is ASCII.
> For efficient handling in the case of Unicode input...
> how small/fast do you want it?
>
> I've never met such stony silence in response to an offer to do work...
> I ran the following simple char-set-contains? benchmark with
> a few variations:
>   (time
>    (do ((i 0 (+ i 1)))
>        ((= i 10000))
>      (do ((j 0 (+ j 1)))
>          ((= j 256))
>        (char-set-contains? char-set:letter (integer->char j)))))
> This is what most people are concerned about for speed, as
> the boolean and construction operations are less common.
> The results:
> ;; reference implementation
> ;; 0.312s CPU time, 1/2059 GCs (major/minor)
> ;; "fixed" reference implementation (no error but no support for
> ;; non-latin-1)
> ;; 0.257s CPU time, 1/1706 GCs (major/minor)
> ;; utf8-srfi-14 with full Unicode char-set:letter
> ;; 0.243s CPU time, 0/1526 GCs (major/minor)
> ;; utf8-srfi-14 with ASCII-only char-set:letter
> ;; 0.242s CPU time, 0/1526 GCs (major/minor)
> I was able to add the check and make the reference
> implementation faster because I fixed the common case -
> it was optimized for checking for 0 instead of 1.
> Even with the enormous and complex definition of a
> Unicode "letter", utf8-srfi-14 is faster than srfi-14.
> As for what we want in Chicken, the answer depends
> on what you're optimizing for. utf8-srfi-14 will always
> win for space, and generally for speed as well.
> If the biggest concern is code-size, then you might want
> to borrow the char-set definition from irregex and use
> that as a "fallback" for non-latin-1 chars in the srfi-14
> reference impl. This would have the same perf as
> srfi-14 for latin-1, yet still support full Unicode and not
> increase the size of the Chicken distribution.
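
The bounds check with a fallback described above can be sketched roughly
as follows. This is not the srfi-14 reference code or the irregex
definition. The names latin-1-table, wide-ranges, in-ranges? and
sketch-contains? are hypothetical, the table marks only the ASCII
letters, and the two ranges (Greek lowercase and basic Cyrillic letters)
are a tiny stand-in for a real Unicode "letter" definition.

```scheme
;; Hypothetical 256-entry table: 1 marks a member, 0 a non-member.
(define latin-1-table (make-vector 256 0))

;; Mark A-Z (65-90) and a-z (97-122) only, to keep the sketch short.
(do ((i 65 (+ i 1)))
    ((> i 90))
  (vector-set! latin-1-table i 1)
  (vector-set! latin-1-table (+ i 32) 1))

;; Hypothetical fallback: sorted, inclusive code-point ranges.
(define wide-ranges
  '((#x3b1 . #x3c9)     ; Greek small letters alpha..omega
    (#x410 . #x44f)))   ; Cyrillic letters A..ya

;; Linear scan over the sorted ranges; stops early when n is too small.
(define (in-ranges? n ranges)
  (cond ((null? ranges) #f)
        ((< n (caar ranges)) #f)
        ((<= n (cdar ranges)) #t)
        (else (in-ranges? n (cdr ranges)))))

(define (sketch-contains? c)
  (let ((n (char->integer c)))
    (if (< n 256)
        ;; common case: one comparison plus a table lookup
        (= 1 (vector-ref latin-1-table n))
        ;; bounds check failed: take the separate wide-char path
        (in-ranges? n wide-ranges))))
```

For latin-1 input the only extra work is the single (< n 256)
comparison; the range list is consulted only for wider characters.
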
> --
> Alex
>
--
my personal website: http://c0redump.org/