Re: [Chicken-users] UTF-8 support in Chicken core [Was: [Q] uri-common h

From:	.alyn.post.
Subject:	Re: [Chicken-users] UTF-8 support in Chicken core [Was: [Q] uri-common has problem with UTF-8 uri.]
Date:	Mon, 28 Jan 2013 15:26:23 -0700

I'll throw in my two bits here.

I'm not personally decided whether utf-8 in core would be an
improvement.  I don't have enough background or knowledge of
the internals to contribute to that decision.

I can offer this, however:

I have found that I have to use utf-8 support in every project
I've written in Chicken.  I do so, and have only had a problem
when the utf-8 egg did not map a procedure from core properly.

I'm getting by just fine with the current state of affairs, and
I do have a certain nostalgic love of ASCII.  If I *could* get
away with only having ASCII, I would.  This has not been true
in practice.

My experience with numbers is slightly different: I do find I need
word-level calculations that depend on the underlying machine
implementation of character- and pointer-sized integers.  I use the
fx versions of the arithmetic procedures when I rely on this, but
mostly I have found I must intentionally subvert the numeric tower
to get a specific behavior.  This has never been true when I've
dealt with characters.

FWIW,

-Alan

On Sun, Jan 27, 2013 at 10:43:41AM +0900, Ivan Raikov wrote:
>    Hi Alex,
> 
>    Yes, I would have thought that more people would be interested in
>    having UTF-8 support in core Chicken (or at least wide-char compatible
>    srfi-14). I have changed the title of this thread to reflect the subject
>    more accurately :-)
> 
>    Personally, I think that adding UTF-8 in core is much better than the
>    hacks I had to do in mbox, and is a no-brainer considering the benchmark
>    results you have below. But I am sure that opinions vary on this
>    subject...
> 
>    Can you post your bounds-check patches to srfi-14 on the mailing list,
>    and/or create a ticket for it? Hopefully there will be more responses this
>    time.
> 
>    Ivan
> 
>    On Sat, Jan 26, 2013 at 1:42 PM, Alex Shinn <address@hidden>
>    wrote:
> 
>      On Wed, Jan 23, 2013 at 5:09 PM, Alex Shinn <address@hidden>
>      wrote:
> 
>        On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov
>        <address@hidden> wrote:
> 
>          Yes, I ran into this when I was adding UTF-8 support to mbox... If
>          you were to add wide char support in srfi-14, is there a way to
>          quantify the performance penalty?
> 
>        To add the bounds check so it doesn't error? Practically
>        nothing.
>        To branch to a separate path for a wide-char table if
>        the bounds check fails? Same cost if the input is ASCII.
>        For efficient handling in the case of Unicode input...
>        how small/fast do you want it?
> 
>      I've never met such stony silence in response to an offer to do work...
>      I ran the following simple char-set-contains? benchmark with
>      a few variations:
>      (time
>       (do ((i 0 (+ i 1)))
>           ((= i 10000))
>         (do ((j 0 (+ j 1)))
>             ((= j 256))
>           (char-set-contains? char-set:letter (integer->char j)))))
>      This is what most people are concerned about for speed, as
>      the boolean and construction operations are less common.
>      The results:
>      ;; reference implementation
>      ;; 0.312s CPU time, 1/2059 GCs (major/minor)
>      ;; "fixed" reference implementation (no error but no support for
>      ;; non-latin-1)
>      ;; 0.257s CPU time, 1/1706 GCs (major/minor)
>      ;; utf8-srfi-14 with full Unicode char-set:letter
>      ;; 0.243s CPU time, 0/1526 GCs (major/minor)
>      ;; utf8-srfi-14 with ASCII-only char-set:letter
>      ;; 0.242s CPU time, 0/1526 GCs (major/minor)
>      I was able to add the check and make the reference
>      implementation faster because I fixed the common case -
>      it was optimized for checking for 0 instead of 1.
>      Even with the enormous and complex definition of a
>      Unicode "letter", utf8-srfi-14 is faster than srfi-14.
>      As for what we want in Chicken, the answer depends
>      on what you're optimizing for. utf8-srfi-14 will always
>      win for space, and generally for speed as well.
>      If the biggest concern is code-size, then you might want
>      to borrow the char-set definition from irregex and use
>      that as a "fallback" for non-latin-1 chars in the srfi-14
>      reference impl. This would have the same perf as
>      srfi-14 for latin-1, yet still support full Unicode and not
>      increase the size of the Chicken distribution.
>      --
>      Alex
> 
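
For readers following the quoted discussion, the "fallback" idea Alex
describes (a flat table for Latin-1 code points, with a branch to a wider
definition only when the bounds check fails) might be sketched roughly as
below.  This is only an illustration under stated assumptions, not the
code of srfi-14, utf8-srfi-14, or irregex: the names make-latin1-table,
fast-contains? and letter-table are hypothetical, and char-alphabetic?
merely stands in for a real Unicode-aware predicate.

```scheme
;; Hypothetical sketch: a 256-entry vector answers membership for
;; Latin-1 code points; anything larger falls through to a slower
;; predicate instead of signalling an out-of-range error.

(define (make-latin1-table pred)
  ;; Precompute pred for code points 0..255 and return the table.
  (let ((table (make-vector 256 #f)))
    (do ((i 0 (+ i 1)))
        ((= i 256) table)
      (vector-set! table i (and (pred (integer->char i)) #t)))))

(define (fast-contains? table wide-pred c)
  (let ((n (char->integer c)))
    (if (< n 256)
        (vector-ref table n)   ; common case: one bounds check, one lookup
        (wide-pred c))))       ; rare case: fallback for non-Latin-1 chars

;; Assumption: char-alphabetic? is a placeholder for both the table
;; contents and the wide-character fallback.
(define letter-table (make-latin1-table char-alphabetic?))

(fast-contains? letter-table char-alphabetic? #\a)  ; => #t
(fast-contains? letter-table char-alphabetic? #\1)  ; => #f
```

For ASCII or Latin-1 input the cost is a single comparison plus a vector
lookup, which is the case the benchmark above exercises; the fallback
only runs for code points of 256 and above.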

-- 
my personal website: http://c0redump.org/
