[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Chicken-users] utf8 and string-ref performance

From: Peter Bex
Subject: Re: [Chicken-users] utf8 and string-ref performance
Date: Wed, 24 Nov 2010 17:05:18 +0100
User-agent: Mutt/

On Wed, Nov 24, 2010 at 08:37:37AM -0700, Alan Post wrote:
> gentufa'i works by storing the entire input port in a string, and
> ceating position objects to refer to the "rest of the string" as I
> parse.
> This means I need to perform the following:
> 1) reference a character by index
> 2) compare a character, string, or regular expression starting
>    at an index.

Are you sure you need this?  If I understood the sentence above the
list correctly, it might be enough to use string->list and then work
with a list of characters.  This can be done pretty fast, and you
can store pointers into arbitrary places of the input simply by storing
the relevant cons cell.

> Without utf8, step 1 is O(1).  With utf8 enabled, that step becomes
> O(n).  step 2 is also more expensive with utf8, as I have to pay that
> same O(n) to get to the correct index, and I suspect I have some
> O(n*1) operations that become O(n*m), namely character class
> comparisons.

utf8 uses the fantastic "iset" egg for dealing with character sets.
membership tests are as fast as O(1) for short ranges, and become
O(log(n)) for longer ones, where n is the number of ranges it needs to
store.  My knowledge of complexity theory is a little rusty so I'm
probably describing this in a very roundabout fashion, but the point is
that you shouldn't worry about charset performance.  It'll be slower
than built-in srfi-14 (which is insanely fast - because it's a damn hack :P)
but not by too much.

> Essentially, I take characters off the front of the string as I
> parse, with the caveat that PEG parsers support full backtracking,
> so I sometimes retrieve previous position objects and work from
> there--I can't just throw away the prefix of the string I've
> matched.

With cons cells you should be able to implement this efficiently enough.
We don't have anything like string-pointers which can store arbitrary
indices in a string AFAIK.  That would be useful to have, I guess.

> Also, I'm not 100% sure what the utf8 gets me, compared to treating
> the string like binary data.  I suspect it would work if I had a
> utf8 input file, a utf8 string to match, but that I compare them as
> binary data.  I couldn't compare a utf8 *character*, like
> #\¿, but I think I could compare "¿".  (Those are both inverted
> question marks, if you don't have utf8 e-mail support)  Am I wrong
> about that?

No, that's correct. It's only when you start operating on the character
level (or when splitting strings for example) that utf8 makes a difference.

"The process of preparing programs for a digital computer
 is especially attractive, not only because it can be economically
 and scientifically rewarding, but also because it can be an aesthetic
 experience much like composing poetry or music."
                                                        -- Donald Knuth

reply via email to

[Prev in Thread] Current Thread [Next in Thread]