[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Chicken-users] utf8 and string-ref performance

From: Alan Post
Subject: [Chicken-users] utf8 and string-ref performance
Date: Wed, 24 Nov 2010 08:37:37 -0700

I'm reaching a point where my PEG parser, gentufa'i[1], is going to
be ready to tag version 1.0.

I have an issue that I've been putting off that I would like some
input on.

If possible, I would like to parse utf8 input.  I currently have
utf8 enabled in my egg.

gentufa'i works by storing the entire input port in a string, and
ceating position objects to refer to the "rest of the string" as I

This means I need to perform the following:

1) reference a character by index
2) compare a character, string, or regular expression starting
   at an index.

Without utf8, step 1 is O(1).  With utf8 enabled, that step becomes
O(n).  step 2 is also more expensive with utf8, as I have to pay that
same O(n) to get to the correct index, and I suspect I have some
O(n*1) operations that become O(n*m), namely character class
comparisons.  I think that means step 2 becomes O(x*y*z) rather than
O(1*y*1), (x=index, y=string comparison, z=character class comparison)
but don't hold me to that!

I suspect there is a way to avoid the O(n) penalty for step 1, but
I'm uncertain how to do it.  I have some patterns to the way I
index, all of which are contained the position object in my code[2]:

1) I increment the index position by one character
2) I increment the index position by the length of the string I just
   succeeded at matching.

Essentially, I take characters off the front of the string as I
parse, with the caveat that PEG parsers support full backtracking,
so I sometimes retrieve previous position objects and work from
there--I can't just throw away the prefix of the string I've

Can anyone point me in the right direction?

Also, I'm not 100% sure what the utf8 gets me, compared to treating
the string like binary data.  I suspect it would work if I had a
utf8 input file, a utf8 string to match, but that I compare them as
binary data.  I couldn't compare a utf8 *character*, like
#\¿, but I think I could compare "¿".  (Those are both inverted
question marks, if you don't have utf8 e-mail support)  Am I wrong
about that?

Thank you for your help.


.i ko djuno fi le do sevzi

reply via email to

[Prev in Thread] Current Thread [Next in Thread]