[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Chicken-users] utf8 and string-ref performance
From: |
Alan Post |
Subject: |
[Chicken-users] utf8 and string-ref performance |
Date: |
Wed, 24 Nov 2010 08:37:37 -0700 |
I'm reaching a point where my PEG parser, gentufa'i[1], is going to
be ready to tag version 1.0.
I have an issue that I've been putting off that I would like some
input on.
If possible, I would like to parse utf8 input. I currently have
utf8 enabled in my egg.
gentufa'i works by storing the entire input port in a string, and
ceating position objects to refer to the "rest of the string" as I
parse.
This means I need to perform the following:
1) reference a character by index
2) compare a character, string, or regular expression starting
at an index.
Without utf8, step 1 is O(1). With utf8 enabled, that step becomes
O(n). step 2 is also more expensive with utf8, as I have to pay that
same O(n) to get to the correct index, and I suspect I have some
O(n*1) operations that become O(n*m), namely character class
comparisons. I think that means step 2 becomes O(x*y*z) rather than
O(1*y*1), (x=index, y=string comparison, z=character class comparison)
but don't hold me to that!
I suspect there is a way to avoid the O(n) penalty for step 1, but
I'm uncertain how to do it. I have some patterns to the way I
index, all of which are contained the position object in my code[2]:
1) I increment the index position by one character
2) I increment the index position by the length of the string I just
succeeded at matching.
Essentially, I take characters off the front of the string as I
parse, with the caveat that PEG parsers support full backtracking,
so I sometimes retrieve previous position objects and work from
there--I can't just throw away the prefix of the string I've
matched.
Can anyone point me in the right direction?
Also, I'm not 100% sure what the utf8 gets me, compared to treating
the string like binary data. I suspect it would work if I had a
utf8 input file, a utf8 string to match, but that I compare them as
binary data. I couldn't compare a utf8 *character*, like
#\¿, but I think I could compare "¿". (Those are both inverted
question marks, if you don't have utf8 e-mail support) Am I wrong
about that?
Thank you for your help.
1: http://wiki.call-cc.org/eggref/4/genturfahi
2: http://bugs.call-cc.org/browser/release/4/genturfahi/trunk/lerfu-porsi.scm
-Alan
--
.i ko djuno fi le do sevzi
- [Chicken-users] utf8 and string-ref performance,
Alan Post <=