Re: Patch for CHICKEN 6 uri-generic
From: Peter Bex
Subject: Re: Patch for CHICKEN 6 uri-generic
Date: Wed, 11 Sep 2024 11:43:42 +0200
Hi Ivan,
I went ahead and committed this code. The initial version that was
copied from CHICKEN 5 by Felix didn't work properly anyway, so it's
better than what was there. I also tagged it as 4.0 so we can install
other eggs depending on it and move forward.
Feel free to improve the code, and we can tag a new improved version.
Nobody is really relying on the exact way the code decodes UTF-8 yet
anyway, so it's fine if we change it a bit.
Cheers,
Peter
On Sat, Aug 31, 2024 at 06:51:47PM -0700, Ivan Raikov wrote:
> Hello Peter,
>
> Thanks for your patience, and apologies for blocking the porting of
> the web stack. It has been a really busy summer for me. I think
> pct-encode and pct-decode contain many undocumented constants, which
> makes them difficult to understand for someone unfamiliar with UTF-8
> encoding. Let me read through and try to understand and at least
> annotate the logic over the next couple of days, so that it can be
> related to the UTF-8 byte sequence syntax in RFC 3629.
>
> Thanks,
> Ivan
>
> On Tue, Aug 27, 2024 at 5:47 AM Peter Bex <peter@more-magic.net> wrote:
> >
> > Hi Ivan,
> >
> > I'd like to get something committed, since this is blocking porting
> > efforts for the rest of the web stack, which I'd like to work on during
> > the Gosling CHICKEN event. Would you object if I just commit what I have
> > now? We can make improvements on this as we go.
> > Note that CHICKEN 6 isn't officially out yet anyway, but it'd be nice
> > if most of the important eggs already work on the day it's released.
> >
> > Cheers,
> > Peter
> >
> > On Wed, May 22, 2024 at 11:33:08AM -0700, Ivan Raikov wrote:
> > > Hello Peter,
> > >
> > > Thanks a lot for the patch! Overall it looks ok, but it has been quite
> > > a while since I have had to deal with UTF-8 at this level of detail,
> > > so I don't really understand all the bitwise operations and range
> > > comparisons. I am wondering if it is possible to factor out the
> > > UTF-8-specific logic into a separate module and let it be invoked by
> > > the uri-generic parsing routines. Also, I think it would be
> > > tremendously helpful to use named constants, as I don't quite know the
> > > significance of #x800 or #x10000. Perhaps CHICKEN 6 already offers the
> > > definitions and routines to make this code more readable? I will try
> > > to install CHICKEN 6 and actually run the code with your patch soon.
> > >
> > > Thanks,
> > > Ivan
> > >
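For reference, the two constants asked about above are the standard UTF-8
length boundaries from RFC 3629: #x800 is the first code point that needs
three bytes, and #x10000 is the first that needs four. Here is a minimal
sketch of what named constants could look like. It is written in Python
purely for illustration; every name in it is hypothetical and nothing is
taken from uri-generic or from CHICKEN 6.

```python
# Illustrative names only; none of these come from uri-generic or CHICKEN.
FIRST_2_BYTE = 0x80         # below this: 1 byte  (0xxxxxxx)
FIRST_3_BYTE = 0x800        # below this: 2 bytes (110xxxxx 10xxxxxx)
FIRST_4_BYTE = 0x10000      # below this: 3 bytes (1110xxxx 10xxxxxx 10xxxxxx)
LAST_CODE_POINT = 0x10FFFF  # up to here: 4 bytes (11110xxx + three 10xxxxxx)
SURROGATES = range(0xD800, 0xE000)  # RFC 3629 forbids encoding these


def utf8_length(code_point):
    """Return how many bytes UTF-8 needs to encode code_point."""
    if not 0 <= code_point <= LAST_CODE_POINT or code_point in SURROGATES:
        raise ValueError("not an encodable code point: %r" % (code_point,))
    if code_point < FIRST_2_BYTE:
        return 1
    if code_point < FIRST_3_BYTE:
        return 2
    if code_point < FIRST_4_BYTE:
        return 3
    return 4
```

With names like these, a range comparison against #x800 reads as "does
this code point still fit in two bytes?".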
> > > On Thu, May 16, 2024 at 2:52 AM Peter Bex <peter@more-magic.net> wrote:
> > > >
> > > > On Wed, May 15, 2024 at 02:44:07PM +0200, Peter Bex wrote:
> > > > > Unfortunately, it also means we must now choose to reject certain URIs
> > > > > (at least in uri-common) by raising an exception instead of allowing
> > > > > them to be decoded. These are for invalid UTF-8 encoded characters,
> > > > > either because they're a truncated byte sequence or because they
> > > > > encode a character in too many bytes.
> > > >
> > > > I realised that there was a bug in this code, since "eat-rest-chars"
> > > > would consume the percent-encoded bytes and then they'd get discarded
> > > > in case the set of characters to decode doesn't contain the decoded
> > > > character in question.
> > > >
> > > > After trying this out, I noticed that the code actually worked, to my
> > > > astonishment. Then I realised that this was because the code would
> > > > still cons the UTF-8 *leading byte* back onto the result, and then
> > > > traverse the cdr of the UTF-8 tail bytes as the rest list, which would
> > > > pass through unprocessed.
> > > >
> > > > I added test cases for both situations and changed the code to detect
> > > > UTF-8 continuation bytes without a leading byte and bail out in such
> > > > cases, or other unforeseen and unhandled cases (the "else" in the main
> > > > decoder). And of course tweaked the eat-rest-chars code and callers to
> > > > always restore all of the undecoded bytes.
> > > >
> > > > Still not super happy with using "values" here and the way we're passing
> > > > in somewhat redundant information about the first consumed byte. I
> > > > realise an alternative would be to pass the success continuation as an
> > > > argument but I don't think that makes the code much clearer.
> > > >
> > > > Cheers,
> > > > Peter
> > >
> > >
>
>
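The behaviour described in the quoted messages is: reject truncated and
overlong sequences, bail out on a continuation byte that has no leading
byte, and put all of the undecoded bytes back when the decoded character
is not in the set to decode. It can be sketched roughly as follows. This
is a Python illustration under assumptions, not the committed Scheme
code: the function names, the `decodable` parameter and the use of
ValueError are all hypothetical.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def read_pct_byte(text, pos):
    """Return the byte value of a %XX escape at pos, or None if there is none."""
    chunk = text[pos:pos + 3]
    if len(chunk) == 3 and chunk[0] == "%" and all(c in HEX_DIGITS for c in chunk[1:]):
        return int(chunk[1:], 16)
    return None


def pct_decode(text, decodable=None):
    """Decode %XX escapes as UTF-8; decodable=None means decode every character."""
    out = []
    pos = 0
    while pos < len(text):
        lead = read_pct_byte(text, pos)
        if lead is None:                 # ordinary character, copy it through
            out.append(text[pos])
            pos += 1
            continue
        # From the leading byte: sequence length, payload bits, and the
        # smallest code point that legitimately needs that many bytes.
        if lead < 0x80:
            length, cp, minimum = 1, lead, 0x0
        elif 0xC0 <= lead < 0xE0:
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead < 0xF0:
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead < 0xF8:
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:                            # 0x80..0xBF or 0xF8..0xFF
            raise ValueError("continuation byte without a leading byte, or invalid byte")
        for k in range(1, length):
            tail = read_pct_byte(text, pos + 3 * k)
            if tail is None or tail & 0xC0 != 0x80:
                raise ValueError("truncated UTF-8 sequence")
            cp = (cp << 6) | (tail & 0x3F)
        if cp < minimum:
            raise ValueError("overlong UTF-8 encoding")
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("code point outside the range allowed by RFC 3629")
        escaped = text[pos:pos + 3 * length]
        char = chr(cp)
        # Keep the whole escape sequence, not just its first byte, when the
        # character is not in the set we were asked to decode.
        out.append(char if decodable is None or char in decodable else escaped)
        pos += 3 * length
    return "".join(out)
```

For example, pct_decode("a%2Fb%C3%A9", decodable="é") leaves "%2F" alone
and decodes only the "é", while "%C3" on its own and the overlong
"%C0%AF" both raise.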