Re: [bug-libunistring] Changing the appearance of escapes


From: Mike Gran
Subject: Re: [bug-libunistring] Changing the appearance of escapes
Date: Fri, 17 Sep 2010 11:04:11 -0700 (PDT)

Hi Ludo, Bruno-

I'm not sure if 'replying to all' is appropriate here, but I have
a couple of comments.

> > The way I recommend to do it is:
> >  - For ports with an input direction, store in the port an iconv_t
> >    descriptor from the given encoding to UTF-8. Similarly, for ports with
> >    an output direction, store in it an iconv_t descriptor from UTF-8 to
> >    the encoding.
> >    (Why UTF-8 and not UTF-32 = UCS-4? Because on all platforms you can
> >    convert from UTF-8 to anything and vice versa, but not from UTF-32
> >    from/to anything. Solaris for example.)
> 
> Hmm, OK.  It’s actually not a problem to use UTF-8 instead of UCS-4 when
> reading from an input port.
> 
> >  - In the input direction you'll also need a small buffer (up to 6 bytes
> >    or so) for bytes that have already been read from the stream but not
> >    yet converted to characters. Near this, you'll also have a character
> >    or bit that is used to implement the CRLF -> LF conversion.
> >  - The most tricky thing is to handle all possible errors and return values
> >    from iconv() correctly.
> >  - In the output direction, an iconv_t can produce a couple of bytes at the
> >    end, that you need to output before closing the stream. This is needed
> >    for stateful encodings such as CP1258, UTF-7, or UTF-16 (with BOM). But
> >    only if you want to support stateful encodings at all. All encodings
> >    used by locales are stateless.
> 
> OK.
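
To make the input-side points above concrete, here is a minimal sketch of
feeding raw bytes through an iconv_t one at a time, keeping the unconverted
bytes in a small buffer. The struct in_port and in_port_feed() names are
hypothetical and are not Guile's actual port code. The sketch assumes each
input sequence decodes to at most one UTF-8 character, and it simply drops
the pending bytes on EILSEQ or E2BIG; a real port would need a policy for
those cases.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical per-port input state; not Guile's actual port layout. */
struct in_port {
    iconv_t cd;          /* from the port's encoding to UTF-8 */
    char    pending[8];  /* bytes read but not yet converted */
    size_t  npending;
};

/* Feed one raw byte.  Returns the number of UTF-8 bytes stored in out
   (0 while the character is still incomplete), or -1 on error. */
static int
in_port_feed(struct in_port *p, unsigned char byte, char *out, size_t outsize)
{
    assert(outsize >= 4);  /* room for one UTF-8 encoded character */
    if (p->npending == sizeof p->pending)
        return -1;         /* assumed: no character needs this many bytes */
    p->pending[p->npending++] = (char) byte;

    char  *in = p->pending;
    size_t inleft = p->npending;
    char  *dst = out;
    size_t outleft = outsize;

    if (iconv(p->cd, &in, &inleft, &dst, &outleft) == (size_t) -1) {
        if (errno != EINVAL) {   /* EILSEQ (invalid input) or E2BIG */
            p->npending = 0;
            return -1;
        }
        /* EINVAL: incomplete sequence; keep the unconsumed tail. */
        memmove(p->pending, in, inleft);
    }
    p->npending = inleft;
    return (int) (dst - out);
}
```

The caller would open the descriptor with iconv_open("UTF-8", port_encoding)
and call in_port_feed() for each byte read from the stream.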

Guile has basically two port APIs now: the legacy API and the R6RS API.
If we are considering writing our own iconv_t-based converter for Guile
because of the escapes problem, we could also start supporting stateful
encodings.  But, to do so, it would be convenient to try to push people
along to the new R6RS API.
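
For the output direction, supporting stateful encodings mostly means
flushing the converter before the port is closed. A minimal sketch, where
out_port_flush() is a hypothetical helper name: calling iconv() with a NULL
input buffer and a non-NULL output buffer is the POSIX way to reset the
descriptor and have it store any closing sequence.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Emit whatever bytes the encoder still owes, such as a shift sequence
   back to the initial state.  Returns the number of bytes stored in out,
   or -1 on error (errno is E2BIG if out is too small). */
static int
out_port_flush(iconv_t cd, char *out, size_t outsize)
{
    assert(out != NULL);

    char  *dst = out;
    size_t outleft = outsize;

    /* A NULL input buffer asks iconv() to reset cd to its initial state
       and store any closing sequence in the output buffer. */
    if (iconv(cd, NULL, NULL, &dst, &outleft) == (size_t) -1)
        return -1;
    return (int) (dst - out);
}
```

With a stateless target encoding the call stores nothing and returns 0, so
it is harmless to do it unconditionally when closing an output port.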

The problem with the legacy input API w.r.t. stateful encodings is the
'unread-char' and 'unread-string' operations.  Unreading stateful encodings
is difficult. (It could be done in our legacy ports by always keeping
a dynamically allocated pushback buffer encoded in UTF-32 or UTF-8 instead
of using the port's encoding like we do now.)
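
A rough sketch of such a pushback buffer, here holding decoded UTF-32 code
points so that it never depends on the port's encoding or shift state. The
struct pushback, pushback_unread() and pushback_read() names are made up
for illustration and are not existing Guile functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pushback stack of decoded code points (UTF-32). */
struct pushback {
    uint32_t *chars;
    size_t    len, cap;
};

/* Push one character back.  Returns 0 on success, -1 if out of memory. */
static int
pushback_unread(struct pushback *pb, uint32_t c)
{
    if (pb->len == pb->cap) {
        size_t ncap = pb->cap ? pb->cap * 2 : 4;
        uint32_t *n = realloc(pb->chars, ncap * sizeof *n);
        if (n == NULL)
            return -1;
        pb->chars = n;
        pb->cap = ncap;
    }
    pb->chars[pb->len++] = c;
    return 0;
}

/* Returns 1 and stores the most recently unread character, or 0 if the
   buffer is empty and the caller should decode from the stream instead. */
static int
pushback_read(struct pushback *pb, uint32_t *c)
{
    assert(c != NULL);
    if (pb->len == 0)
        return 0;
    *c = pb->chars[--pb->len];
    return 1;
}
```

A read-char implementation would call pushback_read() first and only go to
the iconv_t-based decoder when it returns 0.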

I believe that the R6RS has lookahead ops but no unread ops.  Lookahead
is easier because, if I understand correctly, you can make a copy of
the iconv_t and use that to do the lookahead.  That way, instead
of fetching lookahead data and then buffering it in the pushback buffer,
you don't need a pushback buffer at all and can rely on the underlying
libc to do the unget caching.

In the R6RS case, the R6RS transcoder becomes a stateful 
input-iconv_t/output-iconv_t pair and can basically be its own object,
independent of the port but dynamically attached to it.


Thanks,

Mike


