On Fri, 8 Mar 2024 at 06:36, Alice Osako wrote:
I think I will fork the Lilley code as a starting point for a
library implementation, modifying it to use UTF-32 rather than UTF-16.
This will mean using the GPL v.3 as its license, of course, but I
have no objection to that. While it is certainly not necessary so
long as I follow the GPL requirements, I would feel much better
about this if I were to hear from Chris Lilley about this.
I think you are overthinking the task at hand.
Write down your requirements first, like so:
(1) read UTF-8 from a file, convert every UTF-8 value read to an equivalent UCS-4 value
(2) write UTF-8 to a file, convert every UCS-4 value to be written to an equivalent UTF-8 value.
Then ask yourself questions like
Do I need uppercase/lowercase conversion on the UCS-4 text?
Do I need to match accented to non-accented UCS-4 characters and vice versa?
Do I need lexicographical sorting of UCS-4 text? If so, in which realm (Latin, Greek, Cyrillic...)?
etc etc
You will find that converting between UTF-8 and UCS-4 is rather straightforward.
The complexity that you seem to find intimidating lies in the text processing bits.
But how much of the latter do you really need for a JSON parser?
The way I see it, your task is to write a UTF-8 to UCS-4 decoder, and a UCS-4 to UTF-8 encoder.
And that task will be easier to do from scratch than to modify an existing UTF-8/UCS-2 decoder/encoder.
I would recommend the same approach we chose in our
revision:
* read and write UTF-8 from and to disk
* convert on the fly between UTF-8 and in-memory format
* use ISO 10646 UCS-4 as in-memory format when processing
text
If any other encodings are needed, convert between
in-memory UCS-4 and those formats while doing IO.
This does sound like the best approach, though I am honestly
trepidatious about the whole project. I have only a basic grasp of
UNICODE in general, and am hesitant to dive into the standard(s) to
the level needed for this project.
Like I said, you are overthinking this.
What made Unicode so nightmarish to work with were the 16-bit formats because they didn't cover the entire Unicode range and weren't linear. Also, they brought endianness issues with them.
But UTF-8 is a stream of single octets, and UCS-4 covers the entire Unicode range and is linear. All the complexities related to encoding and decoding that exist with 16-bit formats are avoided altogether.
Your UTF8 I/O char buffer will be an ARRAY [0 .. 5] OF CARDINAL [0 .. 255]. The element range has to reach 255 because every octet of a multi-octet sequence has its top bit set.
Your UNICHAR type will be a subrange CARDINAL [0 .. 10FFFFH].
The UTF-8 specification (IETF RFC 2279, since obsoleted by RFC 3629, which limits UTF-8 to four octets and code points up to 10FFFF) defines a very simple encoding and decoding algorithm for converting between these two representations.
The mappings between UTF-8 and UCS-4 are as follows:
0000 0000 .. 0000 007F 0xxxxxxx
0000 0080 .. 0000 07FF 110xxxxx 10xxxxxx
0000 0800 .. 0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 .. 001F FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000 .. 03FF FFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 .. 7FFF FFFF 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
So, there are six cases to be distinguished and then simply mapped.
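To make the table concrete, here is one value worked by hand, purely as an illustration. The euro sign, UCS-4 value 0000 20AC, falls in the third row and so takes three octets:
    20AC hex             = 0010 0000 1010 1100 binary
    split into 4 + 6 + 6 = 0010  000010  101100
    add indicator bits   = 1110 0010  10 000010  10 101100
    UTF-8 octets         = E2 82 AC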
There are three steps:
(1) input data verification
For UTF-8 to UCS-4 decoding, assert that there are no leading indicator bits that don't fit the above pattern.
For UCS-4 to UTF-8 encoding, if you so desire, you can use a lookup table to filter out (yet) unassigned code points.
(2) determining which of the six cases applies
For UTF-8 to UCS-4 decoding, test the leading bits of the first octet to determine which case.
For UCS-4 to UTF-8 encoding, test the range in which the UCS-4 value lies to determine which case.
(3) copying the payload bits
Then all you have to do is collect the payload bits and copy them into their positions in the target value.
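As a rough sketch of the encoding direction, the three steps might look like this in PIM-style Modula-2. It assumes a CARDINAL of at least 32 bits and the classic InOut library. Octet, Utf8Buffer and Encode are names made up for the illustration, and the optional lookup table for unassigned code points is left out.

```modula-2
MODULE EncodeSketch;
(* Sketch only: assumes CARDINAL is at least 32 bits wide and a PIM-style
   InOut library. Octet, Utf8Buffer and Encode are hypothetical names. *)

FROM InOut IMPORT WriteCard, WriteLn;

TYPE
  Octet      = CARDINAL [0 .. 255];
  Utf8Buffer = ARRAY [0 .. 5] OF Octet;

VAR
  buf    : Utf8Buffer;
  len, i : CARDINAL;

(* Encodes the UCS-4 value ch into buf. Returns the number of octets
   written, or 0 if ch lies outside 0 .. 7FFFFFFFH. *)
PROCEDURE Encode ( ch : CARDINAL; VAR buf : Utf8Buffer ) : CARDINAL;
VAR len, lead, i : CARDINAL;
BEGIN
  (* steps 1 and 2: reject out-of-range values, else pick the case by range *)
  IF    ch <= 7FH       THEN len := 1; lead := 0
  ELSIF ch <= 7FFH      THEN len := 2; lead := 0C0H
  ELSIF ch <= 0FFFFH    THEN len := 3; lead := 0E0H
  ELSIF ch <= 1FFFFFH   THEN len := 4; lead := 0F0H
  ELSIF ch <= 3FFFFFFH  THEN len := 5; lead := 0F8H
  ELSIF ch <= 7FFFFFFFH THEN len := 6; lead := 0FCH
  ELSE RETURN 0
  END;
  (* step 3: fill the continuation octets from the back, six payload bits each *)
  i := len - 1;
  WHILE i > 0 DO
    buf[i] := 80H + ch MOD 64;
    ch := ch DIV 64;
    DEC(i)
  END;
  (* whatever bits remain go into the first octet next to its indicator bits *)
  buf[0] := lead + ch;
  RETURN len
END Encode;

BEGIN
  len := Encode(20ACH, buf);   (* 20ACH is the euro sign *)
  IF len > 0 THEN
    FOR i := 0 TO len - 1 DO
      WriteCard(buf[i], 4)     (* octets in decimal: 226 130 172, i.e. E2 82 AC *)
    END;
    WriteLn
  END
END EncodeSketch.
```

Using DIV 64 and MOD 64 instead of shifts and masks keeps the sketch independent of any dialect-specific bit-manipulation facilities.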
Basically you need two functions
PROCEDURE utf8ToUnichar ( utf8 : ARRAY [0..5] OF CARDINAL [0..255]; VAR ch : UNICHAR );
PROCEDURE unicharToUtf8 ( ch : UNICHAR; VAR utf8 : ARRAY [0..5] OF CARDINAL [0..255] );
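For the decoding direction, a minimal sketch along the same lines could look as follows. Again this assumes PIM-style Modula-2, a CARDINAL of at least 32 bits and the InOut library. Octet, Utf8Buffer and Decode are hypothetical names, and the result is a plain CARDINAL rather than UNICHAR so that the six-octet forms fit. The check against overlong forms is an extra safeguard that goes beyond the three steps described above.

```modula-2
MODULE DecodeSketch;
(* Sketch only: assumes CARDINAL is at least 32 bits wide and a PIM-style
   InOut library. Octet, Utf8Buffer and Decode are hypothetical names. *)

FROM InOut IMPORT WriteCard, WriteLn;

TYPE
  Octet      = CARDINAL [0 .. 255];
  Utf8Buffer = ARRAY [0 .. 5] OF Octet;

VAR
  buf : Utf8Buffer;
  ch  : CARDINAL;

(* Decodes the sequence starting at buf[0] into ch. Returns the number
   of octets consumed, or 0 if the sequence is malformed. *)
PROCEDURE Decode ( buf : Utf8Buffer; VAR ch : CARDINAL ) : CARDINAL;
VAR len, i, min : CARDINAL;
BEGIN
  (* steps 1 and 2: the leading bits of the first octet select the case;
     min is the smallest value that legitimately needs that many octets *)
  IF    buf[0] < 80H  THEN len := 1; ch := buf[0];        min := 0
  ELSIF buf[0] < 0C0H THEN RETURN 0   (* stray continuation octet *)
  ELSIF buf[0] < 0E0H THEN len := 2; ch := buf[0] - 0C0H; min := 80H
  ELSIF buf[0] < 0F0H THEN len := 3; ch := buf[0] - 0E0H; min := 800H
  ELSIF buf[0] < 0F8H THEN len := 4; ch := buf[0] - 0F0H; min := 10000H
  ELSIF buf[0] < 0FCH THEN len := 5; ch := buf[0] - 0F8H; min := 200000H
  ELSIF buf[0] < 0FEH THEN len := 6; ch := buf[0] - 0FCH; min := 4000000H
  ELSE RETURN 0                       (* 0FEH and 0FFH never occur in UTF-8 *)
  END;
  (* step 3: each continuation octet must be 10xxxxxx and adds six bits *)
  FOR i := 1 TO len - 1 DO
    IF (buf[i] < 80H) OR (buf[i] > 0BFH) THEN RETURN 0 END;
    ch := ch * 64 + (buf[i] - 80H)
  END;
  (* extra safeguard: reject overlong forms, i.e. not the shortest encoding *)
  IF ch < min THEN RETURN 0 END;
  RETURN len
END Decode;

BEGIN
  buf[0] := 0E2H; buf[1] := 82H; buf[2] := 0ACH;   (* UTF-8 for the euro sign *)
  buf[3] := 0; buf[4] := 0; buf[5] := 0;
  IF Decode(buf, ch) = 3 THEN
    WriteCard(ch, 6); WriteLn                      (* 8364, i.e. 20ACH *)
  END
END DecodeSketch.
```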
That's it. Straightforward. There will be tons of things you have done without much prior knowledge that were significantly more complex. Don't overthink. Don't be intimidated. It's not rocket science.
regards
benjamin