From: Alice Osako
Subject: Re: Inquiring about the status of the proposed UNICODE library
Date: Fri, 8 Mar 2024 05:45:02 -0500
User-agent: Mozilla Thunderbird
On Fri, 8 Mar 2024 at 06:36, Alice Osako wrote:
I think I will fork the Lilley code as a starting point for a library implementation, modifying it to use UTF-32 rather than UTF-16. This will mean using the GPL v.3 as its license, of course, but I have no objection to that. While it is certainly not necessary so long as I follow the GPL requirements, I would feel much better about this if I were to hear from Chris Lilley about it.
I think you are overthinking the task at hand.
Write down your requirements first, like so:
(1) read UTF-8 from a file, converting every UTF-8 value read to an equivalent UCS-4 value
(2) write UTF-8 to a file, converting every UCS-4 value to be written to an equivalent UTF-8 value
Then ask yourself questions like
Do I need uppercase/lowercase conversion on the UCS-4 text?
Do I need to match accented to non-accented UCS-4 characters and vice versa?
Do I need lexicographical sorting of UCS-4 text? If so, in which realm (Latin, Greek, Cyrillic...)?
etc etc
You will find that converting between UTF-8 and UCS-4 is rather straightforward.
The complexity that you seem to find intimidating lies in the text processing bits.
But how much of the latter do you really need for a JSON parser?
The way I see it, your task is to write a UTF-8 to UCS-4 decoder, and a UCS-4 to UTF-8 encoder.
And that task will be easier to do from scratch than to modify an existing UTF-8/UCS-2 decoder/encoder.
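To illustrate how little is involved (a sketch in C rather than the list's Modula-2; the function name and return convention are my own), a from-scratch decoder is essentially one branch on the lead octet. This sketch handles the modern 1- to 4-octet forms and does not fully police overlong encodings or truncated input:

```c
#include <stdint.h>
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s, storing the code point in *ch.
   Returns the number of octets consumed, or 0 on a malformed sequence. */
size_t utf8_decode(const unsigned char *s, uint32_t *ch) {
    if (s[0] < 0x80) {                  /* 0xxxxxxx: ASCII, 1 octet */
        *ch = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) { /* 110xxxxx: 2 octets */
        if ((s[1] & 0xC0) != 0x80) return 0;
        *ch = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    } else if ((s[0] & 0xF0) == 0xE0) { /* 1110xxxx: 3 octets */
        if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80) return 0;
        *ch = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    } else if ((s[0] & 0xF8) == 0xF0) { /* 11110xxx: 4 octets */
        if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80
            || (s[3] & 0xC0) != 0x80) return 0;
        *ch = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0; /* invalid lead octet */
}
```

The continuation octets all carry exactly 6 payload bits, which is why the whole thing reduces to masks and shifts.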
<detailed implementation recommendations snipped>
This does sound like the best approach, though I am honestly trepidatious about the whole project. I have only a basic grasp of Unicode in general, and am hesitant to dive into the standard(s) to the level needed for this project.
I would recommend the same approach we chose in our revision:
* read and write UTF-8 from and to disk
* convert on the fly between UTF-8 and in-memory format
* use ISO 10646 UCS-4 as in-memory format when processing text
If any other encodings are needed, convert between in-memory UCS-4 and those formats while doing IO.
Like I said, you are overthinking this.
What made Unicode so nightmarish to work with were the 16-bit formats because they didn't cover the entire Unicode range and weren't linear. Also, they brought endianness issues with them.
But UTF-8 is a stream of single octets, and UCS-4 covers the entire Unicode range and is linear. All the complexities related to encoding and decoding that exist with 16-bit formats are avoided altogether.
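To make the non-linearity concrete (a C sketch; the function name is mine): reassembling one code point above U+FFFF from a UTF-16 surrogate pair takes arithmetic like the following, whereas UCS-4 simply stores the code point value directly.

```c
#include <stdint.h>

/* Combine a UTF-16 high surrogate (D800..DBFF) and low surrogate
   (DC00..DFFF) into the code point they encode. With UCS-4 none of
   this is needed: the stored value *is* the code point. */
uint32_t surrogate_pair_to_ucs4(uint16_t hi, uint16_t lo) {
    return 0x10000u + (((uint32_t)(hi - 0xD800u)) << 10)
                    + (uint32_t)(lo - 0xDC00u);
}
```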
Basically you need two functions
PROCEDURE utf8ToUnichar ( utf8 : ARRAY [0..5] OF CARDINAL [0..255]; VAR ch : UNICHAR );
PROCEDURE unicharToUtf8 ( ch : UNICHAR; VAR utf8 : ARRAY [0..5] OF CARDINAL [0..255] );
That's it. Straightforward. There are surely tons of things you have done without much prior knowledge that were significantly more complex. Don't overthink it. Don't be intimidated. It's not rocket science.
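The encoding direction is just the mirror image: pick the sequence length from the code point's magnitude, then emit the payload bits six at a time. A hedged sketch in C standing in for the Modula-2 unicharToUtf8 above (my own function name; note it covers the modern Unicode range U+0000..U+10FFFF in at most 4 octets, whereas the 6-element arrays in the signatures allow for the original 31-bit UCS-4 range):

```c
#include <stdint.h>
#include <stddef.h>

/* Encode code point ch into utf8[], returning the octet count (1..4),
   or 0 if ch is above U+10FFFF. */
size_t ucs4_to_utf8(uint32_t ch, unsigned char utf8[4]) {
    if (ch < 0x80) {                      /* 1 octet: ASCII */
        utf8[0] = (unsigned char)ch;
        return 1;
    } else if (ch < 0x800) {              /* 2 octets */
        utf8[0] = (unsigned char)(0xC0 | (ch >> 6));
        utf8[1] = (unsigned char)(0x80 | (ch & 0x3F));
        return 2;
    } else if (ch < 0x10000) {            /* 3 octets */
        utf8[0] = (unsigned char)(0xE0 | (ch >> 12));
        utf8[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
        utf8[2] = (unsigned char)(0x80 | (ch & 0x3F));
        return 3;
    } else if (ch <= 0x10FFFF) {          /* 4 octets */
        utf8[0] = (unsigned char)(0xF0 | (ch >> 18));
        utf8[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
        utf8[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
        utf8[3] = (unsigned char)(0x80 | (ch & 0x3F));
        return 4;
    }
    return 0; /* outside the Unicode range */
}
```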