From: Alice Osako
Subject: Re: Inquiring about the status of the proposed UNICODE library
Date: Thu, 7 Mar 2024 16:36:49 -0500
User-agent: Mozilla Thunderbird
On Thu, 7 Mar 2024 at 18:02, Alice Osako wrote:
Even if this were not the case, supporting only a single UNICODE
encoding is potentially problematic, even if it is pretty common.
A standard should include only what is absolutely necessary; it should not be an egg-laying wool-milk-sow.
In our revision we defined type UNICODE as a set of 32-bit values in ISO 10646 UCS-4 representation.
With this representation, every possible Unicode code point can be represented, and there are no variable-length codes, which would make scanning and accessing characters within strings by index extremely cumbersome.
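To illustrate the indexing argument, here is a minimal Python sketch. It is not taken from the revision under discussion: the function names `utf8_seq_len`, `utf8_char_at` and `ucs4_char_at` are hypothetical, and the little-endian byte order for the UCS-4 buffer is an assumption. It contrasts the linear scan that UTF-8 needs with the direct offset that a fixed 32-bit representation allows.

```python
def utf8_seq_len(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, judged by its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_char_at(data: bytes, index: int) -> int:
    """Code point of the index-th character in UTF-8 data.

    Every preceding character must be skipped one by one, because each
    may occupy 1 to 4 bytes.
    """
    pos = 0
    for _ in range(index):
        pos += utf8_seq_len(data[pos])
    length = utf8_seq_len(data[pos])
    return ord(data[pos:pos + length].decode("utf-8"))


def ucs4_char_at(data: bytes, index: int) -> int:
    """Code point of the index-th character in UCS-4 data.

    Assumes 4 bytes per code point, little-endian, so the offset is
    simply 4 * index.
    """
    return int.from_bytes(data[4 * index:4 * index + 4], "little")
```

The UTF-8 lookup costs time proportional to the index, while the UCS-4 lookup is a single multiplication and a 4-byte read.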
On modern hardware there is absolutely no need to save a few bytes when processing text in memory. The need for compactness only arises when transmitting text over communication channels and when storing text on a persistent storage medium. For this, the IO system can be designed to convert to and from UTF-8 on the fly.
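As a rough sketch of what such on-the-fly conversion could look like, the following Python reads UTF-8 from a binary stream in chunks and hands back one code point at a time, then writes code points back out as UTF-8. The names `read_code_points` and `write_code_points` and the chunk size are assumptions made for illustration, and a plain sequence of integers stands in for the 32-bit in-memory type.

```python
import codecs
import io


def read_code_points(stream, chunk_size=4096):
    """Yield code points from a binary stream containing UTF-8.

    The incremental decoder keeps partial multi-byte sequences between
    chunks, so a character split across two reads is still decoded.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = stream.read(chunk_size)
        # An empty chunk signals end of input; final=True then rejects
        # a truncated trailing sequence.
        for ch in decoder.decode(chunk, final=not chunk):
            yield ord(ch)
        if not chunk:
            break


def write_code_points(stream, code_points):
    """Write code points to a binary stream as UTF-8."""
    for cp in code_points:
        stream.write(chr(cp).encode("utf-8"))


# Usage: text is UTF-8 only at the IO boundary, code points in between.
source = io.BytesIO("naïve €5".encode("utf-8"))
in_memory = list(read_code_points(source))
sink = io.BytesIO()
write_code_points(sink, in_memory)
```

The program logic between the two calls never sees a variable-length sequence.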
Consequently there is also no need to perform string operations on UTF-8 text in memory.
And if another format is required for compatibility with legacy systems, it should be provided in the form of a library. It should not be part of the language, especially not when the language is based on a philosophy of simplicity.
This does sound like the best approach, though I am honestly trepidatious about the whole project. I have only a basic grasp of UNICODE in general, and am hesitant to dive into the standard(s) to the level needed for this project.
I would recommend the same approach we chose in our revision:
* read and write UTF-8 from and to disk
* convert on the fly between UTF-8 and in-memory format
* use ISO 10646 UCS-4 as in-memory format when processing text
If any other encodings are needed, convert between in-memory UCS-4 and those formats while doing IO.
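For other encodings, the same boundary idea might be sketched as below. This is only an illustration under assumptions: the helper names `decode_to_code_points` and `encode_from_code_points` are hypothetical, the encoding names are those of Python's standard codecs, and a list of integers again stands in for the in-memory UCS-4 type.

```python
def decode_to_code_points(raw: bytes, encoding: str) -> list[int]:
    """Convert bytes in an external encoding to in-memory code points."""
    return [ord(ch) for ch in raw.decode(encoding)]


def encode_from_code_points(code_points, encoding: str) -> bytes:
    """Convert in-memory code points to bytes in an external encoding."""
    return "".join(chr(cp) for cp in code_points).encode(encoding)


# Usage: a Latin-1 file is read, processed as code points, saved as UTF-8.
legacy_bytes = "Grüße".encode("latin-1")
code_points = decode_to_code_points(legacy_bytes, "latin-1")
utf8_bytes = encode_from_code_points(code_points, "utf-8")
```

The encoding name appears only at the IO boundary. Even a UTF-16 surrogate pair arrives in memory as a single code point, so string operations need no encoding-specific cases.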