Re: Inquiring about the status of the proposed UNICODE library


From: Alice Osako
Subject: Re: Inquiring about the status of the proposed UNICODE library
Date: Thu, 7 Mar 2024 16:36:49 -0500
User-agent: Mozilla Thunderbird

Benjamin Kowarsch:
On Thu, 7 Mar 2024 at 18:02, Alice Osako wrote:

Even if this were not the case, supporting only a single UNICODE
encoding is potentially problematic, even if it is pretty common.

A standard should pick only what is absolutely necessary; it should not be an egg-laying wool-milk-sow.

OK, fair enough (I have never heard that expression before - I assume it is a calque of a German idiom - but its meaning does seem clear).

In our revision we defined type UNICODE as a set of 32-bit values in ISO 10646 UCS-4 representation.

With this representation every possible Unicode code point can be represented, and there are no variable-length codes, which would make scanning and accessing characters within strings by index extremely cumbersome.
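
To make the indexing argument concrete, here is a minimal sketch in Python (chosen only for brevity; it is not part of the thread and not Modula-2 library code). The function names are hypothetical. Finding the n-th code point in UTF-8 requires scanning from the start of the buffer, while a fixed-width UCS-4 sequence is a plain array access. The sketch assumes well-formed UTF-8 input.

```python
def seq_len(lead):
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def nth_code_point_utf8(data, n):
    """Return code point number n (0-based) from UTF-8 bytes: a linear scan."""
    pos = 0
    for _ in range(n):
        # Skip one whole encoded character at a time.
        pos += seq_len(data[pos])
    length = seq_len(data[pos])
    return ord(data[pos:pos + length].decode("utf-8"))


def nth_code_point_ucs4(units, n):
    """Return code point number n from a fixed-width sequence: direct indexing."""
    return units[n]


text = "a\u00e9\u20ac\U0001F600"       # 1-, 2-, 3- and 4-byte characters in UTF-8
utf8_bytes = text.encode("utf-8")      # 10 bytes
ucs4_units = [ord(c) for c in text]    # 4 values, standing in for an array of 32-bit units
```

With these definitions, `nth_code_point_utf8(utf8_bytes, 3)` and `nth_code_point_ucs4(ucs4_units, 3)` both yield U+1F600, but only the second does so without walking the preceding bytes.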

On modern hardware there is absolutely no need to save a few bytes when processing text in memory. The need for compactness only arises when transmitting text over communications channels and when storing text on a persistent storage medium. For this, the IO system can be designed to do a conversion to and from UTF-8 on the fly.
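
As a rough illustration of conversion at the IO boundary (a sketch only, in Python, not the design of any existing Modula-2 IO library), the reader below decodes UTF-8 incrementally as chunks arrive and hands code points to the caller. The names `read_code_points` and `write_code_points` are hypothetical; the incremental decoder comes from Python's standard `codecs` module and copes with multi-byte sequences that are split across chunk boundaries.

```python
import codecs
import io


def read_code_points(stream, chunk_size=4096):
    """Yield integer code points from a binary stream containing UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = stream.read(chunk_size)
        # An empty chunk means end of input; final=True makes the decoder
        # raise an error if a multi-byte sequence was left incomplete.
        text = decoder.decode(chunk, final=not chunk)
        for ch in text:
            yield ord(ch)
        if not chunk:
            break


def write_code_points(stream, code_points):
    """Encode integer code points as UTF-8 while writing to a binary stream."""
    for cp in code_points:
        stream.write(chr(cp).encode("utf-8"))


# Round trip through in-memory "files"; chunk_size=1 forces split sequences.
source = io.BytesIO("h\u00e9llo \u20ac\U0001F600".encode("utf-8"))
in_memory = list(read_code_points(source, chunk_size=1))
sink = io.BytesIO()
write_code_points(sink, in_memory)
```

Everything between the read and the write operates on the fixed-width list `in_memory`; UTF-8 exists only on the two stream sides.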

Consequently there is also no need to perform string operations on UTF-8 text in memory.

This is an excellent point; even though I only need UTF-8, the complexities of variable-length encoding are not necessary as part of a purely internal representation. I will follow this advice.

And if another format is required for compatibility with legacy systems, this should then be provided in the form of a library. It should not be part of the language. Especially not when the language is based on a philosophy of simplicity.

I am looking at this solely from the perspective of a user-defined library, as I am not working on the GNU Modula-2 project myself. I would be glad to donate anything I manage to develop to the GNU Modula-2 project, but I think it would make more sense for such a library to be compiler-independent, if possible.
 
I think I will fork the Lilley code as a starting point for a library implementation, modifying it to use UTF-32 rather than UTF-16. This will mean using the GPL v.3 as its license, of course, but I have no objection to that. While it is certainly not necessary so long as I follow the GPL requirements, I would feel much better about this if I were to hear from Chris Lilley about it.
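
For reference, the core difference when moving from UTF-16 to UTF-32 is that surrogate pairs disappear: each pair of 16-bit units collapses into one 32-bit value. The following is a minimal Python sketch of that step under the standard UTF-16 rules. It is not taken from the Lilley code, and the function name is hypothetical.

```python
def utf16_units_to_code_points(units):
    """Convert a sequence of 16-bit UTF-16 code units to a list of code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            # Basic Multilingual Plane: the unit is the code point itself.
            out.append(u)
            i += 1
    return out


# 'A' followed by U+1F600, which UTF-16 stores as the pair D83D DE00.
code_points = utf16_units_to_code_points([0x0041, 0xD83D, 0xDE00])
```

Once text is held as code points, the surrogate range U+D800..U+DFFF should never appear in a string at all, which removes a whole class of special cases from the string routines.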

I would recommend the same approach we chose in our revision:

* read and write UTF-8 from and to disk
* convert on the fly between UTF-8 and in-memory format
* use ISO 10646 UCS-4 as in-memory format when processing text
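
The conversion step in the list above comes down to a small amount of bit manipulation. The sketch below is in Python purely for illustration and is not code from the revision being described. All names are hypothetical. It rejects overlong forms, surrogates, and values above U+10FFFF, which is what well-formed UTF-8 requires.

```python
def utf8_encode(cp):
    """Encode one code point (an integer) as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_decode(data):
    """Decode UTF-8 bytes into a list of code points (the in-memory format)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # cp: payload bits of the lead byte; extra: continuation bytes expected;
        # minimum: smallest value allowed for this length (rejects overlong forms).
        if lead < 0x80:
            cp, extra, minimum = lead, 0, 0
        elif lead >> 5 == 0b110:
            cp, extra, minimum = lead & 0x1F, 1, 0x80
        elif lead >> 4 == 0b1110:
            cp, extra, minimum = lead & 0x0F, 2, 0x800
        elif lead >> 3 == 0b11110:
            cp, extra, minimum = lead & 0x07, 3, 0x10000
        else:
            raise ValueError("invalid lead byte")
        if i + extra >= len(data):
            raise ValueError("truncated sequence")
        for b in data[i + 1:i + 1 + extra]:
            if b >> 6 != 0b10:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("overlong or out-of-range sequence")
        out.append(cp)
        i += 1 + extra
    return out
```

For example, `utf8_decode(utf8_encode(0x20AC))` gives back the single-element list containing U+20AC, and the encoded form is the three bytes E2 82 AC.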

If any other encodings are needed, convert between in-memory UCS-4 and those formats while doing IO.

This does sound like the best approach, though I am honestly trepidatious about the whole project. I have only a basic grasp of UNICODE in general, and am hesitant to dive into the standard(s) to the level needed for this project.

The JSON implementation this is in aid of was itself meant to support another project (an LSP for Modula-2), which in turn was meant to support my primary project (an implementation of the Make A Lisp project in Modula-2). It is beginning to feel as if this is a bottomless well of sub-projects.

If it weren't for the knowledge that I would eventually need to tackle UNICODE for yet another set of projects in a different language entirely, I probably would simply throw my hands up and stick to the ASCII subset for JSON support. As it is, I am not at all convinced that I can see this library to completion on my own, but at the same time I would not expect anyone else to take interest in contributing to such a niche project.
