gm2
Re: Inquiring about the status of the proposed UNICODE library


From: Benjamin Kowarsch
Subject: Re: Inquiring about the status of the proposed UNICODE library
Date: Mon, 11 Mar 2024 00:03:50 +0900

> If you design the library for unqualified import, you may call the latter function toChar, calling it as Unichar.toChar().

Correction: this should have read "for qualified import".

At some point you may have use cases where you need to determine whether a character is Latin, Greek, Cyrillic, etc. This is actually much simpler to do with UCS-4 than with other representations, because a UCS-4 value is a 4-tuple consisting of group, plane, row and cell, and you can quickly determine which group and plane any value belongs to.
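To illustrate the 4-tuple view, here is a minimal sketch of how the four components could be extracted with plain DIV and MOD. It assumes UNICHAR is a CARDINAL of at least 32 bits; the module name and the accessors groupOf, planeOf, rowOf and cellOf are made up for this example and are not part of any existing library.

```modula2
MODULE Ucs4Parts;

(* Assumption: UNICHAR is a CARDINAL of at least 32 bits holding a UCS-4 value. *)
TYPE UNICHAR = CARDINAL;

VAR g, p, r, c : CARDINAL;

(* Most significant octet: the group. *)
PROCEDURE groupOf ( ch : UNICHAR ) : CARDINAL;
BEGIN
  RETURN ch DIV 1000000H
END groupOf;

(* Second octet: the plane within the group. *)
PROCEDURE planeOf ( ch : UNICHAR ) : CARDINAL;
BEGIN
  RETURN (ch DIV 10000H) MOD 100H
END planeOf;

(* Third octet: the row within the plane. *)
PROCEDURE rowOf ( ch : UNICHAR ) : CARDINAL;
BEGIN
  RETURN (ch DIV 100H) MOD 100H
END rowOf;

(* Least significant octet: the cell within the row. *)
PROCEDURE cellOf ( ch : UNICHAR ) : CARDINAL;
BEGIN
  RETURN ch MOD 100H
END cellOf;

BEGIN
  (* U+0416 CYRILLIC CAPITAL LETTER ZHE: group 0, plane 0, row 04H, cell 16H *)
  g := groupOf(0416H);
  p := planeOf(0416H);
  r := rowOf(0416H);
  c := cellOf(0416H)
END Ucs4Parts.
```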

PROCEDURE isLatin ( ch : UNICHAR ) : BOOLEAN;
PROCEDURE isGreek ( ch : UNICHAR ) : BOOLEAN;
PROCEDURE isCyrillic ( ch : UNICHAR ) : BOOLEAN;
etc.
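As a rough sketch, such predicates can start out as simple range checks against the Unicode block boundaries. The example assumes UNICHAR is a CARDINAL of at least 32 bits and only tests block membership: the Greek and Coptic block (U+0370 to U+03FF) and the Cyrillic block (U+0400 to U+04FF) also contain a few non-letters and unassigned code points, and the Latin check covers only Basic Latin letters plus U+00C0 to U+024F. A real implementation would refine these ranges.

```modula2
MODULE ScriptTests;

(* Assumption: UNICHAR is a CARDINAL of at least 32 bits holding a UCS-4 value. *)
TYPE UNICHAR = CARDINAL;

VAR ok : BOOLEAN;

(* Basic Latin letters, plus U+00C0..U+024F without the multiplication
   sign (U+00D7) and the division sign (U+00F7). *)
PROCEDURE isLatin ( ch : UNICHAR ) : BOOLEAN;
BEGIN
  RETURN ((ch >= 0041H) AND (ch <= 005AH)) OR
         ((ch >= 0061H) AND (ch <= 007AH)) OR
         ((ch >= 00C0H) AND (ch <= 024FH) AND
          (ch # 00D7H) AND (ch # 00F7H))
END isLatin;

(* Greek and Coptic block. *)
PROCEDURE isGreek ( ch : UNICHAR ) : BOOLEAN;
BEGIN
  RETURN (ch >= 0370H) AND (ch <= 03FFH)
END isGreek;

(* Cyrillic block. *)
PROCEDURE isCyrillic ( ch : UNICHAR ) : BOOLEAN;
BEGIN
  RETURN (ch >= 0400H) AND (ch <= 04FFH)
END isCyrillic;

BEGIN
  (* U+0416 CYRILLIC CAPITAL LETTER ZHE lies in the Cyrillic block. *)
  ok := isCyrillic(0416H) AND NOT isGreek(0416H) AND NOT isLatin(0416H)
END ScriptTests.
```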

With these functions you can then write functions for lowercase/uppercase conversions and removal of diacritics with relatively modest effort.

PROCEDURE toLower ( ch : UNICHAR ) : UNICHAR;
PROCEDURE toUpper ( ch : UNICHAR ) : UNICHAR;

This again assumes names chosen for qualified import, thus Unichar.toLower() and Unichar.toUpper().
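For the basic alphabets the conversion reduces to fixed offsets within each script, which is why the effort stays modest. The sketch below assumes UNICHAR is a CARDINAL of at least 32 bits and handles only Basic Latin, the main Greek capitals and the Cyrillic capitals in U+0400 to U+042F. Everything else is returned unchanged, and context-dependent cases such as the Greek final sigma are ignored.

```modula2
MODULE CaseSketch;

(* Assumption: UNICHAR is a CARDINAL of at least 32 bits holding a UCS-4 value. *)
TYPE UNICHAR = CARDINAL;

VAR lower : UNICHAR;

PROCEDURE toLower ( ch : UNICHAR ) : UNICHAR;
BEGIN
  IF (ch >= 0041H) AND (ch <= 005AH) THEN
    (* Basic Latin A..Z maps to a..z. *)
    RETURN ch + 20H
  ELSIF (ch >= 0391H) AND (ch <= 03A9H) AND (ch # 03A2H) THEN
    (* Greek capital alpha..omega; U+03A2 is unassigned. *)
    RETURN ch + 20H
  ELSIF (ch >= 0410H) AND (ch <= 042FH) THEN
    (* Cyrillic capitals U+0410..U+042F map to U+0430..U+044F. *)
    RETURN ch + 20H
  ELSIF (ch >= 0400H) AND (ch <= 040FH) THEN
    (* Cyrillic capitals U+0400..U+040F map to U+0450..U+045F. *)
    RETURN ch + 50H
  ELSE
    (* Not covered by this sketch: return the character unchanged. *)
    RETURN ch
  END
END toLower;

BEGIN
  (* U+0416 (capital ZHE) becomes U+0436 (small ZHE). *)
  lower := toLower(0416H)
END CaseSketch.
```

toUpper would mirror this with the ranges and offsets reversed.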

PROCEDURE hasDiacritic ( ch : UNICHAR ) : BOOLEAN;
PROCEDURE charByRemovingDiacritic ( ch : UNICHAR ) : UNICHAR;

You may also want to look up normalisation of Unicode code points and implement

PROCEDURE normalize ( ch : UNICHAR ) : UNICHAR;

If you later want to implement a Unicode string library with lexicographic sorting, you will also need some basic functions to determine the lexicographic relationship between two characters, like

PROCEDURE precedes ( ch1, ch2 : UNICHAR ) : BOOLEAN;
PROCEDURE succeeds ( ch1, ch2 : UNICHAR ) : BOOLEAN;

To make the lexicographic order configurable, implement user-replaceable tables that define the collation order for each script. The functions precedes() and succeeds() above then consult whichever table is installed to determine which character comes before or after which.
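One possible shape for such an installable table is sketched below for a single row (row 04H, Cyrillic). All names here (CollationTable, installCyrillic, weight) are hypothetical, UNICHAR is assumed to be a CARDINAL of at least 32 bits, and table entries are assumed to stay below 1000H. This is a single-level ordering only and much simpler than full Unicode collation. The default table doubles each cell number so that a custom table can slot a letter in between two neighbours.

```modula2
MODULE CollationSketch;

TYPE
  (* Assumption: UNICHAR is a CARDINAL of at least 32 bits. *)
  UNICHAR = CARDINAL;
  (* One weight per cell of a row; weights must stay below 1000H. *)
  CollationTable = ARRAY [0..255] OF CARDINAL;

VAR
  cyrillic : CollationTable;  (* currently installed table for row 04H *)
  custom   : CollationTable;
  i        : CARDINAL;
  ok       : BOOLEAN;

(* Replace the installed Cyrillic table. *)
PROCEDURE installCyrillic ( table : CollationTable );
BEGIN
  cyrillic := table
END installCyrillic;

(* Sort weight: the row decides the coarse order, the table the fine order. *)
PROCEDURE weight ( ch : UNICHAR ) : CARDINAL;
BEGIN
  IF ch DIV 100H = 04H THEN
    RETURN (ch DIV 100H) * 1000H + cyrillic[ch MOD 100H]
  ELSE
    (* No table for this row in the sketch: fall back to code point order. *)
    RETURN (ch DIV 100H) * 1000H + 2 * (ch MOD 100H)
  END
END weight;

PROCEDURE precedes ( ch1, ch2 : UNICHAR ) : BOOLEAN;
BEGIN
  RETURN weight(ch1) < weight(ch2)
END precedes;

PROCEDURE succeeds ( ch1, ch2 : UNICHAR ) : BOOLEAN;
BEGIN
  RETURN weight(ch1) > weight(ch2)
END succeeds;

BEGIN
  (* Default: code point order, with gaps left for insertions. *)
  FOR i := 0 TO 255 DO
    custom[i] := 2 * i
  END;
  (* Russian order: IO (U+0401) sorts between IE (U+0415) and ZHE (U+0416). *)
  custom[01H] := 2 * 15H + 1;
  installCyrillic(custom);
  ok := precedes(0415H, 0401H) AND precedes(0401H, 0416H)
END CollationSketch.
```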

In our revision of M2 we hide all of this boilerplate behind syntax sugar, so you can just do

IF str1 > str2 THEN ... END;

where str1 and str2 are of type UNISTRING.

But with PIM and ISO, you need to call the underlying library functions directly, so you need to consider usability when designing the functions.

regards
benjamin


