At some point you may need to determine whether a character
is Latin, Greek, Cyrillic and so on. This is much simpler to
do with UCS-4 than with other representations, because a UCS-4
value is a 4-tuple consisting of group, plane, row and cell.
You can quickly determine which group and plane any value
belongs to, and then narrow it down to a script by row and
cell. Script test functions can be built on top of that:
PROCEDURE isLatin ( ch : UNICHAR ) : BOOLEAN;
PROCEDURE isGreek ( ch : UNICHAR ) : BOOLEAN;
PROCEDURE isCyrillic ( ch : UNICHAR ) : BOOLEAN;
and so on.
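
As a rough illustration of how the group/plane/row/cell view leads to such tests, here is a minimal sketch. It assumes UNICHAR is simply an alias of a CARDINAL of at least 32 bits and that a PIM-style InOut library is available; both are assumptions of this example, as are the helpers group, plane, row, cell, inBMP and report. The tests are based on block ranges, so they are an approximation of script membership, not a full implementation.

```modula2
MODULE ScriptDemo;

(* Sketch only. UNICHAR as an alias of CARDINAL and the InOut
   library are assumptions; CARDINAL must be at least 32 bits. *)

FROM InOut IMPORT WriteString, WriteLn;

TYPE UNICHAR = CARDINAL;

(* The four octets of a UCS-4 value, most significant first *)
PROCEDURE group ( ch : UNICHAR ) : CARDINAL;
BEGIN RETURN ch DIV 1000000H END group;

PROCEDURE plane ( ch : UNICHAR ) : CARDINAL;
BEGIN RETURN (ch DIV 10000H) MOD 100H END plane;

PROCEDURE row ( ch : UNICHAR ) : CARDINAL;
BEGIN RETURN (ch DIV 100H) MOD 100H END row;

PROCEDURE cell ( ch : UNICHAR ) : CARDINAL;
BEGIN RETURN ch MOD 100H END cell;

(* TRUE if ch lies in group 0, plane 0 *)
PROCEDURE inBMP ( ch : UNICHAR ) : BOOLEAN;
BEGIN RETURN (group(ch) = 0) AND (plane(ch) = 0) END inBMP;

PROCEDURE isLatin ( ch : UNICHAR ) : BOOLEAN;
VAR c : CARDINAL;
BEGIN
  IF NOT inBMP(ch) THEN RETURN FALSE END;
  c := cell(ch);
  CASE row(ch) OF
    00H : RETURN ((c >= 41H) AND (c <= 5AH))    (* A..Z *)
              OR ((c >= 61H) AND (c <= 7AH))    (* a..z *)
              OR ((c >= 0C0H) AND (c # 0D7H) AND (c # 0F7H))
  | 01H : RETURN TRUE       (* Latin Extended-A, start of -B *)
  | 02H : RETURN c <= 4FH   (* rest of Latin Extended-B *)
  | 1EH : RETURN TRUE       (* Latin Extended Additional *)
  ELSE
    RETURN FALSE
  END
END isLatin;

PROCEDURE isGreek ( ch : UNICHAR ) : BOOLEAN;
BEGIN
  (* Greek and Coptic block U+0370..U+03FF, Greek Extended row 1FH *)
  RETURN inBMP(ch) AND
    (((row(ch) = 03H) AND (cell(ch) >= 70H)) OR (row(ch) = 1FH))
END isGreek;

PROCEDURE isCyrillic ( ch : UNICHAR ) : BOOLEAN;
BEGIN
  (* Cyrillic row 04H, Cyrillic Supplement U+0500..U+052F *)
  RETURN inBMP(ch) AND
    ((row(ch) = 04H) OR ((row(ch) = 05H) AND (cell(ch) <= 2FH)))
END isCyrillic;

PROCEDURE report ( name : ARRAY OF CHAR; result : BOOLEAN );
BEGIN
  WriteString(name);
  IF result THEN WriteString(": TRUE") ELSE WriteString(": FALSE") END;
  WriteLn
END report;

BEGIN
  report("isLatin(U+00E9)",    isLatin(00E9H));
  report("isGreek(U+03A9)",    isGreek(03A9H));
  report("isCyrillic(U+0416)", isCyrillic(0416H));
  report("isGreek(U+0416)",    isGreek(0416H))
END ScriptDemo.
```

The first three calls should report TRUE and the last one FALSE. A real library would more likely drive these tests from data tables than from hard-coded ranges.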
With these functions you can then write functions for
lowercase/uppercase conversions and removal of diacritics
with relatively modest effort.
PROCEDURE toLower ( ch : UNICHAR ) : UNICHAR;
PROCEDURE toUpper ( ch : UNICHAR ) : UNICHAR;
These names assume qualified import, so calls read as
Unichar.toLower() and Unichar.toUpper().
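
To show why the effort is modest once the script ranges are known, here is a minimal sketch of both conversions. It covers only Basic Latin, the Latin-1 letters, basic Greek and basic Cyrillic, where upper and lower case differ by a fixed offset. UNICHAR as an alias of CARDINAL, the InOut library and the helpers inRange and check are assumptions of this example.

```modula2
MODULE CaseDemo;

(* Sketch only. UNICHAR as an alias of CARDINAL and the InOut
   library are assumptions of this example. *)

FROM InOut IMPORT WriteString, WriteLn;

TYPE UNICHAR = CARDINAL;

PROCEDURE inRange ( ch, lo, hi : UNICHAR ) : BOOLEAN;
BEGIN
  RETURN (ch >= lo) AND (ch <= hi)
END inRange;

PROCEDURE toLower ( ch : UNICHAR ) : UNICHAR;
BEGIN
  IF inRange(ch, 0041H, 005AH) THEN                     (* A..Z *)
    RETURN ch + 20H
  ELSIF inRange(ch, 00C0H, 00DEH) AND (ch # 00D7H) THEN (* Latin-1 *)
    RETURN ch + 20H
  ELSIF inRange(ch, 0391H, 03A9H) AND (ch # 03A2H) THEN (* Greek *)
    RETURN ch + 20H
  ELSIF inRange(ch, 0410H, 042FH) THEN                  (* Cyrillic *)
    RETURN ch + 20H
  ELSIF inRange(ch, 0400H, 040FH) THEN       (* Cyrillic extensions *)
    RETURN ch + 50H
  ELSE
    RETURN ch                                (* no mapping known *)
  END
END toLower;

PROCEDURE toUpper ( ch : UNICHAR ) : UNICHAR;
BEGIN
  IF inRange(ch, 0061H, 007AH) THEN                     (* a..z *)
    RETURN ch - 20H
  ELSIF inRange(ch, 00E0H, 00FEH) AND (ch # 00F7H) THEN (* Latin-1 *)
    RETURN ch - 20H
  ELSIF ch = 03C2H THEN                 (* final sigma -> SIGMA *)
    RETURN 03A3H
  ELSIF inRange(ch, 03B1H, 03C9H) THEN                  (* Greek *)
    RETURN ch - 20H
  ELSIF inRange(ch, 0430H, 044FH) THEN                  (* Cyrillic *)
    RETURN ch - 20H
  ELSIF inRange(ch, 0450H, 045FH) THEN       (* Cyrillic extensions *)
    RETURN ch - 50H
  ELSE
    RETURN ch        (* includes U+00DF and U+00FF, left unchanged *)
  END
END toUpper;

PROCEDURE check ( name : ARRAY OF CHAR; ok : BOOLEAN );
BEGIN
  WriteString(name);
  IF ok THEN WriteString(": ok") ELSE WriteString(": FAILED") END;
  WriteLn
END check;

BEGIN
  check("Latin A -> a",        toLower(0041H) = 0061H);
  check("Latin-1 e-acute",     toUpper(00E9H) = 00C9H);
  check("Cyrillic ZHE",        toLower(0416H) = 0436H);
  check("Cyrillic IO",         toLower(0401H) = 0451H);
  check("Greek final sigma",   toUpper(03C2H) = 03A3H);
  check("multiplication sign", toLower(00D7H) = 00D7H)
END CaseDemo.
```

The final sigma case shows that even within one script a fixed offset is not always enough, which is where the script tests and a few special cases come in.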
PROCEDURE hasDiacritic ( ch : UNICHAR ) : BOOLEAN;
PROCEDURE charByRemovingDiacritic ( ch : UNICHAR ) : UNICHAR;
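
One possible shape for these two functions is a lookup from each precomposed character to its base letter, with hasDiacritic derived from it. The sketch below hard-codes a small sample (Latin-1 letters plus a few Greek and Cyrillic ones) in a CASE statement; a real library would presumably use generated tables. UNICHAR as an alias of CARDINAL, the InOut library and the helper check are assumptions of this example.

```modula2
MODULE DiacriticDemo;

(* Sketch only. UNICHAR as an alias of CARDINAL and the InOut
   library are assumptions; the mapping is a small sample. *)

FROM InOut IMPORT WriteString, WriteLn;

TYPE UNICHAR = CARDINAL;

PROCEDURE charByRemovingDiacritic ( ch : UNICHAR ) : UNICHAR;
BEGIN
  CASE ch OF
    00C0H .. 00C5H : RETURN 0041H   (* A with grave .. ring -> A *)
  | 00C7H          : RETURN 0043H   (* C with cedilla -> C *)
  | 00C8H .. 00CBH : RETURN 0045H   (* E *)
  | 00CCH .. 00CFH : RETURN 0049H   (* I *)
  | 00D1H          : RETURN 004EH   (* N with tilde -> N *)
  | 00D2H .. 00D6H : RETURN 004FH   (* O *)
  | 00D9H .. 00DCH : RETURN 0055H   (* U *)
  | 00DDH          : RETURN 0059H   (* Y with acute -> Y *)
  | 00E0H .. 00E5H : RETURN 0061H   (* a *)
  | 00E7H          : RETURN 0063H   (* c *)
  | 00E8H .. 00EBH : RETURN 0065H   (* e *)
  | 00ECH .. 00EFH : RETURN 0069H   (* i *)
  | 00F1H          : RETURN 006EH   (* n *)
  | 00F2H .. 00F6H : RETURN 006FH   (* o *)
  | 00F9H .. 00FCH : RETURN 0075H   (* u *)
  | 00FDH, 00FFH   : RETURN 0079H   (* y *)
  | 0386H          : RETURN 0391H   (* Greek ALPHA with tonos *)
  | 03ACH          : RETURN 03B1H   (* Greek alpha with tonos *)
  | 0401H          : RETURN 0415H   (* Cyrillic IO -> IE *)
  | 0451H          : RETURN 0435H   (* Cyrillic io -> ie *)
  | 0419H          : RETURN 0418H   (* Cyrillic SHORT I -> I *)
  | 0439H          : RETURN 0438H   (* Cyrillic short i -> i *)
  ELSE
    RETURN ch                       (* no diacritic known *)
  END
END charByRemovingDiacritic;

PROCEDURE hasDiacritic ( ch : UNICHAR ) : BOOLEAN;
BEGIN
  (* A character has a diacritic if removing it changes the value *)
  RETURN charByRemovingDiacritic(ch) # ch
END hasDiacritic;

PROCEDURE check ( name : ARRAY OF CHAR; ok : BOOLEAN );
BEGIN
  WriteString(name);
  IF ok THEN WriteString(": ok") ELSE WriteString(": FAILED") END;
  WriteLn
END check;

BEGIN
  check("e-acute -> e",    charByRemovingDiacritic(00E9H) = 0065H);
  check("e-acute has one", hasDiacritic(00E9H));
  check("plain e has none", NOT hasDiacritic(0065H));
  check("Cyrillic io -> ie", charByRemovingDiacritic(0451H) = 0435H)
END DiacriticDemo.
```

Deriving hasDiacritic from the mapping keeps the two functions consistent with each other, at the cost of one extra lookup.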
You may also want to read up on Unicode normalisation and
implement
PROCEDURE normalize ( ch : UNICHAR ) : UNICHAR;
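
A function with this signature can only handle the cases where one code point maps to exactly one other code point; full Unicode normalisation may replace a code point by a sequence. As a minimal sketch under that restriction, the example below maps a handful of code points whose canonical equivalent is a single other code point. UNICHAR as an alias of CARDINAL, the InOut library and the helper check are assumptions of this example.

```modula2
MODULE NormalizeDemo;

(* Sketch only. UNICHAR as an alias of CARDINAL and the InOut
   library are assumptions. Covers a few single code point
   canonical mappings, not full normalisation. *)

FROM InOut IMPORT WriteString, WriteLn;

TYPE UNICHAR = CARDINAL;

PROCEDURE normalize ( ch : UNICHAR ) : UNICHAR;
BEGIN
  CASE ch OF
    037EH : RETURN 003BH   (* GREEK QUESTION MARK -> SEMICOLON *)
  | 0387H : RETURN 00B7H   (* GREEK ANO TELEIA -> MIDDLE DOT *)
  | 2126H : RETURN 03A9H   (* OHM SIGN -> GREEK CAPITAL OMEGA *)
  | 212AH : RETURN 004BH   (* KELVIN SIGN -> LATIN CAPITAL K *)
  | 212BH : RETURN 00C5H   (* ANGSTROM SIGN -> A WITH RING ABOVE *)
  ELSE
    RETURN ch              (* already in normal form, or not covered *)
  END
END normalize;

PROCEDURE check ( name : ARRAY OF CHAR; ok : BOOLEAN );
BEGIN
  WriteString(name);
  IF ok THEN WriteString(": ok") ELSE WriteString(": FAILED") END;
  WriteLn
END check;

BEGIN
  check("OHM SIGN -> OMEGA",  normalize(2126H) = 03A9H);
  check("KELVIN SIGN -> K",   normalize(212AH) = 004BH);
  check("OMEGA is unchanged", normalize(03A9H) = 03A9H)
END NormalizeDemo.
```

Normalising before calling the script tests means, for example, that the ohm sign is then classified as Greek.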
If you later want to implement a Unicode string library
with lexicographic sorting, you will also need some basic
functions to determine the lexicographical relationship
between two characters, such as
PROCEDURE precedes ( ch1, ch2 : UNICHAR ) : BOOLEAN;
PROCEDURE succeeds ( ch1, ch2 : UNICHAR ) : BOOLEAN;
To make the lexicographical order configurable, implement
user-replaceable tables that define the collation order for
each script. The functions precedes() and succeeds() then
consult whichever table is currently installed to determine
which character comes before or after which.
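
The table idea can be sketched as follows, using a single table for the basic Cyrillic row as the example. The table layout (one weight per cell), the procedure installCyrillicTable, the helpers weight and check, UNICHAR as an alias of CARDINAL and the InOut library are all assumptions of this sketch. It installs a table that sorts Cyrillic IO directly after IE, as in Russian alphabetical order, where plain code point order would put it before A.

```modula2
MODULE CollationDemo;

(* Sketch only. UNICHAR as an alias of CARDINAL, the table layout,
   installCyrillicTable and the InOut library are assumptions. *)

FROM InOut IMPORT WriteString, WriteLn;

TYPE
  UNICHAR = CARDINAL;
  (* One weight per cell of row 04H, the basic Cyrillic block *)
  CollationTable = ARRAY [0..255] OF CARDINAL;

VAR
  cyrillic : CollationTable;   (* the currently installed table *)
  table    : CollationTable;
  i        : CARDINAL;

PROCEDURE installCyrillicTable ( VAR t : CollationTable );
BEGIN
  cyrillic := t                (* keep a private copy *)
END installCyrillicTable;

PROCEDURE weight ( ch : UNICHAR ) : CARDINAL;
BEGIN
  IF (ch DIV 100H) = 04H THEN
    RETURN 0800H + cyrillic[ch MOD 100H]
  ELSE
    RETURN 2 * ch   (* no table for this script: code point order *)
  END
END weight;

PROCEDURE precedes ( ch1, ch2 : UNICHAR ) : BOOLEAN;
BEGIN
  RETURN weight(ch1) < weight(ch2)
END precedes;

PROCEDURE succeeds ( ch1, ch2 : UNICHAR ) : BOOLEAN;
BEGIN
  RETURN weight(ch1) > weight(ch2)
END succeeds;

PROCEDURE check ( name : ARRAY OF CHAR; ok : BOOLEAN );
BEGIN
  WriteString(name);
  IF ok THEN WriteString(": ok") ELSE WriteString(": FAILED") END;
  WriteLn
END check;

BEGIN
  (* Default weights are 2 * cell, which leaves a free slot
     between any two neighbours and matches code point order *)
  FOR i := 0 TO 255 DO table[i] := 2 * i END;
  table[01H] := 2 * 15H + 1;   (* IO sorts right after IE *)
  table[51H] := 2 * 35H + 1;   (* io sorts right after ie *)
  installCyrillicTable(table); (* must happen before any lookup *)

  check("IE precedes IO",   precedes(0415H, 0401H));
  check("IO precedes ZHE",  precedes(0401H, 0416H));
  check("ZHE succeeds IO",  succeeds(0416H, 0401H))
END CollationDemo.
```

Spacing the default weights two apart is one simple way to let a user table slot a character between two neighbours without renumbering everything else.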