From: alexweiner
Subject: [Bug-apl] Dealing with under-bar characters, from the Unicode mailing list
Date: Sun, 16 Aug 2015 10:16:08 -0700
User-agent: Workspace Webmail 5.15.6

Hi Bug APL,

I ventured over to the Unicode mailing list to ask why under-bar characters are not in the Unicode standard. The suggestion was that we should perhaps use something called "grapheme clusters". This link was provided:

http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

as well as this, which discusses how to count characters:

http://unicode.org/faq/char_combmark.html#7


I tried to make sense of it, but it is a bit over my head at the moment. Maybe someone else can work through it and see whether it helps our situation.


-Alex
-------- Original Message --------
Subject: Re: APL Under-bar Characters
From: Khaled Hosny 
Date: Sun, August 16, 2015 9:53 am
To: alexweiner
Cc: address@hidden

On Sun, Aug 16, 2015 at 09:31:25AM -0700, alexweiner wrote:
> Khaled,
> Thank you for the link. The normalization methods were already discussed,
> specifically here:
>
> http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00047.html

Grapheme cluster boundaries detection is different from normalisation,
please read the link I provided.

> Where the problem of "how big" is ä is discussed. The answer being that this is
> one symbol, because the Unicode Consortium decided that it is also its own
> standalone character. From the thread:
>
> I'll give you an example. What would you want ⍴,'ä' to be?
>
> Right now, that could return either 1 or 2 depending on whether the ä was using
> the precomposed character (U+00E4) or the combining mark (U+0061, U+0308).
> Visually, these are identical, and generally you'd expect them to compare
> equal.

If you are counting grapheme clusters, then the answer is one in both
cases.
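
As a rough illustration of that point, here is a minimal Python sketch. It assumes the under-bar would be represented with U+0332 COMBINING LOW LINE, which is an assumption and not something settled in this thread. `count_clusters` is a hypothetical helper that only approximates the UAX #29 rules by attaching every combining mark to the character before it; a real implementation would follow the full boundary rules in the linked report.

```python
import unicodedata


def count_clusters(text: str) -> int:
    """Approximate grapheme-cluster count (hypothetical helper).

    Simplification: every code point whose general category is a Mark
    (Mn, Mc, Me) is treated as part of the preceding cluster. This is
    not the full UAX #29 algorithm.
    """
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


precomposed = "\u00e4"    # ä as one code point (U+00E4)
decomposed = "a\u0308"    # a + COMBINING DIAERESIS
underbarred = "A\u0332"   # A + COMBINING LOW LINE (assumed under-bar encoding)

for label, s in [("precomposed", precomposed),
                 ("decomposed", decomposed),
                 ("underbarred", underbarred)]:
    # len() counts code points; count_clusters() counts approximate clusters
    print(label, len(s), count_clusters(s))
```

Under these assumptions the code-point length differs between the two spellings of ä, while the approximate cluster count is one for all three strings.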

> In Unicode, equivalent strings (made up of different characters) are compared
> by performing a normalisation step prior to comparison. There are 4 different
> types of normalisation, with different behaviour.

Quoting from the link I provided:

A key feature of default Unicode grapheme clusters (both legacy and
extended) is that they remain unchanged across all canonically
equivalent forms of the underlying text. Thus the boundaries remain
unchanged whether the text is in NFC or NFD. Using a grapheme
cluster as the fundamental unit of matching thus provides a very
clear and easily explained basis for canonically equivalent
matching. This is important for applications from searching to
regular expressions.

See also: http://unicode.org/faq/char_combmark.html#7

> Now, the ä character has a precomposed form in Unicode, and if you couple that
> with the NFC normalisation form, you'd get the above expression to return 1.
>
>
> So I'm not sure why the allowance was made for ä and certain other
> characters, but not for other things (under-bar characters) that face
> similar representation issues.

It was encoded for compatibility of pre-existing character sets AFAIK.
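
The practical consequence can be sketched in Python with the standard `unicodedata` module. The sketch assumes, as an illustration only, that an under-barred letter is spelled as the base letter followed by U+0332 COMBINING LOW LINE. NFC composes a + U+0308 into the precomposed ä, but it has nothing to compose A + U+0332 into, so normalisation alone does not give a length of one for under-barred letters.

```python
import unicodedata

# Decomposed spellings: base letter followed by a combining mark
decomposed_umlaut = "a\u0308"   # a + COMBINING DIAERESIS
underbarred_a = "A\u0332"       # A + COMBINING LOW LINE (assumed encoding)

# NFC composes a sequence only where a precomposed code point exists
nfc_umlaut = unicodedata.normalize("NFC", decomposed_umlaut)
nfc_underbar = unicodedata.normalize("NFC", underbarred_a)

print(len(decomposed_umlaut), len(nfc_umlaut))   # code points before and after NFC
print(len(underbarred_a), len(nfc_underbar))     # unchanged: no precomposed form
```

This is why counting grapheme clusters, which does not depend on a precomposed code point existing, is suggested for the under-bar case.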

Regards,
Khaled


>
>
> -------- Original Message --------
> Subject: Re: APL Under-bar Characters
> From: Khaled Hosny 
> Date: Sun, August 16, 2015 8:17 am
> To: alexweiner
> Cc: address@hidden
>
> On Sun, Aug 16, 2015 at 07:35:17AM -0700, alexweiner wrote:
> > Hello Unicode Mailing List,
> >
> > There is significant discussion about the problems of adding capital letters
> > with individual under-bars in this mailing list for GNU APL.
> >
> > http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00050.html
> >
> > Pretty much it adds up to the following problem:
> >
> > The string length functionality would view an 'A' code point combined with an
> > '_' code point as an item that has two elements, while something that looks
> > like a single under-barred 'A' should be atomic, and return a length of one.
>
> I think what you need is better “character” counting [1], rather than
> new precomposed characters.
>
> Regards,
> Khaled
>
> 1. http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
>
