Re: [bug-libunistring] Incorrect NFKC case folding


From: Bruno Haible
Subject: Re: [bug-libunistring] Incorrect NFKC case folding
Date: Sat, 24 Feb 2024 01:34:56 +0100

Daurnimator wrote in
<https://lists.gnu.org/archive/html/bug-libunistring/2016-11/msg00005.html>
and
<https://lists.gnu.org/archive/html/bug-libunistring/2016-11/msg00006.html>:

> I ran into what seems to be a case of incorrect NFKC case-folding.
> e.g. 00AD SOFT_HYPHEN is meant to map to the empty string:
> 
> From DerivedNormalizationProps.txt
> 00AD          ; NFKC_CF;                # Cf       SOFT HYPHEN
> 
> ##############################################
> $ cat main.c
> #include <stdio.h>
> #include <unicase.h>
> 
> int main() {
>     uint8_t *before = "\xC2\xAD";
>     size_t len = 20;
>     uint8_t after[20] = {0};
>     u8_casefold(before, 2, NULL, UNINORM_NFKC, after, &len);
>     printf("Was: %s\tBecomes: %s\n", before, after);
>     return 0;
> }
> $ gcc main.c -lunistring && ./a.out
> Was: ­ Becomes: ­
> $ gcc main.c -lunistring && ./a.out | xxd
> 00000000: 5761 733a 20c2 ad09 4265 636f 6d65 733a  Was: ...Becomes:
> 00000010: 20c2 ad0a                                 ...
> ##############################################

Where do you get the expectation that u8_casefold maps U+00AD to the empty
string?

> For reference, here is a list of codepoints that seem to have the wrong
> result, it was generated with the following lua 5.3 program:
> 
> ```
> local unistring = require "unistring" -- From https://github.com/daurnimator/lua-unistring
> 
> for line in io.lines("DerivedNormalizationProps.txt") do
>     local codepoint, to = line:match("^(%x+) *; NFKC_CF;([%x ]*)")
>     if codepoint then
>         codepoint = tonumber(codepoint, 16)
>         local t = {}
>         for cp in to:gmatch("%x+") do
>             table.insert(t, tonumber(cp, 16))
>         end
>         if utf8.char(table.unpack(t)) ~=
> unistring.casefold(utf8.char(codepoint), nil, "NFKC") then
>             print("FAILED", line)
>         end
>     end
> end
> ```
> 
> FAILED 00AD          ; NFKC_CF;                # Cf       SOFT HYPHEN
> FAILED 034F          ; NFKC_CF;                # Mn       COMBINING GRAPHEME JOINER
> FAILED 061C          ; NFKC_CF;                # Cf       ARABIC LETTER MARK
> ...

Ah, you are relying on the NFKC_CF derived property. Well, the Unicode Standard
(15.0, section 3.13, subsections "Default Case Folding" and "Default Caseless
Matching", p. 155 and 158)
<https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf>
explains that there is a variant that "is based on Case_Folding(C),
but also removes any characters which have the Unicode property value
Default_Ignorable_Code_Point = True."
The characters in your list that have the "wrong result" (U+00AD, U+034F,
U+061C, etc.) are those with the property Default_Ignorable_Code_Point.

So, it seems that you want u8_casefold to behave like toNFKC_Casefold.

But the Unicode Standard makes it clear that this function has limited
applicability: it is "designed for best behavior when doing caseless
matching of strings interpreted as identifiers".

Similarly, p. 158 says:
"Caseless matching for identifiers can be simplified and optimized by using
the NFKC_Casefold mapping. That mapping incorporates internally the derived
results of iterated case folding and NFKD normalization. It also maps away
characters with the property value Default_Ignorable_Code_Point = True,
which should not make a difference when comparing identifiers."
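
If that identifier-style behaviour is what you need, one option is to
post-process the output of u8_casefold yourself. The following is only a
minimal sketch in plain C, not something libunistring provides: the function
name strip_ignorable_subset is hypothetical, the table covers just the three
code points named in this thread (the real Default_Ignorable_Code_Point set is
much larger), and the input is assumed to be well-formed UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative subset only: the three code points named in this thread.
   The full Default_Ignorable_Code_Point property covers many more. */
static bool is_ignorable_subset(uint32_t cp)
{
    return cp == 0x00AD || cp == 0x034F || cp == 0x061C;
}

/* Byte length of the UTF-8 sequence that starts with lead byte b.
   ASCII and stray continuation bytes are treated as length 1. */
static size_t utf8_len(uint8_t b)
{
    if (b < 0xC0) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Decode one UTF-8 sequence of the given length (assumed well-formed). */
static uint32_t utf8_decode(const uint8_t *s, size_t len)
{
    static const uint8_t lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    uint32_t cp = s[0] & lead_mask[len];
    for (size_t i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    return cp;
}

/* Copy src[0..n) to dst, dropping characters in the subset above.
   dst must have room for n bytes. Returns the number of bytes written. */
size_t strip_ignorable_subset(const uint8_t *src, size_t n, uint8_t *dst)
{
    size_t out = 0;
    size_t i = 0;
    while (i < n) {
        size_t len = utf8_len(src[i]);
        if (len > n - i) {
            len = n - i;            /* truncated tail: copy it unchanged */
        } else if (is_ignorable_subset(utf8_decode(src + i, len))) {
            i += len;               /* drop this character */
            continue;
        }
        for (size_t k = 0; k < len; k++)
            dst[out++] = src[i + k];
        i += len;
    }
    return out;
}
```

Applied to the two bytes "\xC2\xAD" from your test program, this returns a
length of 0, which is the result you expected. A complete version would
consult the full property instead of a hard-coded table; libunistring's
<unictype.h> should be usable for that lookup, but I have not shown it here.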

Bruno





