Re: [bug-libunistring] BUG: u8_ct_totitle seems to ignore prefix context


From: Bruno Haible
Subject: Re: [bug-libunistring] BUG: u8_ct_totitle seems to ignore prefix context
Date: Wed, 28 Dec 2022 07:55:51 +0100

Hi,

Aki Tuomi wrote:
> I was trying to use u8_ct_totitle to do streaming title casing, but I ran 
> into issue that this function does not actually consider prefix context when 
> doing title casing.
> 
> Please see attached example. The input file is 
> https://unicode.org/udhr/d/udhr_fra.txt
> 
> Steps to reproduce:
> 
> Compile & run the program with the input file
> 
> Expected results:
> 
> Déclaration Universelle Des Droits De L’homme
> 
> Actual results:
> 
> DÉclaration Universelle Des DroiTs De L’homm

Two things prevent the function u8_ct_totitle from being useful for
"streaming" conversion:

  * The function is not meant to be "restartable" (like mbsnrtowcs or iconv);
    it is specified to convert an entire substring. Since you chose pieces
    of 32 units as substrings, each piece is title-cased on its own, and
    you see an extra uppercasing roughly every 32 units.

  * You passed unicase_empty_suffix_context, because in a streaming
    conversion you obviously have only a bounded-size lookahead. But with
    a bounded-size lookahead one cannot do a totitle conversion. See the
    "After C" conditions of the Unicode Standard,
    <https://www.unicode.org/versions/Unicode5.0.0/ch03.pdf>, section 3.13,
    table 3-14 "Context Specification for Casing".
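
To illustrate the first point, here is a minimal sketch of what happens when
each fixed-size piece is title-cased independently. It does not use
libunistring: ascii_totitle_piece() and totitle_in_chunks() are hypothetical,
ASCII-only stand-ins, and the chunk size of 8 is chosen only to keep the
example short.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical ASCII-only stand-in for a title-casing call that is given
   an empty prefix context: it assumes that s[0] starts a new word. */
void ascii_totitle_piece(char *s, size_t n)
{
    int in_word = 0;  /* empty prefix context: nothing precedes s[0] */
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char) s[i];
        if (isalpha(c)) {
            /* First letter of a word is uppercased, the rest lowercased. */
            s[i] = (char) (in_word ? tolower(c) : toupper(c));
            in_word = 1;
        } else {
            in_word = 0;
        }
    }
}

/* Convert s in independent pieces of at most `chunk` bytes, similar to
   what the reported program did with pieces of 32 units. */
void totitle_in_chunks(char *s, size_t n, size_t chunk)
{
    assert(chunk > 0);
    for (size_t off = 0; off < n; off += chunk) {
        size_t len = n - off < chunk ? n - off : chunk;
        ascii_totitle_piece(s + off, len);
    }
}
```

Applied to "des droits de" with a chunk size of 8, the pieces are "des droi"
and "ts de", and the result is "Des DroiTs De": the piece boundary inside
"droits" is treated as the start of a word, which is the same effect as the
"DroiTs" in the reported output.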

To implement streaming conversion, I would dissect the input stream into
pieces where the prefix-context and suffix-context are known to be empty,
and work on these pieces. For example, reading the input line by line with
the getline() function
<https://pubs.opengroup.org/onlinepubs/9699919799/functions/getline.html>
produces such a dissection. Or you can use getdelim()
<https://pubs.opengroup.org/onlinepubs/9699919799/functions/getdelim.html>
with newline and space as separators for this purpose.

Note again that the pieces returned by getline() or getdelim() can be
arbitrarily large in the worst case; as mentioned above, the problem
cannot be solved with bounded-size lookahead.
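
A minimal sketch of the line-by-line dissection could look as follows. The
function names totitle_stream() and ascii_totitle_line() are hypothetical,
and ascii_totitle_line() is only an ASCII stand-in: in a real program this
is the place where a whole-string conversion such as u8_totitle() from
<unicase.h> would be applied to each line.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical ASCII-only stand-in for title-casing one complete piece.
   A real program would call a libunistring function here instead. */
void ascii_totitle_line(char *s, size_t n)
{
    int in_word = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char) s[i];
        if (isalpha(c)) {
            s[i] = (char) (in_word ? tolower(c) : toupper(c));
            in_word = 1;
        } else {
            in_word = 0;
        }
    }
}

/* Read `in` line by line, title-case each line as a self-contained piece,
   and write it to `out`. Returns 0 on success, -1 on an I/O error. */
int totitle_stream(FILE *in, FILE *out)
{
    char *line = NULL;   /* buffer owned and grown by getline() */
    size_t cap = 0;
    ssize_t len;
    int status = 0;

    while ((len = getline(&line, &cap, in)) != -1) {
        ascii_totitle_line(line, (size_t) len);
        if (fwrite(line, 1, (size_t) len, out) != (size_t) len) {
            status = -1;
            break;
        }
    }
    if (ferror(in))
        status = -1;
    free(line);
    return status;
}
```

getline() grows the buffer as needed, so a very long line is still handled
as one piece, at the cost of unbounded memory use. Note that getdelim()
accepts a single delimiter character per call, so a variant that splits on
both newline and space would need some additional logic.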

Bruno





