bug-libunistring

Re: [bug-libunistring] [EXT] Re: BUG: u8_ct_totitle seems to ignore pref


From: Aki Tuomi
Subject: Re: [bug-libunistring] [EXT] Re: BUG: u8_ct_totitle seems to ignore prefix context
Date: Wed, 28 Dec 2022 09:19:06 +0200 (EET)

> On 28/12/2022 08:55 EET Bruno Haible <bruno@clisp.org> wrote:
> 
>  
> Hi,
> 
> Aki Tuomi wrote:
> > I was trying to use u8_ct_totitle to do streaming title casing, but I ran 
> > into issue that this function does not actually consider prefix context 
> > when doing title casing.
> > 
> > Please see attached example. The input file is 
> > https://unicode.org/udhr/d/udhr_fra.txt
> > 
> > Steps to reproduce:
> > 
> > Compile & run the program with the input file
> > 
> > Expected results:
> > 
> > Déclaration Universelle Des Droits De L’homme
> > 
> > Actual results:
> > 
> > DÉclaration Universelle Des DroiTs De L’homm
> 
> There are two things that prevent the function u8_ct_totitle from being useful
> for "streaming" conversion:
> 
>   * The function is not meant to be "restartable" (like mbsnrtowcs or iconv),
>     but is specified to convert an entire substring. Since, as substrings,
>     you chose to pick pieces of 32 units, you see an extra uppercasing every
>     ca. 32 units.
> 
>   * You passed a unicase_empty_suffix_context, since obviously in a streaming
>     conversion you have only a bounded-size lookahead, and with a bounded-
>     size lookahead one cannot do a totitle conversion. See the "After C"
>     conditions of the Unicode Standard,
>     <https://www.unicode.org/versions/Unicode5.0.0/ch03.pdf>, section 3.13,
>     table 3-14 "Context Specification for Casing".
> 
> To implement streaming conversion, I would dissect the input stream into
> pieces where the prefix-context and suffix-context are known to be empty,
> and work on these pieces. For example, a dissection into lines, by the
> getline() function
> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/getline.html>
> will do this dissection. Or you can use getdelim()
> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/getdelim.html>
> with newline and space as separators for this purpose.
> 
> Note again that the pieces returned by getline() or getdelim() can be
> arbitrarily large in the worst case; as mentioned above, the problem
> cannot be solved with bounded-size lookahead.
> 
> Bruno
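
For reference, a minimal sketch of the line-by-line dissection suggested
above, assuming POSIX getline(). The names for_each_line, piece_fn and
sum_lengths are made up for this illustration and are not part of
libunistring; the callback is where a whole-piece conversion such as
u8_totitle() would go.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical callback type: receives one complete piece (a line,
   including its trailing newline if there is one). Returns 0 on success. */
typedef int (*piece_fn)(const char *piece, size_t len, void *user);

/* Read `in` line by line with getline() and hand each line to `fn`.
   Returns the number of pieces processed, or -1 if `fn` reported an error.
   The line buffer grows as needed, so a piece can be arbitrarily large. */
static long for_each_line(FILE *in, piece_fn fn, void *user)
{
  char *line = NULL;   /* allocated and resized by getline() */
  size_t cap = 0;
  ssize_t len;
  long count = 0;

  assert(in != NULL && fn != NULL);
  while ((len = getline(&line, &cap, in)) != -1) {
    if (fn(line, (size_t)len, user) != 0) {
      count = -1;
      break;
    }
    count++;
  }
  free(line);
  return count;
}

/* Example callback: only adds up the piece lengths. A real callback would
   titlecase the piece here, e.g. with u8_totitle() from <unicase.h>. */
static int sum_lengths(const char *piece, size_t len, void *user)
{
  (void)piece;
  *(size_t *)user += len;
  return 0;
}
```

Calling for_each_line(stdin, sum_lengths, &total) walks the whole input;
since every piece starts and ends at a line boundary, both contexts can be
treated as empty for each piece.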

Ok. I would still like to point out that you get the same result with an
actual substring titlecasing. For example, to titlecase just the word
'Droits', pass in 'roits' with the prefix context 'D': you end up with
'Roits'.

Maybe I am just misunderstanding something here? I would have expected this
to return 'roits'.

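#include <stdio.h>
#include <string.h>
#include <unicase.h>
#include <uninorm.h>

int main(void) {
  const char *input = "roits";
  unsigned char output[32];
  /* On input, olen must hold the capacity of output; keep room for '\0'. */
  size_t olen = sizeof(output) - 1;
  casing_prefix_context_t prefix;
  prefix = u8_casing_prefix_context((const unsigned char*)"D", 1);
  /* The result lands in output only if it fits; this short input does. */
  if (u8_ct_totitle((const unsigned char*)input, strlen(input), prefix,
                    unicase_empty_suffix_context, "fr", &uninorm_nfc,
                    output, &olen) == NULL)
    return 1;
  output[olen] = '\0';
  printf("%s\n", output);
  return 0;
}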
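
To make the expectation concrete, here is an ASCII-only toy, not
libunistring code and not Unicode word segmentation. The helper name
ascii_totitle_chunk and its in_word flag are invented for this
illustration; the flag plays the role I assumed the prefix context would
play, namely "the preceding character was a letter".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, ASCII only: titlecase buf[0..n) in place.
   *in_word says whether the character just before buf was a letter and is
   updated so that it can be passed on to the next chunk. */
static void ascii_totitle_chunk(char *buf, size_t n, int *in_word)
{
  assert(buf != NULL && in_word != NULL);
  for (size_t i = 0; i < n; i++) {
    unsigned char c = (unsigned char)buf[i];
    if (isalpha(c)) {
      /* Uppercase only the first letter of a word, lowercase the rest. */
      buf[i] = (char)(*in_word ? tolower(c) : toupper(c));
      *in_word = 1;
    } else {
      *in_word = 0;
    }
  }
}
```

With in_word set to 1 (as after a preceding 'D'), the chunk "roits" stays
"roits"; with in_word set to 0 it becomes "Roits", which is what the
program above prints.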


Aki


