Re: index sorting in texi2any in C issue with spaces

From:	Gavin Smith
Subject:	Re: index sorting in texi2any in C issue with spaces
Date:	Sun, 4 Feb 2024 20:38:28 +0000
On Sun, Feb 04, 2024 at 08:38:45PM +0100, Patrice Dumas wrote:
> >        offer much more powerful solutions to collation issues.
> > 
> > - from "man perlop".)
> 
> Thanks.  This is very confusing to me, then, as it is not described that
> way in perllocale, especially in the section:
> https://perldoc.perl.org/perllocale#Category-LC_COLLATE%3A-Collation%3A-Text-Comparisons-and-Sorting
> There is more information in the end of the page that may correspond
> better to the perlop information.  Not important at all anyway
> since we agree that using the user locale is not a good idea in any case.
> 
> > > Here is my updated thinking on the possibilities
> > > 
> > > 1) lexicographic sorting on unicode strings (corresponds to
> > >                                  USE_UNICODE_COLLATION=0 currently)
> > > 2) unicode default sorting obtained by Unicode::Collate in Perl and
> > >    strxfrm_l in C with "en_US.utf-8", the current default ("en_US.utf-8"
> > >    could be different on different platforms, a list instead of only one
> > >    possibility if "en_US.utf-8" is not always available...)
> > > 3) sorting based on @documentlanguage using, in perl
> > >    Unicode::Collate::Locale with locale @documentlanguage and in C
> > >    strxfrm_l with "@documentlanguage.utf-8" (at least on GNU/Linux,
> > >    the locale name setup for strxfrm_l could be different on other
> > >    platforms).
> > > 4) sorting based on a customization variable, such as COLLATION_LANGUAGE.
> > >    it would be the same as the previous one, with @documentlanguage
> > >    replaced by COLLATION_LANGUAGE.
> > > 5) sorting based on the user locale, using strxfrm in C and
> > >    "use locale" and regular sorting on unicode (internal perl encoded)
> > >    strings in Perl.
> > > 
> > 
> > My concern here is that there are far too many options for the user to
> > decide between.  They also interact with whether XS or pure Perl modules
> > are being used (which depends on environment variables such as
> > TEXINFO_XS_STRUCTURE and other things).  As far as possible the
> > interface should not specify whether the sorting is done in C or Perl.
> 
> Ok, in principle, but I am not sure that it is really possible given the
> differences.
> 
> > What's of interest to the user are the following three things: speed,
> > correctness, and language-specific tailoring.
> 
> Point is that C is better for speed, but Perl is better for correctness.
> 
> > I think many possibilities can be covered with three customization
> > variables, USE_UNICODE_COLLATION, COLLATION_LOCALE and COLLATION_LANGUAGE:
> > 
> > 1) would be done with USE_UNICODE_COLLATION=0 as you say.  This could
> > also be implemented in C with strcmp (as Andreas pointed out).
> 
> strcmp is always used, as a transformation on the string is done with
> strxfrm_l for the collation in C.  If USE_UNICODE_COLLATION=0 the string
> is not transformed, which amounts to using strcmp on the original
> string.  Therefore it is already implemented that way in C, as can be
> seen in tp/Texinfo/XS/main/manipulate_indices.c.

Does this always happen with "texi2any -c USE_UNICODE_COLLATION=0" if
the XS modules are available or are there more restrictions?
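
To make the case in question concrete, here is a minimal sketch of what
sorting with untransformed keys amounts to: each key is a plain copy of
the entry text and the comparison is strcmp, so the order is byte order.
SKETCH_ENTRY and both function names are invented for illustration and
are not taken from manipulate_indices.c.

```c
#define _POSIX_C_SOURCE 200809L   /* for strdup */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical index entry: the text and its precomputed sort key. */
typedef struct {
  const char *text;
  char *key;
} SKETCH_ENTRY;

/* qsort callback: compare the precomputed keys byte by byte. */
static int
compare_sketch_keys (const void *a, const void *b)
{
  const SKETCH_ENTRY *ea = (const SKETCH_ENTRY *) a;
  const SKETCH_ENTRY *eb = (const SKETCH_ENTRY *) b;
  return strcmp (ea->key, eb->key);
}

/* With no collation locale the key is the untransformed string, so the
   result is the same as sorting the original strings with strcmp.
   The caller is responsible for freeing each key. */
void
sort_entries_untransformed (SKETCH_ENTRY *entries, size_t n)
{
  size_t i;

  assert (entries != NULL || n == 0);
  for (i = 0; i < n; i++)
    {
      entries[i].key = strdup (entries[i].text);
      if (!entries[i].key)
        abort ();
    }
  qsort (entries, n, sizeof (SKETCH_ENTRY), compare_sketch_keys);
}
```

Note that in byte order a space sorts before every letter and uppercase
ASCII sorts before lowercase, so "a b" comes before "ab" and "B" comes
before "a".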

I noticed a potential problem:

static void
set_sort_key (locale_t collation_locale, const char *input_string,
              char **result_key)
{
  if (collation_locale)
    {
  #ifdef HAVE_STRXFRM_L
      size_t len = strxfrm_l (0, input_string, 0, collation_locale);
      size_t check_len;

      *result_key
        = (char *) malloc ((len +1) * sizeof (char));
      check_len = strxfrm_l (*result_key, input_string, len+1,
                             collation_locale);
      if (check_len != len)
        fatal ("strxfrm_l returns a different length");
  #endif
    }
  else
    *result_key = strdup (input_string);
}

It looks like *result_key is not set if HAVE_STRXFRM_L is not defined.
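
One possible way to close that gap, as a sketch only: initialise
*result_key and fall back to a plain copy whenever no transformed key
was produced.  The name set_sort_key_sketch is made up, and falling
back instead of calling fatal on a length mismatch is my assumption
about what would be acceptable, not what the current code does.

```c
#define _POSIX_C_SOURCE 200809L   /* for strdup, locale_t, strxfrm_l */
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* HAVE_STRXFRM_L would normally come from configure.  It is left
   undefined here, so this sketch exercises the fallback path. */

void
set_sort_key_sketch (locale_t collation_locale, const char *input_string,
                     char **result_key)
{
  assert (input_string != NULL && result_key != NULL);
  *result_key = NULL;

#ifdef HAVE_STRXFRM_L
  if (collation_locale)
    {
      /* First call only measures the transformed length. */
      size_t len = strxfrm_l (0, input_string, 0, collation_locale);
      char *key = (char *) malloc (len + 1);

      if (key && strxfrm_l (key, input_string, len + 1,
                            collation_locale) == len)
        *result_key = key;
      else
        free (key);
    }
#else
  (void) collation_locale;   /* no strxfrm_l: the locale cannot be used */
#endif

  /* Reached when there is no locale, no strxfrm_l, or the
     transformation failed: use the untransformed string. */
  if (!*result_key)
    *result_key = strdup (input_string);
}
```

With this shape the caller always gets a key (or NULL only when memory
is exhausted), whether or not strxfrm_l is available.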

> > 2) is two different types of sorting; as you said earlier, the
> > sorting in C may have a different treatment of "variable elements".
> > The first would be accessed with USE_UNICODE_COLLATION=1 and the second
> > with COLLATION_LOCALE=en_US.UTF-8 or possibly COLLATION_LOCALE=en_US.
> > Using strxfrm with en_US would not be the default because of the handling
> > of spaces and also because the interface isn't very portable.
> 
> So, what you mean here is that with USE_UNICODE_COLLATION=1 and no
> COLLATION_LOCALE, C code should call the perl code that sorts indices using
> Unicode::Collate instead of doing the sorting in C.  Did I get it right?

Yes, that's right.  Since there's no C implementation of Unicode
collation that matches how we use Unicode::Collate to do it, then
Unicode::Collate should be the default.


> If COLLATION_LOCALE is set, in C strxfrm_l would be used to do the
> string transformation and sorting.
> 
> If COLLATION_LOCALE is set in Perl, it is not clear to me what would be
> the output.  Would it be ignored?

If by "set in Perl" you mean in an output converter module that is written
in Perl, then we should try to honour the variable and sort exactly as
specified in the stated locale.  This would probably be done by calling
into C code to do it, or doing it in Perl somehow (perhaps with "use locale"
and "cmp", if that actually works).  If it is not possible to do it in
Perl, and XS modules are not available, it is not a big deal: we just
print a warning message saying that sorting according to a locale's
rules is not available.
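
On the C side, the same "warn and carry on" behaviour could look
roughly like the following.  This is a sketch under assumptions: the
function name and the wording of the message are invented, and a null
return is meant to tell the caller to sort without a locale.

```c
#define _POSIX_C_SOURCE 200809L   /* for locale_t, newlocale */
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Try to open the locale named by COLLATION_LOCALE for collation only.
   Returns (locale_t) 0 and prints a warning if the system does not
   provide it.  The caller releases a non-null result with freelocale. */
locale_t
open_collation_locale_sketch (const char *collation_locale_name)
{
  locale_t loc;

  assert (collation_locale_name != NULL);
  loc = newlocale (LC_COLLATE_MASK, collation_locale_name, (locale_t) 0);
  if (!loc)
    fprintf (stderr,
             "warning: sorting according to locale `%s' is not available\n",
             collation_locale_name);
  return loc;
}
```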


> 
> The advantage I see of your proposal is that we would never need to
> select a specific locale, as is done currently with en_US.UTF-8.
> The downside is that in most cases, the users will not get the speed
> increase of using C, as it requires knowing about COLLATION_LOCALE
> which is likely to remain relatively obscure.  This downside is not
> problematic right now, as Perl is more correct.

Yes, that's why I propose keeping the Perl implementation as the default.

> However, if there is a
> possibility to get variable elements set to "non-ignorable" in C,
> possibly by using a hardcoded locale of en_US, it will not be possible to
> get automatically both the correct and more rapid option.  The user
> would still have to set COLLATION_LOCALE to get it.

If this is possible, then we silently switch to using the C sorting 
if we can detect that we can treat variable elements in such a way.
This would be the "default", non-tailored collation.

> So, even if it is
> practical in the short term, I wonder if we should not already plan for
> a future in which C would be both correct and more rapid, but still with
> a cumbersome interface that requires setting a specific locale.
> 
> > 3) and 4) are again potentially different between C and pure
> > Perl.  I propose that COLLATION_LOCALE would be used for accessing
> > system locales (with strxfrm or strcoll in C, but in theory this is
> > language-independent.)  COLLATION_LANGUAGE would be an argument to use
> > for Unicode::Collate::Locale to get language-specific tailoring, which
> > in language-independent terms means to use the UCA with tailoring, with
> > variable collation elements treated as "non-ignorable".  If there is
> > ever a separate implementation of the UCA in texi2any with access to
> > tailoring, COLLATION_LANGUAGE would govern it as well.
> 
> If I understand well, COLLATION_LANGUAGE would only change what is done
> in perl, with Unicode::Collate::Locale used if COLLATION_LANGUAGE is
> set.  In that case, since perl is called from C if COLLATION_LOCALE
> is not set, COLLATION_LANGUAGE would apply to C because it calls Perl
> unless COLLATION_LOCALE is set.

Yes, if I understand correctly: a C converter would use the Perl module
for collation as we don't have an equivalent available in C.
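
Put together, the selection being discussed could be sketched as below.
All the type and function names are hypothetical.  The precedence of
COLLATION_LOCALE over COLLATION_LANGUAGE follows what is said above;
where USE_UNICODE_COLLATION=0 sits relative to the other two has not
been settled in this thread, so its position here is only an
assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of the three customization variables. */
typedef struct {
  int use_unicode_collation;       /* USE_UNICODE_COLLATION */
  const char *collation_locale;    /* COLLATION_LOCALE, or NULL */
  const char *collation_language;  /* COLLATION_LANGUAGE, or NULL */
} SKETCH_COLLATION_OPTIONS;

typedef enum {
  SKETCH_BYTEWISE,        /* strcmp on untransformed strings */
  SKETCH_SYSTEM_LOCALE,   /* strxfrm_l with the named system locale */
  SKETCH_UCA_TAILORED,    /* Unicode::Collate::Locale, called from C */
  SKETCH_UCA_DEFAULT      /* Unicode::Collate, called from C */
} SKETCH_COLLATION_BACKEND;

SKETCH_COLLATION_BACKEND
choose_collation_backend_sketch (const SKETCH_COLLATION_OPTIONS *opts)
{
  assert (opts != NULL);

  /* An explicit system locale is the only case sorted in C with
     strxfrm_l. */
  if (opts->collation_locale)
    return SKETCH_SYSTEM_LOCALE;
  /* Assumed position: disabling Unicode collation gives byte order. */
  if (!opts->use_unicode_collation)
    return SKETCH_BYTEWISE;
  /* Otherwise the Perl module is used, tailored if a language is set. */
  if (opts->collation_language)
    return SKETCH_UCA_TAILORED;
  return SKETCH_UCA_DEFAULT;
}
```

The point of writing it this way is that the choice depends only on the
customization variables, not on whether the converter is in C or Perl.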

> 
> Note that it is still no clearer to me what would happen with
> COLLATION_LOCALE in the Perl case.
> 
> > For 3), accessing @documentlanguage seems like an unnecessary extra
> > at the moment.  Again, there would be the problem of strxfrm_l and
> > Unicode::Collate::Locale doing different things with variable collation
> > elements.  There is no guarantee that the user has the appropriate
> > locale installed either (for use with strxfrm_l) 
> 
> It seems to me that following @documentlanguage would be more desirable
> than being able to have the user specify a specific COLLATION_LANGUAGE
> (or COLLATION_LOCALE).  Indeed, it seems to me to be more aligned with
> Texinfo, in which information is supposed to come primarily from the
> Texinfo manual.  Also COLLATION_LANGUAGE and COLLATION_LOCALE suffer from
> the same problems that you describe for @documentlanguage based
> customization.  Also, if COLLATION_LANGUAGE and/or COLLATION_LOCALE is
> implemented, it would be very easy to use what comes from @documentlanguage
> instead for any of these user-supplied values, so it is a bit strange
> not to do it.

There wouldn't be any harm in implementing it as an option.  We'd have to
decide whether it went via strxfrm_l, via Unicode::Collate::Locale, or was
configurable to use either.

> Lastly, and more importantly, even if it is implemented later, I think
> that the 'interface' with customization variables should be designed now.
> 
> > or that the language
> > is supported by Unicode::Collate::Locale.
> 
> This is not an issue, if not supported, there is a fallback to the
> default behaviour of Unicode::Collate.

OK.

> > > 6) in C use Perl sorting corresponding to 2).
> > >
> > > Could be named 'perldefault'.
> > 
> > I can't understand what you are proposing here.  Is this not just the
> > same as using Unicode::Collate?  What difference does it make if
> > the Unicode::Collate module is called from C or Perl code?
> 
> Speed and consistency.  My idea was that if C is used, it is supposed to
> be used for everything in the default case, with exceptions only when
> needed and mostly for tests (when TEST=1).  But I have no problem if
> things are done differently.

I think that's confusing.  Just because a converter is written in C
doesn't mean that the indices should be sorted differently.  That's a
detail of the implementation that isn't apparent to the user.


> As a side note, transliteration of file names is also different between C
> and Perl: the Perl function is used if TEST=1, but otherwise the
> results are different if TEXINFO_XS_CONVERT=1.

I don't know what "transliteration of file names" refers to here.  Does
this refer to the --transliterate-file-names option?