bug-texinfo
Re: library for unicode collation in C for texi2any?


From: Gavin Smith
Subject: Re: library for unicode collation in C for texi2any?
Date: Sat, 14 Oct 2023 19:57:22 +0100

On Thu, Oct 12, 2023 at 11:39:14AM +0200, Patrice Dumas wrote:
> Hello,
> 
> There is a translation to C of texi2any code going on, for the future,
> after the next release, mainly for the conversion to HTML in a first
> step.
> 
> One thing I could not find easily in C is something to replace the
> Unicode::Collate perl module for index entries sorting using 'smart'
> rules for sorting, that could be either found in Gnulib, included easily
> in the Texinfo distribution or would be, in general, installed.  Unless
> I missed something, there is no such facility in libunistring, it seems
> to be in libICU, but I do not know how easy it could be
> integrated/shipped with Texinfo and I do not think that it is installed
> in the general case.

It's all in the future, but I am slightly concerned about duplicating
existing system facilities in Texinfo.  An example is avoiding the use of
wcwidth, our use of which depends on setting a UTF-8 locale and on using
the wchar_t type.  Is every program that uses wcwidth supposed to supply
its own implementation instead, and isn't this wasteful?
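
To make the locale dependence concrete, here is a minimal sketch of
measuring display width with the system wcwidth.  The helper name
display_width is made up for illustration and is not Texinfo code; the
result depends entirely on the LC_CTYPE locale in effect when it is called.

```c
#define _XOPEN_SOURCE 700   /* asks glibc to declare wcwidth() */
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of Texinfo: width in screen columns of
   the multibyte string S, decoded according to the current LC_CTYPE
   locale.  Returns -1 if S cannot be decoded in that locale or contains
   a non-printable character. */
int
display_width (const char *s)
{
  mbstate_t state;
  size_t remaining;
  int total = 0;

  assert (s != NULL);
  remaining = strlen (s);
  memset (&state, 0, sizeof state);

  while (remaining > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, s, remaining, &state);
      int w;

      if (n == (size_t) -1 || n == (size_t) -2)
        return -1;              /* invalid or truncated sequence */
      if (n == 0)
        break;                  /* embedded null character */
      w = wcwidth (wc);
      if (w < 0)
        return -1;              /* non-printable in this locale */
      total += w;
      s += n;
      remaining -= n;
    }
  return total;
}
```

A caller would first run something like setlocale (LC_CTYPE, "") or
request a specific UTF-8 locale by name; which locale names exist is
system-dependent, which is exactly the dependency in question.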

https://www.gnu.org/software/gnulib/manual/html_node/Characters.html
may be informative on the drawbacks of wchar_t.

I have seen implementations of wcwidth and they do not look very large,
so it would not be very wasteful of space for every program to reimplement
it using Unicode code points instead, but still in principle it should be
provided by a standard system library.

Doing collation properly is, I believe, more complicated than wcwidth,
as it relies on large tables of code points.

It seems that the code is already there in the C libraries but
only available through setting the locale.

One option is that we require systems to have a UTF-8 locale installed
to get correct output.  (We'd have to find some other solution for
MS-Windows.)
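
For reference, a minimal sketch of what locale-based collation looks like
with the standard C library: set LC_COLLATE, then sort with strcoll.  The
function name sort_index_entries is hypothetical, and the locale name is
whatever the caller passes in (for example "en_US.UTF-8", which may or may
not be installed on a given system).

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort callback: compare two strings under the current LC_COLLATE. */
static int
compare_entries (const void *a, const void *b)
{
  const char *const *sa = a;
  const char *const *sb = b;
  return strcoll (*sa, *sb);
}

/* Hypothetical helper, not part of Texinfo: sort ENTRIES in place using
   the collation rules of LOCALE_NAME.  Returns 1 if that locale could be
   selected, 0 if setlocale failed, in which case the sort falls back to
   whatever collation locale was already in effect. */
int
sort_index_entries (const char **entries, size_t count,
                    const char *locale_name)
{
  int ok = setlocale (LC_COLLATE, locale_name) != NULL;
  qsort (entries, count, sizeof *entries, compare_entries);
  return ok;
}
```

The return value matters here: if the requested locale is missing, the
output order silently changes, so a caller would probably want to warn.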

I don't know if libunistring aspires to become a standard system library
for handling UTF-8 data, but if we use it for other UTF-8 processing it
would make sense to use it for collation as well.

I suggest writing to Bruno Haible to ask if he has plans to include
collation functionality in libunistring in the future.  I am currently
reading through "Unicode Technical Standard #10" and although I don't
understand a lot of it yet, it seems feasible that we could implement it
in C.
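
As a rough illustration of the multi-level comparison that UTS #10
describes, here is a toy sketch.  Everything in it is an assumption made
for illustration: the weights are invented (they are not taken from the
real collation element table), only a handful of code points are covered,
and normalization, contractions, expansions and variable weighting are
all ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy collation element: one code point with three invented weights
   (primary = base letter, secondary = accent, tertiary = case). */
struct toy_weight
{
  uint32_t cp;
  uint32_t w[3];
};

static const struct toy_weight toy_table[] = {
  { 'a',    { 1, 1, 1 } }, { 'A', { 1, 1, 2 } },
  { 'b',    { 2, 1, 1 } }, { 'B', { 2, 1, 2 } },
  { 'e',    { 3, 1, 1 } }, { 'E', { 3, 1, 2 } },
  { 0x00E9, { 3, 2, 1 } },      /* e with acute accent */
};

/* Weight of CP at LEVEL (0..2).  Unknown code points get a made-up
   fallback that orders them after the table, by code point value. */
static uint32_t
weight_at (uint32_t cp, int level)
{
  size_t i;
  for (i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++)
    if (toy_table[i].cp == cp)
      return toy_table[i].w[level];
  return level == 0 ? 1000 + cp : 1;
}

/* Compare two code point arrays: all primary weights first, then
   secondary, then tertiary.  Returns -1, 0 or 1. */
int
toy_collate (const uint32_t *a, size_t alen,
             const uint32_t *b, size_t blen)
{
  int level;
  for (level = 0; level < 3; level++)
    {
      size_t n = alen < blen ? alen : blen;
      size_t i;
      for (i = 0; i < n; i++)
        {
          uint32_t wa = weight_at (a[i], level);
          uint32_t wb = weight_at (b[i], level);
          if (wa != wb)
            return wa < wb ? -1 : 1;
        }
      if (alen != blen)
        return alen < blen ? -1 : 1;  /* a prefix sorts first */
    }
  return 0;
}
```

With these invented weights, the accented "éa" sorts before "eb" because
the accent is only consulted after all base letters compare equal, which
plain code point order would not give.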




