Re: Texinfo 7.0.93 pretest available


From: Eli Zaretskii
Subject: Re: Texinfo 7.0.93 pretest available
Date: Mon, 09 Oct 2023 19:37:55 +0300

> From: Bruno Haible <bruno@clisp.org>
> Cc: bug-texinfo@gnu.org
> Date: Mon, 09 Oct 2023 18:15:05 +0200
> 
> Eli Zaretskii wrote:
> > unless the locale's codeset is UTF-8, any character that is not
> > printable _in_the_current_locale_ will return -1 from wcwidth.  I'm
> > guessing that no one has ever tried to run the test suite in a
> > non-UTF-8 locale before?
> 
> I just tried it now: On Linux (Ubuntu 22.04), in a de_DE.UTF-8 locale,
> texinfo 7.0.93 builds fine and all tests pass.

de_DE.UTF-8 is a UTF-8 locale.  I asked about non-UTF-8 locales.  An
example would be de_DE.ISO8859-1.  Or what am I missing?

> > Yes, quite a few characters return -1 from wcwidth, in particular the
> > ȷ character above (which explains the above difference).
> 
> This character is U+0237 LATIN SMALL LETTER DOTLESS J. It *should* be
> recognized as having a width of 1 in all implementations of wcwidth.

But if U+0237 cannot be represented in the locale's codeset, its width
cannot be 1, because it cannot be printed.  This is my interpretation
of the standard's language (emphasis mine):

  DESCRIPTION

      The wcwidth() function shall determine the number of column
      positions required for the wide character wc. The application
      shall ensure that the value of wc is a character representable
      as a wchar_t, and is a wide-character code corresponding to a
      valid character in the current locale.
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  RETURN VALUE

      The wcwidth() function shall either return 0 (if wc is a null
      wide-character code), or return the number of column positions
      to be occupied by the wide-character code wc, or return -1 (if
      wc does not correspond to a printable wide-character code).
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since U+0237 is not printable in my locale (it isn't supported by the
system codepage), the value -1 is correct.  Am I missing something?

> There's no reason for it to have a width of -1, since it's not a control
> character.
> There's no reason for it to have a width of 0, since it's not a combining
> mark or a non-spacing character.
> There's no reason for it to have a width of 2, since it's not a CJK character
> and not in a Unicode range with many CJK characters.

I think you assume that all the Unicode letter characters are always
printable in every locale.  That's not what I understand, and iswprint
agrees with me, because I get -1 for U+0237 due to this code:

> >       return wc == 0 ? 0 : iswprint (wc) ? 1 : -1;
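
To make the locale dependence concrete, here is a minimal sketch of the
quoted fallback evaluated under a named locale.  The helper name
width_in_locale and its interface are my own invention for illustration,
not anything in Gnulib or Texinfo; only iswprint and setlocale are
standard.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* The fallback quoted above: the result depends only on iswprint(),
   which consults the LC_CTYPE category of the current locale.  */
static int fallback_width(wint_t wc)
{
    return wc == 0 ? 0 : iswprint(wc) ? 1 : -1;
}

/* Hypothetical helper: evaluate fallback_width() under a named locale.
   Returns 0 and stores the width in *width on success, or -1 if the
   locale cannot be selected (for example, it is not installed).  */
static int width_in_locale(const char *locale_name, wint_t wc, int *width)
{
    assert(locale_name != NULL && width != NULL);

    /* setlocale() may reuse its return buffer, so keep a private copy.  */
    const char *current = setlocale(LC_CTYPE, NULL);
    char *saved = current ? malloc(strlen(current) + 1) : NULL;
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    int status = -1;
    if (setlocale(LC_CTYPE, locale_name) != NULL) {
        *width = fallback_width(wc);
        status = 0;
    }

    setlocale(LC_CTYPE, saved);   /* restore the caller's locale */
    free(saved);
    return status;
}
```

Calling width_in_locale("de_DE.ISO8859-1", 0x0237, &w) and then the same
with "de_DE.UTF-8" should show whether the two locales disagree about
U+0237, provided both are installed.  Locale names and the exact results
vary between systems and C libraries.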


> > I don't think the above logic in Gnulib's wcwidth (which basically
> > replicates the logic in any reasonable wcwidth implementation, so is
> > not specific to Gnulib) fits what Texinfo needs.  Texinfo needs to be
> > able to produce output independently of the locale.  What matters to
> > Texinfo is the encoding of the output document, not the locale's
> > codeset.  So I think we should call uc_width when the output document
> > encoding is UTF-8 (which is the default, including in the above test),
> > regardless of the locale's codeset.  Or we could use a simpler
> > approximation:
> > 
> >       return wc == 0 ? 0 : iswcntrl (wc) ? 0 : 1;
> 
> This "simpler approximation" would not return a good result when wc
> is a control character (such as CR, LF, TAB, or such). It is important
> that the caller of wcwidth() or wcswidth() is able to recognize that
> the string as a whole does not have a definite width.

It is still better than returning -1, don't you agree?
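
For reference, the two one-liners can be compared on a whole string with
a small wcswidth-like loop.  This is only a sketch: strict_width,
approx_width and string_width are hypothetical names, and the loop
assumes the usual convention that a single -1 makes the width of the
whole string indefinite.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

typedef int (*char_width_fn)(wint_t);

/* The existing fallback: non-printable characters yield -1.  */
static int strict_width(wint_t wc)
{
    return wc == 0 ? 0 : iswprint(wc) ? 1 : -1;
}

/* The simpler approximation: control characters yield 0, never -1.  */
static int approx_width(wint_t wc)
{
    return wc == 0 ? 0 : iswcntrl(wc) ? 0 : 1;
}

/* Sum the per-character widths of a NUL-terminated wide string.
   Returns -1 as soon as one character has no definite width.  */
static int string_width(const wchar_t *s, char_width_fn width_of)
{
    assert(s != NULL && width_of != NULL);

    int total = 0;
    for (; *s != L'\0'; s++) {
        int w = width_of((wint_t) *s);
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

With the string L"a\tb", string_width reports -1 under strict_width and 2
under approx_width.  So the approximation does hide the TAB from the
caller, which is the objection above, but it never rejects a letter only
because the locale's codeset cannot represent it.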

But for some reason you completely ignored my more general comment
about what Texinfo needs from wcwidth.
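
What I have in mind is roughly the following: choose the width function
from the output document's encoding and never consult the locale.  This
is a sketch under stated assumptions, not a patch.  doc_char_width,
same_name and unicode_width_sketch are hypothetical names, and the tiny
range table is a deliberately incomplete stand-in for a real Unicode
width table such as the one behind uc_width.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Case-insensitive comparison of two ASCII encoding names.  */
static int same_name(const char *a, const char *b)
{
    for (; *a != '\0' && *b != '\0'; a++, b++)
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
    return *a == *b;
}

/* Rough, locale-independent width of a Unicode code point.
   Covers only a few ranges; a real implementation needs full tables.  */
static int unicode_width_sketch(uint32_t cp)
{
    if (cp == 0)
        return 0;
    if (cp < 0x20 || (cp >= 0x7F && cp < 0xA0))
        return -1;                      /* C0 and C1 control characters */
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;                       /* combining diacritical marks */
    if ((cp >= 0x1100 && cp <= 0x115F)
        || (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F)
        || (cp >= 0xAC00 && cp <= 0xD7A3)
        || (cp >= 0xF900 && cp <= 0xFAFF)
        || (cp >= 0xFF00 && cp <= 0xFF60))
        return 2;                       /* common East Asian wide ranges */
    return 1;
}

/* Returns 1 and stores the width in *width when the document encoding
   is UTF-8; returns 0 to tell the caller to use its existing
   locale-based path for any other encoding.  */
static int doc_char_width(uint32_t cp, const char *doc_encoding, int *width)
{
    assert(doc_encoding != NULL && width != NULL);

    if (!same_name(doc_encoding, "UTF-8"))
        return 0;
    *width = unicode_width_sketch(cp);
    return 1;
}
```

With a UTF-8 output document this gives U+0237 a width of 1 whatever the
locale's codeset is, which is the behavior the test suite expects.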


