bug-gnulib

Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS


From: Bruno Haible
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 00:43:42 +0200
User-agent: KMail/5.1.3 (Linux/4.4.0-130-generic; KDE/5.18.0; x86_64; ; )

Hi Pádraig,

> I've attached a gnulib patch to document for iscntrl at least.

> +This function does not support arguments outside of the range of the
> +unsigned char type in locales with large character sets, on some platforms.
> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.

In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl().

POSIX [1] says
  "The c argument is a type int, the value of which the application shall
   ensure is a character representable as an unsigned char or equal to the
   value of the macro EOF. If the argument has any other value, the behavior
   is undefined."

The term "character" is defined here [2]:
  "A sequence of one or more bytes representing a single graphic symbol or
   control code."

So, in a UTF-8 locale, a "character representable as an unsigned char"
is a byte sequence of length 1, where the single byte has a value in the
range 0x00..0x7F.

For invalid values "the behavior is undefined." You were expecting a value 0.
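
As a minimal sketch of that reading (the name valid_ctype_arg_utf8 is
hypothetical, not a gnulib or POSIX function), a caller in a UTF-8 locale
could check an argument before handing it to iscntrl():

```c
#include <stdbool.h>
#include <stdio.h>   /* EOF */

/* Hypothetical helper: under the interpretation above, the only valid
   arguments to iscntrl() in a UTF-8 locale are EOF and the single-byte
   characters 0x00..0x7F.  */
static bool
valid_ctype_arg_utf8 (int c)
{
  return c == EOF || (c >= 0x00 && c <= 0x7F);
}
```

Anything for which this returns false falls into the "behavior is undefined"
case, so no particular return value from iscntrl() can be relied on.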

Now, in the gnulib documentation, what we mention as portability problems
are the cases where
  - the behaviour for valid arguments is different on different platforms, or
  - the boundary between valid and invalid arguments is fuzzy and depends on
    the platform.
IMO there's no point in documenting that a function _really_ has undefined
behaviour when POSIX says that it has undefined behaviour.

> I've also attached an alternative patch for df (in your name).

This patch is correct (because the characters that you test for in c_iscntrl
are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
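
For illustration, a locale-independent test for exactly that set of bytes
could look roughly like this. The name ascii_iscntrl is a hypothetical
stand-in; it is not the gnulib implementation of c_iscntrl.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an ASCII-only control character test:
   true for 0x00..0x1F and 0x7F, false for everything else, regardless
   of the current locale.  */
static bool
ascii_iscntrl (int c)
{
  return (c >= 0x00 && c <= 0x1F) || c == 0x7F;
}
```

Because none of these byte values appears as a trailing byte in the encodings
listed above, applying such a test byte by byte cannot split a multibyte
character.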

But it does not catch control characters outside of the ASCII range. It would
make sense to catch these as well. If you want to do that,
'hide_problematic_chars' needs to be rewritten as a loop that iterates across
the multibyte characters. For example with the 'mbiter' module, in
combination with the mb_iscntrl function from the 'mbchar' module. Or
directly with mbrtowc() and iswcntrl().
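
A rough sketch of the mbrtowc() and iswcntrl() variant follows. The function
name replace_control_chars, the choice of '?' as the replacement, and the
handling of invalid byte sequences are assumptions made for illustration; this
is not the actual df code. It also assumes the caller has already selected the
locale with setlocale().

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: replace every control character in CELL, whether
   single-byte or multibyte, with a single '?'.  Works in place; the
   result is never longer than the input.  */
static char *
replace_control_chars (char *cell)
{
  mbstate_t state;
  char *src = cell;
  char *dst = cell;
  size_t remaining = strlen (cell);

  memset (&state, 0, sizeof state);
  while (remaining > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, src, remaining, &state);

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or incomplete sequence: replace one byte, reset the
             conversion state, and try again at the next byte.  */
          *dst++ = '?';
          src++;
          remaining--;
          memset (&state, 0, sizeof state);
          continue;
        }
      if (n == 0)
        break;                  /* Not expected: strlen excluded the NUL.  */

      if (iswcntrl ((wint_t) wc))
        *dst++ = '?';           /* One '?' per control character.  */
      else
        {
          memmove (dst, src, n);  /* Keep the character's bytes as-is.  */
          dst += n;
        }
      src += n;
      remaining -= n;
    }
  *dst = '\0';
  return cell;
}
```

The 'mbiter' and 'mbchar' modules would hide most of the mbrtowc()
bookkeeping; the loop structure stays the same.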

Bruno

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/iscntrl.html
[2] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87


