tinycc-devel

Re: [Tinycc-devel] BUG: wide char in wide string literal handled incorrectly


From: 张博洋
Subject: Re: [Tinycc-devel] BUG: wide char in wide string literal handled incorrectly
Date: Sun, 3 Sep 2017 17:56:57 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1

Hello,

Decoding UTF-8 is not that hard. The codespace was limited to 0..0x10FFFF in 2003 (reference: https://en.wikipedia.org/wiki/UTF-8), so a well-formed UTF-8 sequence for a single character is at most 4 bytes long. The latest Unicode Standard says the same thing.

MB_LEN_MAX is the maximum across all locales, so its value may be higher than 4. That doesn't matter here, because we are focused on UTF-8.

I downloaded the Unicode 10.0 Core Specification from http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf and refactored my code in an easy-to-verify way. The UTF-8 related content is in Section 3.9 (page 125). In the new code, the well-formedness checks can be verified against Table 3-7, as sketched below. I also provided 'test-ill-formed.c' covering Table 3-8. I believe the refactored code handles all valid UTF-8 sequences.
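The core of those checks looks roughly like this (a minimal sketch assuming a NUL-terminated buffer; the real code is in the attached utf8-new-refactored.patch):

/* Decode one UTF-8 character at s into *out, enforcing the
   Table 3-7 ranges; returns bytes consumed, or 0 if ill-formed.
   A NUL-terminated buffer makes the lookahead reads safe: a
   premature '\0' fails the continuation-byte checks. */
static int utf8_decode_one(const unsigned char *s, unsigned int *out)
{
    unsigned int c = s[0];

    if (c <= 0x7F) {                          /* U+0000..U+007F  */
        *out = c;
        return 1;
    }
    if (c >= 0xC2 && c <= 0xDF) {             /* U+0080..U+07FF  */
        if ((s[1] & 0xC0) != 0x80)
            return 0;
        *out = ((c & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if (c >= 0xE0 && c <= 0xEF) {             /* U+0800..U+FFFF  */
        /* Per Table 3-7: E0 needs A0..BF (no overlong forms),
           ED needs 80..9F (no surrogates). */
        unsigned char lo = (c == 0xE0) ? 0xA0 : 0x80;
        unsigned char hi = (c == 0xED) ? 0x9F : 0xBF;
        if (s[1] < lo || s[1] > hi || (s[2] & 0xC0) != 0x80)
            return 0;
        *out = ((c & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if (c >= 0xF0 && c <= 0xF4) {             /* U+10000..U+10FFFF */
        /* Per Table 3-7: F0 needs 90..BF (no overlong forms),
           F4 needs 80..8F (nothing above 0x10FFFF). */
        unsigned char lo = (c == 0xF0) ? 0x90 : 0x80;
        unsigned char hi = (c == 0xF4) ? 0x8F : 0xBF;
        if (s[1] < lo || s[1] > hi ||
            (s[2] & 0xC0) != 0x80 || (s[3] & 0xC0) != 0x80)
            return 0;
        *out = ((c & 0x07) << 18) | ((s[1] & 0x3F) << 12)
             | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;   /* C0, C1, F5..FF, or a lone continuation byte */
}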


Zhang Boyang

On 2017-09-03 13:50, Christian Jullien wrote:
Managing UTF-8 (and Unicode) correctly on all platforms is a nightmare. I did
it only partially for my Lisp.
It's hard to say whether your code is correct, but I have the impression it
is not, since you don't use MB_LEN_MAX or MB_CUR_MAX. Hence you don't handle
all possible multibyte character lengths.
There is a system-dependent constant named MB_LEN_MAX that tells you the
maximum number of bytes in a multibyte character (see for example
http://man7.org/linux/man-pages/man3/MB_LEN_MAX.3.html).
As you can read there, it must be used together with MB_CUR_MAX, a
locale-dependent value. With the "most common" locales you can live with 5 to
6 bytes, but I'm discovering that MB_LEN_MAX is now 16 on Linux!!!

From Linux <limits.h>:

/* Maximum length of any multibyte character in any locale.
    We define this value here since the gcc header does not define
    the correct value.  */
#define MB_LEN_MAX      16

From VC++ 14:

#define MB_LEN_MAX    5             // max. # bytes in multibyte char

The ISO C standard defines two macros that provide this information.
Macro: int MB_LEN_MAX
MB_LEN_MAX specifies the maximum number of bytes in the multibyte sequence for 
a single character in any of the supported locales. It is a compile-time 
constant and is defined in limits.h.
Macro: int MB_CUR_MAX
MB_CUR_MAX expands into a positive integer expression that is the maximum 
number of bytes in a multibyte character in the current locale. The value is 
never greater than MB_LEN_MAX. Unlike MB_LEN_MAX this macro need not be a 
compile-time constant, and in the GNU C Library it is not.
MB_CUR_MAX is defined in stdlib.h.
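In practice, decoding in a locale-correct way means calling mbrtowc() with an MB_CUR_MAX-sized window instead of hard-coding UTF-8 lengths; roughly like this (an illustrative sketch, not code from tcc or the patch):

#include <stdlib.h>     /* MB_CUR_MAX */
#include <string.h>     /* memset */
#include <wchar.h>      /* mbrtowc, mbstate_t */

/* Decode one multibyte character from s (at most n bytes) in the
   current locale; returns bytes consumed, or 0 on error/end. */
size_t next_wchar(const char *s, size_t n, wchar_t *out)
{
    mbstate_t st;
    size_t len = n < (size_t)MB_CUR_MAX ? n : (size_t)MB_CUR_MAX;
    size_t r;

    memset(&st, 0, sizeof st);
    r = mbrtowc(out, s, len, &st);
    if (r == (size_t)-1 || r == (size_t)-2)
        return 0;               /* invalid or incomplete sequence */
    if (r == 0)
        return 1;               /* decoded the NUL terminator */
    return r;
}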



If it helps, you can adapt and use:

/*
  * Returns the number of bytes needed to store the multibyte
  * character whose lead byte is c.
  */

#define OLMBCLEN_USES_TABLE

#if     defined( OLMBCLEN_USES_TABLE )
static const unsigned char olbytesForUTF8[256] = {
        /* ASCII 7bit char         -> 0xxxxxxx */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 00 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 10 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 20 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 30 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 40 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 50 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 60 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 70 */
        /* invalid UTF-8 char      -> 10xxxxxx */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 80 */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 90 */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* A0 */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* B0 */
        /* (c & 0xE0) == 0xC0      -> 110xxxxx */
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* C0 */
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* D0 */
        /* (c & 0xF0) == 0xE0      -> 1110xxxx */
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* E0 */
        /* (c & 0xF8) == 0xF0      -> 11110xxx */
#if     (OLMB_LEN_MAX == 4)
        4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0  /* F0 */
#else
        4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0  /* F0 */
#endif
};

size_t
olmbclen( int c )
{
        return( (size_t)olbytesForUTF8[ c & 0xFF ] );
}

#else
size_t
olmbclen( int c )
{
        if ((c & 0x80) == 0x00) {
                return( 1 );
        } else  if( (c & 0xE0) == 0xC0) {
                return( 2 );
        } else  if( (c & 0xF0) == 0xE0 ) {
                return( 3 );
        } else  if( (c & 0xF8) == 0xF0) {
                return( 4 );
#if     (OLMB_LEN_MAX > 4)
        } else  if ((c & 0xFC) == 0xF8) {
                return( 5 );
#endif
#if     (OLMB_LEN_MAX > 5)
        } else  if ((c & 0xFE) == 0xFC) {
                return( 6 );
#endif
        }

        return( 0 );
}
#endif
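
A typical caller of olmbclen() would look something like this (hypothetical usage, assuming a NUL-terminated buffer; the continuation bytes still need their own validation):

{
        const unsigned char *p = (const unsigned char *)buf;

        while (*p) {
                size_t n = olmbclen(*p);

                if (n == 0) {
                        /* invalid lead byte: skip or report an error */
                        p++;
                        continue;
                }
                /* p points at the first byte of an n-byte character */
                p += n;
        }
}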

-----Original Message-----
From: Tinycc-devel [mailto:address@hidden] On Behalf Of 张博洋
Sent: Saturday, 2 September 2017 19:12
To: address@hidden
Subject: Re: [Tinycc-devel] BUG: wide char in wide string literal handled
incorrectly

Hello,

Here is the new patch, which fixes the UTF-16 truncation problem on Windows.

Zhang Boyang



On 2017-09-01 19:50, Christian JULLIEN wrote:
Given the platforms tcc supports, I think you can assume wchar_t uses 2 bytes
on Windows and 4 bytes on all other platforms (I'm not totally sure, but I
think you can force wchar_t to be 2 bytes on macOS).
I've never heard of another implementation of wchar_t (I don't recall how
z/OS encodes wchar_t, but I doubt anyone will port tcc to that system, which
still uses EBCDIC natively).
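
If tcc wants to rely on that, the assumption can be checked at compile time with the classic negative-array-size trick (a hypothetical guard, not something tcc currently does):

#include <wchar.h>

/* Compilation fails here if wchar_t is neither 2 nor 4 bytes wide. */
typedef char wchar_t_is_2_or_4_bytes[
        (sizeof(wchar_t) == 2 || sizeof(wchar_t) == 4) ? 1 : -1];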


On 1 September 2017 at 11:02 (GMT +02:00), "张博洋" <address@hidden> wrote to
"address@hidden", Subject: Re: [Tinycc-devel] BUG: wide char in wide string
literal handled incorrectly


Hello,

Thanks for your reply.

My assumptions apply only to wide string literals. For plain string literals,
both the original tcc and my patched tcc copy the bytes as-is. For wide
strings, the original tcc reads each byte and casts it to wchar_t, while my
patched tcc decodes the bytes as UTF-8 sequences.

After some consideration, I found the assumption I made was "wide string
literals are written in UTF-8, and wchar_t is always UTF-32". That leads to
two problems. First, the encoding of a wide string literal is necessarily the
encoding of the source file, which might not be UTF-8; this will cause the
problems you mentioned. Second, wchar_t is not always UTF-32: it is UTF-16 on
Microsoft Windows, so some characters, like emoji, get corrupted by value
truncation.
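
Fixing that means emitting a UTF-16 surrogate pair on Windows instead of truncating; the arithmetic is roughly this (an illustrative sketch, not the exact code in the patch):

/* Encode code point cp (<= 0x10FFFF, not a surrogate) as UTF-16;
   returns the number of 16-bit units written (1 or 2). */
static int to_utf16(unsigned int cp, unsigned short out[2])
{
    if (cp <= 0xFFFF) {                  /* BMP: one unit */
        out[0] = (unsigned short)cp;
        return 1;
    }
    cp -= 0x10000;                       /* 20 bits remain */
    out[0] = (unsigned short)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (unsigned short)(0xDC00 | (cp & 0x3FF)); /* low surrogate  */
    return 2;
}

For example, the emoji U+1F600 becomes the pair D83D DE00 instead of the truncated value F600.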

Although there are problems, once the second problem is fixed (which is
easy), my patched tcc will always behave at least as well as the original
tcc: anything that breaks with the patch also breaks with the original. I
provided a table in the attachments describing every situation and the
corresponding behavior.

The ideal solution is to provide charset options, as you mentioned. After
searching the internet, I found that there are 3 command-line options (in
GCC) that control character encoding:
-fexec-charset=charset
-fwide-exec-charset=charset
-finput-charset=charset
To make these features work correctly, tcc must do two conversions:
(1) convert every plain string literal from input-charset to exec-charset;
(2) convert every wide string literal from input-charset to
wide-exec-charset (a sketch of such a conversion follows).
However, providing these features requires an external library like iconv,
and doing this might make the Tiny C Compiler not so tiny.
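
For what it's worth, each conversion would be a thin wrapper around iconv(3) (a hedged sketch assuming a POSIX system; the charset names in the comment are examples):

#include <iconv.h>

/* Convert srclen bytes of src between the named charsets; returns
   the number of bytes written to dst, or (size_t)-1 on error. */
static size_t convert_charset(const char *from, const char *to,
                              char *src, size_t srclen,
                              char *dst, size_t dstlen)
{
    iconv_t cd = iconv_open(to, from);   /* e.g. "UTF-8" -> "UTF-32LE" */
    char *in = src, *out = dst;
    size_t inleft = srclen, outleft = dstlen;
    size_t r;

    if (cd == (iconv_t)-1)
        return (size_t)-1;               /* conversion not supported */
    r = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return (size_t)-1;               /* ill-formed input or no room */
    return dstlen - outleft;             /* bytes written to dst */
}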

My questions are:
(1) Is wchar_t always either UTF-32 or UTF-16 on all platforms?
(2) Should we provide full charset support using external libraries?


Thanks
Zhang Boyang



On 2017-09-01 11:54, Christian Jullien wrote:
> Hello,
>
> I'm not sure you can assume that a character having code >= 0x80 is
> part of UTF-8. Beyond what is called the "basic character set", which
> is globally 7-bit ASCII, there is the "extended character set", which
> is implementation-defined.
>
> For example, the euro sign € may be part of 8859-15 and perfectly
> well encoded on 8 bits as 0xA4; see
> https://en.wikipedia.org/wiki/ISO/IEC_8859-15
>
> Microsoft VC++ has the following flags:
>
> /utf-8                 set source and execution character set to UTF-8
> /validate-charset[-]   validate UTF-8 files for only legal characters
>
> That controls how source code is encoded.
>
> gcc (more specifically cpp, the C preprocessor) processes source
> files using UTF-8 but, like VC++, has a flag to control the input
> charset:
>
>         -finput-charset=charset
>             Set the input character set, used for translation from the
>             character set of the input file to the source character set
>             used by GCC.  If the locale does not specify, or GCC cannot
>             get this information from the locale, the default is UTF-8.
>             This can be overridden by either the locale or this
>             command-line option.  Currently the command-line option
>             takes precedence if there's a conflict.  charset can be any
>             encoding supported by the system's "iconv" library routine.
>
> Now, tcc should be compatible with both. I mean:
>
> - The native Windows tcc port should NOT assume characters are UTF-8
>   encoded, and a -utf-8 flag should change this behavior
>   (+ -finput-charset=xxx for gcc compatibility)
> - Other ports (I mean Linux & alt.) should assume characters are UTF-8
>   encoded, and a -finput-charset=xxx flag should change this behavior
>   (+ -utf-8 for VC++ compatibility)
>
> To summarize, we should add both -utf-8 and -finput-charset=xxx
> support and set the default behavior based on the native port.
>
> Wdyt?
>
> Christian
>
>
> -----Original Message-----
> From: Tinycc-devel [mailto:address@hidden] On Behalf Of 张博洋
> Sent: Wednesday, 30 August 2017 09:31
> To: address@hidden
> Subject: [Tinycc-devel] BUG: wide char in wide string literal handled
> incorrectly
>
> Hello,
>
>     I found that when TCC processes a wide string literal, it behaves
> as if it casts each char of the original file directly to wchar_t and
> stores it in the wide string. This works for ASCII chars. However, it
> does not work for real wide chars. For example:
>     The euro sign (€, U+20AC) stored in UTF-8 is "E2 82 AC". In GCC,
> this char stored in a wide string becomes "000020AC". In TCC, however,
> it is stored as 3 wide chars: "000000E2 00000082 000000AC".
>     I provided a patch, a test program and two screenshots that
> describe this problem; they are in the attachments. I solve the
> problem by assuming that the input charset is UTF-8. Although it's not
> a perfect solution, it's still better than "directly casting char to
> wchar_t". I'm wondering whether that is appropriate, so please review
> the code carefully.
>
> Thanks
> Zhang Boyang
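
The quoted Euro-sign behavior is easy to reproduce with a tiny program along these lines (a hypothetical reduction, not one of the attached files; the source must be saved as UTF-8):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const wchar_t ws[] = L"€";   /* the bytes "E2 82 AC" in the file */
    size_t i;

    /* Dump each wchar_t element: gcc prints "000020AC", while
       unpatched tcc prints "000000E2 00000082 000000AC". */
    for (i = 0; i < sizeof ws / sizeof ws[0] - 1; i++)
        printf("%08X ", (unsigned int)ws[i]);
    printf("\n");
    return 0;
}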






--
张博洋 (Zhang Boyang) - Fudan University, Computer Science and Technology, class of 2014
My phone: 18600020982
My website: http://www.zbyzbyzby.com


_______________________________________________
Tinycc-devel mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/tinycc-devel

Attachment: test-emoji.c
Description: Text Data

Attachment: test-ill-formed.c
Description: Text Data

Attachment: utf8-new-refactored.patch
Description: Text Data

