coreutils

Re: multibyte support (round 3)


From: Pádraig Brady
Subject: Re: multibyte support (round 3)
Date: Mon, 19 Sep 2016 14:25:24 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

On 19/09/16 07:11, Assaf Gordon wrote:
> Hello,
> 
> Updated patch attached.
> 
> Improvements from last time ( 
> http://lists.gnu.org/archive/html/coreutils/2016-09/msg00011.html ):
> 
> 1. 'multibyte' and 'mbbuffer' are in gl/ , behave more like gnulib modules.
> Tests cover all items mentioned in Markus Kuhn's UTF-8 decoder page
> (https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt).
> 
> 2. cygwin/UTF-16 surrogates are handled transparently in 'mbbuffer'.
> Applications under cygwin see 'ucs4_t' and don't need to worry about 
> surrogates (but, wcwidth() will present some problem). Tests ensure parsing 
> under cygwin behaves like other systems.
> 
> 3. 'cut' supports multibyte '-c' and '-n -b' (but not multibyte '-d' yet).
> Some tests included.

Very nice work, especially on the tests.

A general point is I'm still of the opinion that
it would be better to have all conversion and checking
in unorm(1), thus simplifying/optimizing the checking/processing
in all other utils.  It would be good to get an idea
of performance/overhead as the patches progress.

I'm thinking this could be merged for the next major
version of coreutils, which would follow the
next minor release, hopefully due
in the next few weeks.

A couple of points on very quick review:

is_utf8_locale_name()
  In gnulib, "UTF-8" is documented in the comments as the only
  variant that needs to be checked for in the
  return value of locale_charset()
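
  To illustrate the point above, a minimal sketch of the simplified
  check. The name charset_is_utf8() is hypothetical and not from the
  patch; in real code the argument is assumed to come from gnulib's
  locale_charset(), which returns a canonical charset name, so
  spellings such as "utf8" would not need separate handling.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not part of the patch): return true if CHARSET
   is the canonical UTF-8 name.  CHARSET is assumed to be the value
   returned by gnulib's locale_charset(), so a single exact comparison
   is enough and other spellings are deliberately rejected.  */
bool
charset_is_utf8 (const char *charset)
{
  return charset != NULL && strcmp (charset, "UTF-8") == 0;
}
```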

is_valid_mb_character()
  There can be invalid characters in many single-byte locales:
  For example 81,8d,8f,90,9d in cp1252, as shown by:
  recode -lh cp1252 | grep -C1 -F "     "
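
  As a rough sketch of what a single-byte validity check could look
  like for this one charset: the function name is hypothetical, and the
  list of unassigned bytes assumes the common CP1252 mapping table in
  which 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no character. A real
  implementation would more likely ask the C library for each locale
  rather than hard-code a table.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of the patch): return true if byte B
   maps to a character in CP1252.  Assumes the usual mapping table,
   where exactly five byte values are left unassigned.  */
bool
cp1252_byte_is_defined (unsigned char b)
{
  switch (b)
    {
    case 0x81: case 0x8D: case 0x8F: case 0x90: case 0x9D:
      return false;   /* unassigned in CP1252 */
    default:
      return true;    /* every other byte value maps to a character */
    }
}
```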

mbtowc_utf16()
  The assert() triggers -Werror=suggest-attribute=noreturn
  when not on cygwin, so it's better to avoid compiling
  that function altogether on other platforms, i.e., put
  #ifdef HAVE_UTF16_SURROGATES around the function definition
  (and the declaration, to get a compile error rather than a link error)
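
  A sketch of the guard layout suggested above. HAVE_UTF16_SURROGATES
  is the macro named in this review; combine_surrogates() and
  utf16_surrogates_compiled() are hypothetical names used only for
  illustration, not functions from the patch. With both the
  declaration and the definition inside the #ifdef, a stray call on a
  non-cygwin build fails at compile time rather than at link time, and
  no assert-only function body is left to trigger the noreturn
  suggestion.

```c
#include <stdbool.h>
#include <stdint.h>

#ifdef HAVE_UTF16_SURROGATES
/* Declaration and definition are both guarded, so the function does
   not exist at all on platforms without UTF-16 surrogates.  */
uint32_t combine_surrogates (uint16_t hi, uint16_t lo);

/* Combine a high/low surrogate pair into one code point.  The caller
   is assumed to have validated both halves already.  */
uint32_t
combine_surrogates (uint16_t hi, uint16_t lo)
{
  return 0x10000 + (((uint32_t) (hi - 0xD800) << 10)
                    | (uint32_t) (lo - 0xDC00));
}
#endif

/* Always compiled: report whether the surrogate path was built in.  */
bool
utf16_surrogates_compiled (void)
{
#ifdef HAVE_UTF16_SURROGATES
  return true;
#else
  return false;
#endif
}
```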

thanks!
Pádraig



