multibyte support (round 4)

coreutils

From:	Assaf Gordon
Subject:	multibyte support (round 4)
Date:	Sat, 8 Apr 2017 04:58:13 -0400

Hello,

I think that we've handled the low-hanging fruit (e.g. expand/cut/fold) when it
comes to multibyte support in coreutils.
The remaining programs (e.g. sort, join, uniq, tr, od) present some challenges,
both in deciding what the 'correct' (and useful) behavior is
and in implementing it.

I also think a common thread is the combination of these three requirements:
1. Invalid sequences must be handled as single bytes.
2. We can't rely on the native wchar_t (e.g. on Cygwin, where it is only
   16 bits wide) without extra work.
3. We can't assume UTF-8 (or even Unicode).

Each requirement by itself is not too problematic - but combined
they make a portable and efficient implementation quite cumbersome.
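
To make the first requirement concrete, here is a minimal sketch in C of a
decoding loop that falls back to a single byte whenever the locale's decoder
rejects the input. The helper name count_chars is hypothetical and not taken
from coreutils. It uses only the standard mbrtowc, so it works in whatever
encoding the locale selects (requirement 3), but it also shows the catch in
requirement 2: it is only as good as the platform's wchar_t.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from coreutils): count the "characters" in
   buf[0..len), treating each byte of an invalid or truncated multibyte
   sequence as a character of its own.  */
static size_t
count_chars (const char *buf, size_t len)
{
  mbstate_t state;
  size_t count = 0;
  size_t i = 0;

  memset (&state, 0, sizeof state);
  while (i < len)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, buf + i, len - i, &state);

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid (-1) or incomplete (-2) sequence: consume exactly
             one byte and reset the conversion state before retrying.  */
          n = 1;
          memset (&state, 0, sizeof state);
        }
      else if (n == 0)
        n = 1;  /* A NUL byte decodes to L'\0' with a return value of 0.  */

      i += n;
      count++;
    }
  return count;
}
```

A caller would normally run setlocale (LC_ALL, "") first so that mbrtowc
follows the user's encoding. Without that call the program stays in the "C"
locale, and every byte counts as one character.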

I'd like to ask a heretical question:
what if we could relax these requirements?
Specifically, what if we could agree that on systems where wchar_t
is not sufficient, we only support UTF-8 (and thus use gnulib's internal fast
implementations)?
(I would love to suggest supporting only UTF-8 everywhere, but I'm sure this
would not be accepted...)
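
As a rough illustration of why a UTF-8-only path is attractive, the sketch
below decodes UTF-8 directly from bytes, with no locale and no wchar_t
involved, while still handing invalid input back one byte at a time. This is
not gnulib's code or API: the name utf8_decode, its signature, and the "valid"
flag are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not gnulib's API: decode one UTF-8 sequence from
   s[0..len), where len > 0.  Store the code point in *cp and return the
   number of bytes consumed.  On an invalid or truncated sequence, store
   the raw byte value in *cp, set *valid to 0, and return 1.  */
static size_t
utf8_decode (const unsigned char *s, size_t len, uint32_t *cp, int *valid)
{
  unsigned char b = s[0];
  size_t need;       /* number of continuation bytes expected */
  uint32_t c, min;   /* decoded value, and smallest value for this length */

  *valid = 1;
  if (b < 0x80)
    {
      *cp = b;
      return 1;
    }
  else if ((b & 0xE0) == 0xC0) { need = 1; c = b & 0x1F; min = 0x80; }
  else if ((b & 0xF0) == 0xE0) { need = 2; c = b & 0x0F; min = 0x800; }
  else if ((b & 0xF8) == 0xF0) { need = 3; c = b & 0x07; min = 0x10000; }
  else
    goto bad;        /* stray continuation byte or invalid lead byte */

  if (len < need + 1)
    goto bad;        /* truncated sequence */
  for (size_t i = 1; i <= need; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        goto bad;
      c = (c << 6) | (s[i] & 0x3F);
    }
  /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF.  */
  if (c < min || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
    goto bad;
  *cp = c;
  return need + 1;

 bad:
  *cp = b;
  *valid = 0;
  return 1;
}
```

The result is a uint32_t rather than a wchar_t, so code points above U+FFFF
fit even on systems with a 16-bit wchar_t.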

I will continue to work on multibyte support in any case,
but I think it will make things much better if we are not tied down by these
(legacy?) issues.

With a bit of hand-waving, wouldn't it be reasonable to say that the largest
portion of GNU coreutils users have systems that both have a usable wchar_t
*and* work primarily in UTF-8?

At the risk of mixing apples and oranges, statistics on website encodings
show that UTF-8 has become clearly dominant over time:
   https://w3techs.com/technologies/details/en-utf8/all/all
   http://pinyin.info/news/2015/utf-8-unicode-vs-other-encodings-over-time/
I know coreutils is not meant for the web, but I hope this hints that
UTF-8 is gaining popularity beyond websites as well.

Looking at other implementations, some chose to switch to UTF-8 completely
(e.g. OpenBSD-6, or Linux with musl-libc). Others have a usable wchar_t and
have supported multibyte processing for a long time (e.g. FreeBSD, Mac OS X).

I have skimmed through past mailing-list discussions, and Eric has been
replying since about 2006 saying essentially "if someone comes up with an
efficient implementation we'll add it" - but despite many attempts, we still
don't have one.

Supporting only UTF-8 would not be a regression for the few systems where
wchar_t is insufficient, because coreutils currently doesn't provide any
multibyte support.


Lastly,
I've arranged my notes into a web page.
I hope these notes will save some time for others interested in catching up
on the multibyte issue (except for the time it'll take to read my notes (-: ):
  http://crashcourse.housegordon.org/coreutils-multibyte-support.html


I'm happy to hear comments and feedback.

Thanks for reading so far,
 - assaf
