coreutils

[PATCH] Multibyte support for expand and unexpand v2


From: Ondrej Oprala
Subject: [PATCH] Multibyte support for expand and unexpand v2
Date: Tue, 29 Sep 2015 10:47:43 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.2.0

Hi all,
this is a loose continuation of my patch from two years ago [1]. I'm reposting the patches, rewritten to use only gnulib's mbfile (and thus, implicitly, mbchar) modules instead of linking against libunistring. Although libunistring is extremely lightweight compared to other solutions such as libicu, IMHO it still brings too much overhead for utilities as simple as {,un}expand. We don't do any character classification (from a Unicode standpoint) in {,un}expand; all we need to know is a character's column width and whether it is a tab or a space. mbchar already provides this, and it can transparently work with non-Unicode input as well (see tests). I've included some changes Pádraig proposed when I first posted these, so I am listing him as a co-author.
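To illustrate the two pieces of information mentioned above (a character's column width and whether it is a tab), here is a minimal sketch. It does not use gnulib's mbfile/mbchar and is not taken from the patch; it substitutes the standard mbrtowc() and wcwidth() calls as a stand-in, and the names next_tab_stop() and display_width() are hypothetical helpers invented for this example.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* makes wcwidth() visible on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: column reached by a tab typed at `column`,
   with tab stops every `tab_size` columns. */
static size_t next_tab_stop(size_t column, size_t tab_size)
{
    assert(tab_size > 0);
    return column + (tab_size - column % tab_size);
}

/* Hypothetical helper: display width of the NUL-terminated line `s`,
   decoded according to the current locale.  Returns (size_t)-1 if the
   input contains an invalid or incomplete multibyte sequence. */
static size_t display_width(const char *s, size_t tab_size)
{
    mbstate_t state;
    size_t column = 0;
    size_t left = strlen(s);

    memset(&state, 0, sizeof state);
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or truncated sequence */
        if (n == 0)
            break;                      /* embedded NUL; not expected here */

        if (wc == L'\t') {
            column = next_tab_stop(column, tab_size);
        } else {
            int w = wcwidth(wc);        /* -1 for non-printable characters */
            if (w > 0)
                column += (size_t)w;
        }
        s += n;
        left -= n;
    }
    return column;
}
```

A caller would normally run setlocale(LC_ALL, "") first so that the decoding follows the user's locale. Without that call the program stays in the "C" locale, where only ASCII input is guaranteed to decode.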

The code flow is essentially unchanged and there is no code duplication. Thus, pure ASCII input is processed in the same manner as any other combination of input characters.

RFC:
* Should we expect non-POSIX whitespace in parse_tab_stops() as well, or is that where we should draw the line?
* BOMs - there is already a RH BZ for {,un}expand (#1158494) that basically claims the UTF-8 BOM should be honored even when the utils are run under different locale settings. It seems some editors do this (kate, emacs), and even UTF-8 enabled terminals interpret it when the encoding in their settings is set to a different one (konsole), unless I filter it back with luit. I am personally against this special casing, as IMHO it would have no end. Soon, someone with a GB18030-encoded file will come around claiming we shouldn't interpret its first three bytes as a BOM when running under a UTF-8 locale, etc...
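For reference, the check that the BOM request boils down to is small. The sketch below is not part of the patch and only shows what "honoring the UTF-8 BOM" would have to detect; the name has_utf8_bom() is a hypothetical helper. It also shows why the special case is fragile: the test is a plain byte comparison, so any file in another encoding that happens to start with these three bytes would match too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the buffer starts with the three-byte
   UTF-8 encoding of U+FEFF (EF BB BF).  This is a pure byte match and
   knows nothing about the locale or the file's real encoding. */
static bool has_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    assert(buf != NULL || len == 0);
    return len >= sizeof bom && memcmp(buf, bom, sizeof bom) == 0;
}
```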

Thanks for any and all comments,
 Ondrej

[1] https://lists.gnu.org/archive/html/coreutils/2013-02/msg00102.html

Attachment: 0001-expand-unexpand-add-multibyte-support.patch
Description: Text Data

