[Implemented] [coreutils] Partial UTF-8 support for "cut -c"
From: jaime.mosquera
Subject: [Implemented] [coreutils] Partial UTF-8 support for "cut -c"
Date: Mon, 12 Aug 2019 21:19:54 +0200 (CEST)
Good evening.
I have partially implemented the option "-c" ("--characters") for UTF-8
non-ASCII characters, so that using a text in any language other than English
does not result in rather subtle bugs ("cut -c 1-79" produces 79 characters,
except that lines with one accented letter are one character short;
furthermore, depending on where you cut, you may get "partial", unprintable
characters). My modifications are attached as a patch file (created through
git) to the last version found on GitHub (as cloned earlier today).
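
To illustrate the problem, here is a minimal Python sketch (not the C code
from the patch; the names cut_bytes and cut_chars are made up for this
example). It assumes the input is UTF-8 and contrasts cutting by bytes with
cutting by code points:

```python
def cut_bytes(line: str, n: int) -> bytes:
    """Keep the first n bytes of the UTF-8 encoding (byte-oriented cut)."""
    return line.encode("utf-8")[:n]


def cut_chars(line: str, n: int) -> str:
    """Keep the first n code points (character-oriented cut)."""
    return line[:n]


word = "a\u00f1o"  # "año": the middle letter takes two bytes in UTF-8

# Cutting after two bytes splits the two-byte letter in half.
print(cut_bytes(word, 2))  # b'a\xc3'

# Cutting after two characters keeps the letter whole.
print(cut_chars(word, 2).encode("utf-8"))  # b'a\xc3\xb1'
```

The first result ends in a lone lead byte, which is the "partial",
unprintable character mentioned above.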
This implementation has two somewhat important shortcomings:
* Other encodings are not implemented. It should not be too difficult to
implement UTF-16, and UTF-32 definitely less so, but branching between them
would make the code a bit more difficult to understand and require a simple way
to detect the current encoding and act accordingly. Furthermore, more encodings
would be needed (Japan still uses non-Unicode encodings with some frequency),
so I decided to stick with just UTF-8.
* Modifier (combining) characters are treated as individual characters,
instead of being processed along with the characters they modify, as Unicode
dictates. In particular, text in many Western European languages (Spanish,
Portuguese...) might or might not work with this program, depending on whether
accented letters are stored as a single precomposed character or as a base
letter followed by a combining mark (on my computer it worked perfectly).
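
The second shortcoming can be shown with a short Python sketch (again not
part of the patch). The helper names clusters and cut_clusters are
hypothetical, and grouping a base character with the combining marks that
follow it is only a rough approximation of what Unicode calls a grapheme
cluster:

```python
import unicodedata


def clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch  # attach the mark to the preceding base character
        else:
            groups.append(ch)
    return groups


def cut_clusters(text: str, n: int) -> str:
    """Keep the first n base-plus-marks groups."""
    return "".join(clusters(text)[:n])


precomposed = "caf\u00e9"  # "é" as one code point (U+00E9)
decomposed = "cafe\u0301"  # "e" followed by COMBINING ACUTE ACCENT (U+0301)

print(len(precomposed), len(decomposed))        # 4 5
print(decomposed[:4] == "cafe")                 # True: the accent is cut off
print(cut_clusters(decomposed, 4) == decomposed)  # True: the accent is kept
```

Cutting per code point, as the patch does, behaves like the slice
decomposed[:4]; text that uses precomposed letters is unaffected.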
On the other hand, missing bytes in a multibyte UTF-8 character are correctly
handled (the incomplete character is printed, but the next character is read
whole, without misreading any bytes as part of the previous character).
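
One way to get this behaviour, sketched in Python under the assumption that
the lead byte announces the expected length and that a character ends early
as soon as a non-continuation byte appears (the patch itself may do this
differently; expected_length and split_characters are hypothetical names):

```python
def expected_length(lead: int) -> int:
    """Number of bytes announced by a UTF-8 lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # stray continuation or invalid lead byte: treat as one unit


def split_characters(data: bytes) -> list:
    """Split bytes into characters, ending an incomplete character early."""
    chars = []
    i = 0
    while i < len(data):
        need = expected_length(data[i])
        j = i + 1
        # Consume continuation bytes (10xxxxxx) only, up to the announced length.
        while j < len(data) and j - i < need and (data[j] & 0xC0) == 0x80:
            j += 1
        chars.append(data[i:j])
        i = j
    return chars


# The first 0xC3 has lost its continuation byte; "b" is still read whole.
print(split_characters(b"a\xc3b\xc3\xb1"))  # [b'a', b'\xc3', b'b', b'\xc3\xb1']
```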
It is my hope that you will find this first approach to the problem
sufficient for most uses, and incorporate it into the mainline code.
Greetings.
(Should my modifications be big enough to require it for copyright reasons, my
name is "Jaime Mosquera", and I obviously agree to the terms of the GNU GPL.)
utf8cut.patch
Description: Text Data