Re: Bug reported regarding Unicode handling in email address
From: Steffen Nurpmeso
Subject: Re: Bug reported regarding Unicode handling in email address
Date: Mon, 14 Jun 2021 22:38:38 +0200
User-agent: s-nail v14.9.22-153-g062ff456ec

Ken Hornstein wrote in
<20210614165452.A056F120B52@pb-smtp20.pobox.com>:
|>Sure, convert to Unicode, work in Unicode, convert back, that is
|>the way to go.
|
|I know that this is application dependent, but what "work" do you
|need to perform on the characters?
|
|I realized back when I was originally looking at i18n issues in nmh we
|don't need to perform THAT much work on characters internally. We DO
|do some work when it comes to calculating character width in the format
|engine, but that's all in the native character set. So I realized that
|at least for nmh, there's no advantage to converting to Unicode/UTF-8
|internally, and a number of disadvantages; like you say, the xlocale
|functions are non-portable and you can't really get there with the
|existing POSIX APIs. Converting internally to Unicode would force you
|to depend on something like ICU.

kre was coming from a "per draft source character set" i think.
But of course, application dependent. It is more general than "i
really need this now to get nmh (or mailx) going". When i went
online around 2010 there was a Python member (Murray, who did the
rewrite of the Python mail engine) who was (or is) an nmh user, as
he said. I looked at nmh but i think it could not even do MIME back then?
Granted, i do not know much of nmh. You definitely need is-space
for line break detection, and if you do the display yourself you need
is-print or is-control etc. Etc etc, you know at least as well as
i do. Just saying.
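
As a rough sketch of the kind of per-character work meant here, the following uses only the standard calls mbrtowc(3) and iswspace(3) to find the last whitespace character at which a line could be broken. The name last_break_offset is made up for illustration, the result depends on the current LC_CTYPE locale, and invalid input simply ends the scan.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: return the byte offset of the last whitespace
 * character within the first `limit` bytes of `s`, or (size_t)-1 if
 * there is none.  Decoding follows the current LC_CTYPE locale. */
size_t last_break_offset(const char *s, size_t limit)
{
    mbstate_t st;
    size_t off = 0, len = strlen(s), best = (size_t)-1;

    if (limit > len)
        limit = len;
    memset(&st, 0, sizeof st);

    while (off < limit) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s + off, limit - off, &st);

        if (n == (size_t)-1 || n == (size_t)-2 || n == 0)
            break;              /* invalid, truncated, or NUL: stop */
        if (iswspace((wint_t)wc))
            best = off;         /* remember the latest break candidate */
        off += n;
    }
    return best;
}
```

A caller would cut the line at the returned offset. Real line breaking has more rules than "break at whitespace", so this only shows where the is-space test enters the picture.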
But even with only columns there are problems, like bidi.
I added "headline-bidi" "support"

  In general setting this variable will cause Mailx to encapsulate
  text fields that may occur when displaying headline[435] (and some
  other fields, like dynamic expansions in prompt[517]) with special
  Unicode control sequences; it is possible to fine-tune the terminal
  support level by assigning a value: no value (or any value other
  than ‘1’, ‘2’ and ‘3’) will make Mailx assume that the terminal is
  capable to properly deal with Unicode version 6.3, in which case
  text is embedded in a pair of U+2068 (FIRST STRONG ISOLATE) and
  U+2069 (POP DIRECTIONAL ISOLATE) characters. In addition no space
  on the line is reserved for these characters.

  Weaker support is chosen by using the value ‘1’ (Unicode 6.3, but
  reserve the room of two spaces for writing the control sequences
  onto the line). The values ‘2’ and ‘3’ select Unicode 1.1 support
  (U+200E, LEFT-TO-RIGHT MARK); the latter again reserves room for
  two spaces in addition.

but it is no good here (st(1)). Best is without it :)
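
The four support levels in the manual text above can be sketched as follows. This is not the Mailx code: bidi_wrap and its parameters are hypothetical, and placing U+200E on both sides of the text for levels 2 and 3 is an assumption, since the manual text only names the character.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* UTF-8 encodings of the control characters named in the manual text. */
#define FSI "\xE2\x81\xA8"  /* U+2068 FIRST STRONG ISOLATE */
#define PDI "\xE2\x81\xA9"  /* U+2069 POP DIRECTIONAL ISOLATE */
#define LRM "\xE2\x80\x8E"  /* U+200E LEFT-TO-RIGHT MARK */

/* Hypothetical helper: wrap `text` according to `level` (the value
 * assigned to the variable; anything but 1, 2, 3 means full Unicode 6.3
 * support).  Stores the number of columns to set aside for the control
 * characters in *reserve.  Returns a malloc()ed string, or NULL. */
char *bidi_wrap(const char *text, int level, int *reserve)
{
    const char *pre, *post;
    size_t len;
    char *out;

    if (level == 2 || level == 3) {
        pre = post = LRM;       /* assumption: a mark on both sides */
    } else {
        pre = FSI;
        post = PDI;
    }
    *reserve = (level == 1 || level == 3) ? 2 : 0;

    len = strlen(pre) + strlen(text) + strlen(post) + 1;
    if ((out = malloc(len)) == NULL)
        return NULL;
    snprintf(out, len, "%s%s%s", pre, text, post);
    return out;
}
```

The reserve value is what a headline formatter would subtract from the available width on terminals that render the control characters as visible cells.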

  From steffen Tue Jun 3 13:42:08 2014
  Date: Tue, 03 Jun 2014 13:42:08 +0200
  From: =?utf-8?B?2KPYrdmF2K8g2KfZhNmF2K3ZhdmI2K/Zig==?= <ex@am.ple>
  To: =?utf-8?B?2KPYrdmF2K8g2KfZhNmF2K3ZhdmI2K/Zig==?=
  Subject: =?utf-8?B?2KPYrdmF2K8g2KfZhNmF2K3ZhdmI2K/Zig==?=
  MIME-Version: 1.0
  Content-Type: multipart/mixed;
   boundary="=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_"
  Status: R

  This is a multi-part message in MIME format.

  --=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
  Content-Disposition: inline

  أحمد المحمودي.

  --=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
  Content-Disposition: attachment;
   filename="أحمد المحمودي.txt"

  أحمد المحمودي.

  --=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_--

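
The From, To and Subject headers of the sample message are RFC 2047 encoded-words of the form =?charset?B?payload?=, where "B" means the payload is base64. A minimal sketch of just the base64 step, with hypothetical names (b64_value, b64_decode): a real decoder would also have to parse the framing, handle the "Q" encoding, and convert from the named charset.

```c
#include <stddef.h>
#include <string.h>

/* Map one base64 alphabet character to its 6-bit value, or -1. */
static int b64_value(int c)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    const char *p;

    if (c == '\0')
        return -1;
    p = strchr(tbl, c);
    return p != NULL ? (int)(p - tbl) : -1;
}

/* Hypothetical helper: decode a base64 payload (the text between "?B?"
 * and "?=") into `out`.  Decoding stops at the first '=' padding
 * character.  Returns the number of bytes written, or (size_t)-1 on a
 * character outside the alphabet or if `out` is too small. */
size_t b64_decode(const char *in, unsigned char *out, size_t outsize)
{
    size_t n = 0;
    unsigned long acc = 0;
    int bits = 0;

    for (; *in != '\0' && *in != '='; ++in) {
        int v = b64_value((unsigned char)*in);

        if (v < 0)
            return (size_t)-1;
        acc = (acc << 6) | (unsigned long)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            if (n >= outsize)
                return (size_t)-1;
            out[n++] = (unsigned char)((acc >> bits) & 0xFF);
            acc &= (1UL << bits) - 1;   /* keep only the leftover bits */
        }
    }
    return n;
}
```

Fed the payload from the headers above, this yields 25 bytes of UTF-8: the twelve Arabic letters at two bytes each plus one space.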
Nah, *really* proper internationalization is a very complicated
thing, but it seems you can get away with only slightly touching
this in an email program unless you display or edit actual text.
As i have no right-to-left capabilities, i cannot test anyway.
|>Really, the older i get the more i think that UTF-16 is not the
|>worst decision regarding Unicode. Surrogate pairs have to be
|>handled, but for UTF-8 you always have to live with multibyte
|>anyway.
|
|I guess I think out of all of the possible worlds, UTF-8 is probably
|the best compromise.
For serialization you are surely right. This imposes a conversion
back and forth to wchar_t with POSIX interface, then. And you
have already lost the performance battle.
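
To make that round trip concrete, here is a sketch using only standard calls (mbstowcs(3), towupper(3), wcstombs(3)). The function mb_toupper is hypothetical and upper-casing merely stands in for "some work"; the point is the two counting passes, two conversion passes and two allocations around it. Decoding follows the current LC_CTYPE locale.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: upper-case a multibyte string by converting to
 * wchar_t and back.  Returns a malloc()ed string, or NULL on invalid
 * input or allocation failure. */
char *mb_toupper(const char *s)
{
    size_t wlen, mlen, i;
    wchar_t *w;
    char *out = NULL;

    /* Count, then convert multibyte -> wide. */
    if ((wlen = mbstowcs(NULL, s, 0)) == (size_t)-1)
        return NULL;
    if ((w = calloc(wlen + 1, sizeof *w)) == NULL)
        return NULL;
    mbstowcs(w, s, wlen + 1);

    /* The actual "work" happens on wide characters. */
    for (i = 0; i < wlen; ++i)
        w[i] = (wchar_t)towupper((wint_t)w[i]);

    /* Count, then convert wide -> multibyte. */
    if ((mlen = wcstombs(NULL, w, 0)) != (size_t)-1 &&
            (out = malloc(mlen + 1)) != NULL)
        wcstombs(out, w, mlen + 1);
    free(w);
    return out;
}
```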
You know, that is _really_ weird. Remembering NetBSD pimping
their vis(3) (i was subscribed to their source-changes for years),
vis(3) goes through this:

  /* Allocate space for the wide char strings */
  psrc = pdst = extra = NULL;
  mdst = NULL;
  if ((psrc = calloc(mbslength + 1, sizeof(*psrc))) == NULL)
      return -1;
  if ((pdst = calloc((16 * mbslength) + 1, sizeof(*pdst))) == NULL)
      goto out;
  if (*mbdstp == NULL) {
      if ((mdst = calloc((16 * mbslength) + 1, sizeof(*mdst))) == NULL)
          goto out;
      *mbdstp = mdst;
  }

And you do not want to look at the rest. I mean, wow!, that
entirely hammers you off the map! And not to talk about
longjmp(3) or other signal mess.
The good thing about UTF-16 is that one unit fits an unsigned short,
say u16, and this covers Plane 0, the BMP aka the base which
stores practically all normal languages, out of the box. Whereas
i had to look, but i think Chinese etc. can already need 4 bytes
in UTF-8, and definitely 3. So UTF-8 pretties up stuff for an
all-american or all-english view. Which is ok, but not nice to
almost all other languages of the world. I mean that ship has
sailed, and it nicely transposes full Unicode to C-style 8-bit
strings. perl(1) has fantastic Unicode support (to the best of my
knowledge) via UTF-8 storage (and even used an extended format >5
years ago, where sequences could i think even reach 10 bytes
or so).
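
To put numbers on the size comparison, a small sketch that follows the standard encoding ranges (UTF-8 per RFC 3629, UTF-16 with surrogate pairs above the BMP). The function names are made up for illustration, and the input is assumed to be a valid Unicode scalar value.

```c
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8. */
int utf8_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;       /* ASCII */
    if (cp < 0x800)
        return 2;       /* e.g. Latin with diacritics, Greek, Arabic */
    if (cp < 0x10000)
        return 3;       /* rest of the BMP, including most CJK */
    return 4;           /* supplementary planes */
}

/* Bytes needed to encode one code point in UTF-16. */
int utf16_len(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;    /* one unit, or a surrogate pair */
}
```

So a common Chinese character such as U+4E2D takes 3 bytes in UTF-8 and 2 in UTF-16, while a character from the supplementary planes, say U+20000, takes 4 bytes in both.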
--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)