Re: Bug reported regarding Unicode handling in email address

nmh-workers

From:	Steffen Nurpmeso
Subject:	Re: Bug reported regarding Unicode handling in email address
Date:	Mon, 14 Jun 2021 22:38:38 +0200
User-agent:	s-nail v14.9.22-153-g062ff456ec

Ken Hornstein wrote in
 <20210614165452.A056F120B52@pb-smtp20.pobox.com>:
 |>Sure, convert to Unicode, work in Unicode, convert back, that is
 |>the way to go.
 |
 |I know that this is application dependent, but what "work" do you
 |need to perform on the characters?
 |
 |I realized back when I was originally looking at i18n issues in nmh we
 |don't need to perform THAT much work on characters internally.  We DO
 |do some work when it comes to calculating character width in the format
 |engine, but that's all in the native character set.  So I realized that
 |at least for nmh, there's no advantage to converting to Unicode/UTF-8
 |internally, and a number of disadvantages; like you say, the xlocale
 |functions are non-portable and you can't really get there with the
 |existing POSIX APIs.  Converting internally to Unicode would force you
 |to depend on something like ICU.

kre was coming from a "per draft source character set" i think.
But of course, application dependent.  It is more general than "i
really need this now to get nmh (or mailx) going".  When i went
online around 2010 there was a Python member (Murray, who did the
rewrite of the Python mail engine) who was (or is) an nmh user, as
he said.  I looked at nmh but i think it could not even do MIME by then?
Granted i do not know much of nmh.  You definitely need is-space
for line break detection, and if you render the text yourself you need
is-print or is-control etc.  Etc etc, you know at least as well as
i do.  'Just saying.
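
The classification needs mentioned above (is-space for line breaks, is-print and is-control for display) can be sketched with the standard wide character interfaces. This is only a minimal illustration, not code from nmh or s-nail; `classify_text` and `struct char_tally` are hypothetical names, and the result depends on the active LC_CTYPE locale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Tallies produced by classify_text(); a hypothetical helper type. */
struct char_tally {
    size_t spaces;    /* iswspace(): candidate line-break positions */
    size_t printable; /* iswprint() and not a space */
    size_t control;   /* iswcntrl() and not a space */
    size_t invalid;   /* bytes that do not decode in the current locale */
};

/* Decode a multibyte string in the current LC_CTYPE locale and classify
 * each character.  Returns the number of characters decoded. */
size_t classify_text(const char *s, struct char_tally *t)
{
    mbstate_t st;
    size_t left = strlen(s), count = 0;

    memset(&st, 0, sizeof st);
    memset(t, 0, sizeof *t);

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or truncated sequence: skip one byte, reset state. */
            t->invalid++;
            memset(&st, 0, sizeof st);
            s++;
            left--;
            continue;
        }
        if (n == 0) /* NUL decoded; not expected within the strlen() bound */
            break;

        if (iswspace((wint_t)wc))
            t->spaces++;
        else if (iswprint((wint_t)wc))
            t->printable++;
        else if (iswcntrl((wint_t)wc))
            t->control++;
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```

A caller would normally run `setlocale(LC_CTYPE, "")` first; without it the program stays in the "C" locale and only ASCII is classified reliably.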

But even with only columns there are problems, like bidi.
I added "headline-bidi" "support"

  In general setting this variable will cause Mailx to encapsulate
  text fields that may occur when displaying headline[435] (and some
  other fields, like dynamic expansions in prompt[517]) with special
  Unicode control sequences; it is possible to fine-tune the terminal
  support level by assigning a value: no value (or any value other
  than ‘1’, ‘2’ and ‘3’) will make Mailx assume that the terminal is
  capable to properly deal with Unicode version 6.3, in which case
  text is embedded in a pair of U+2068 (FIRST STRONG ISOLATE) and
  U+2069 (POP DIRECTIONAL ISOLATE) characters.  In addition no space
  on the line is reserved for these characters.

  Weaker support is chosen by using the value ‘1’ (Unicode 6.3, but
  reserve the room of two spaces for writing the control sequences
  onto the line).  The values ‘2’ and ‘3’ select Unicode 1.1 support
  (U+200E, LEFT-TO-RIGHT MARK); the latter again reserves room for
  two spaces in addition.

but it is no good here (st(1)).  Best is without it :)

  From steffen Tue Jun  3 13:42:08 2014
  Date: Tue, 03 Jun 2014 13:42:08 +0200
  From: =?utf-8?B?2KPYrdmF2K8g2KfZhNmF2K3ZhdmI2K/Zig==?= <ex@am.ple>
  To: =?utf-8?B?2KPYrdmF2K8g2KfZhNmF2K3ZhdmI2K/Zig==?=
  Subject: =?utf-8?B?2KPYrdmF2K8g2KfZhNmF2K3ZhdmI2K/Zig==?=
  MIME-Version: 1.0
  Content-Type: multipart/mixed;
   boundary="=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_"
  Status: R

  This is a multi-part message in MIME format.

  --=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
  Content-Disposition: inline

  أحمد المحمودي.

  --=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
  Content-Disposition: attachment;
   filename="أحمد المحمودي.txt"

  أحمد المحمودي.

  --=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_--

Nah, *really* proper internationalization is a very complicated
thing, but it seems you can get away with only slightly touching
this in an email program unless you display or edit actual text.
As i have no right-to-left capabilities, i cannot test anyway.
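
For illustration, the isolate embedding that the manual excerpt above describes (text placed between U+2068 and U+2069) could look roughly like the following for UTF-8 text. This is a sketch under assumptions, not the s-nail implementation; `wrap_isolate` is a hypothetical helper, and it covers only the isolate variant, not the U+200E levels.

```c
#include <stddef.h>
#include <string.h>

/* UTF-8 encodings of the two control characters named in the manual text. */
#define FSI "\xE2\x81\xA8" /* U+2068 FIRST STRONG ISOLATE */
#define PDI "\xE2\x81\xA9" /* U+2069 POP DIRECTIONAL ISOLATE */

/* Embed `text` (UTF-8) between FSI and PDI.  Writes a NUL-terminated
 * result into dst and returns its length in bytes, or -1 if dst is too
 * small.  Hypothetical helper for illustration only. */
long wrap_isolate(char *dst, size_t dstsize, const char *text)
{
    size_t tl = strlen(text);
    size_t need = (sizeof FSI - 1) + tl + (sizeof PDI - 1);

    if (need + 1 > dstsize)
        return -1;
    memcpy(dst, FSI, sizeof FSI - 1);
    memcpy(dst + (sizeof FSI - 1), text, tl);
    memcpy(dst + (sizeof FSI - 1) + tl, PDI, sizeof PDI - 1);
    dst[need] = '\0';
    return (long)need;
}
```

The two control characters add six bytes but, on a terminal that understands them, no columns. That is why the manual distinguishes between reserving no room and reserving two spaces for terminals that print something for them.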

 |>Really, the older i get the more i think that UTF-16 is not the
 |>worst decision regarding Unicode.  Surrogate pairs have to be
 |>handled, but for UTF-8 you always have to live with multibyte
 |>anyway.
 |
 |I guess I think out of all of the possible worlds, UTF-8 is probably
 |the best compromise.

For serialization you are surely right.  This imposes a conversion
back and forth to wchar_t with the POSIX interfaces, then.  And you
have already lost the performance battle.
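
A minimal sketch of that round trip, assuming the POSIX behaviour of `mbstowcs()` and `wcstombs()` with a null destination (they then return the required length). `roundtrip_via_wchar` is a hypothetical name; the point is only to show the two extra passes and two allocations per string.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Round trip: multibyte -> wchar_t -> multibyte in the current locale.
 * Returns a malloc()ed copy, or NULL on conversion or allocation failure.
 * Hypothetical helper for illustration only. */
char *roundtrip_via_wchar(const char *mb)
{
    size_t wlen = mbstowcs(NULL, mb, 0); /* first pass: count wide chars */
    wchar_t *w;
    char *out = NULL;
    size_t olen;

    if (wlen == (size_t)-1)
        return NULL;
    if ((w = calloc(wlen + 1, sizeof *w)) == NULL)
        return NULL;
    mbstowcs(w, mb, wlen + 1);           /* second pass: convert */

    /* Any work on wide characters would happen here. */

    olen = wcstombs(NULL, w, 0);         /* third pass: count bytes */
    if (olen != (size_t)-1 && (out = malloc(olen + 1)) != NULL)
        wcstombs(out, w, olen + 1);      /* fourth pass: convert back */
    free(w);
    return out;
}
```

The caller owns the returned buffer and has to `free()` it.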

You know, that is _really_ weird.  Remembering NetBSD pimping
their vis(3) (i was subscribed to their source-changes for years),
vis(3) goes through this:

    /* Allocate space for the wide char strings */
    psrc = pdst = extra = NULL;
    mdst = NULL;
    if ((psrc = calloc(mbslength + 1, sizeof(*psrc))) == NULL)
            return -1;
    if ((pdst = calloc((16 * mbslength) + 1, sizeof(*pdst))) == NULL)
            goto out;
    if (*mbdstp == NULL) {
            if ((mdst = calloc((16 * mbslength) + 1, sizeof(*mdst))) == NULL)
                    goto out;
            *mbdstp = mdst;
    }

And you do not want to look at the rest.  I mean, wow!, that
entirely hammers you off the map!  And not to talk about
longjmp(3) or other signal mess.

The good thing about UTF-16 is that a code unit fits an unsigned short,
or say u16, and a single unit covers Plane 0, the BMP aka the base which
stores practically all normal languages, out of the box.  Whereas
i had to look, but i think Chinese etc. definitely needs 3 bytes in
UTF-8, and 4 outside the BMP.  So UTF-8 pretties up stuff for an
all-american or all-english view.  Which is ok, but not nice to
almost all other languages of the world.  I mean that ship has
sailed, and it nicely transposes full Unicode to C-style 8-bit
strings.  perl(1) has fantastic Unicode support (to the best of my
knowledge) via UTF-8 storage (and even used an extended format >5
years ago, where sequences could i think even reach 10 bytes
or so).
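
The size comparison can be made concrete with two small functions that follow the standard UTF-8 (RFC 3629) and UTF-16 encoding ranges. The function names are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode code point cp in UTF-8, or 0 if cp is not a
 * valid Unicode scalar value. */
size_t utf8_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not encodable */
    if (cp < 0x80)
        return 1;               /* ASCII */
    if (cp < 0x800)
        return 2;               /* e.g. Latin supplements, Arabic */
    if (cp < 0x10000)
        return 3;               /* rest of the BMP, e.g. most CJK */
    if (cp <= 0x10FFFF)
        return 4;               /* supplementary planes */
    return 0;
}

/* 16-bit code units needed in UTF-16, or 0 if cp is not valid. */
size_t utf16_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000)
        return 1;               /* BMP: one unit */
    if (cp <= 0x10FFFF)
        return 2;               /* surrogate pair */
    return 0;
}
```

For U+4E2D this gives 3 bytes in UTF-8 against one 16-bit unit (2 bytes) in UTF-16. For U+0623, as used in the Arabic sample message above, it is 2 bytes either way, and for ASCII UTF-8 needs half the space.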

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
