Re: Getting Emacs to play nice with Hunspell and apostrophes

From: Yuri Khan
Subject: Re: Getting Emacs to play nice with Hunspell and apostrophes
Date: Sat, 14 Jun 2014 14:11:51 +0700

On Sat, Jun 14, 2014 at 9:38 AM, Emanuel Berg <address@hidden> wrote:
> Yuri Khan <address@hidden> writes:
>> You could order a book in an Internet shop, have them
>> completely b0rk up the encoding of the shipping
>> address:
>> Then somebody at the postal system might decode the
>> characters and the package would still be delivered
>> at the intended address.
> Ha-ha, unbelievable! How did that happen? First you
> wrote in Russian at the Internet shop's web page - then
> it got like that because of them translating Unicode
> (?) to ISO-8859-1 (which is 8-bit, with the ASCII as
> its lower half) - ? Why didn't the Internet shop do it?

First I must say it’s not mine and likely not a common occurrence for
the Russian Post which is nowadays notorious for its lack of customer

In technical terms, I can think of the following sequence of events:

* The user comes to a website containing an order form. (The form
contains a free input <textarea> for the street address and possibly
an <input> for the recipient name, and a <select> for the country. The
latter ensures that the word RUSSIE is printed in its legible form.)
* The user enters her address and name into the web form, in Russian;
also selects Russian Federation from the country dropdown.
* The browser encodes the address in KOI8-R, one of the three code
pages used in Russia. In this encoding, the string Москва (Moscow) has
the following byte representation: ED CF D3 CB D7 C1. (The KOI8-R
encoding was designed in such a way that it remains readable if the
high bit is stripped: mOSKWA. Too bad the links were already
8-bit-clean at the time Harry Potter was published.)
* The browser sends the form data to the web server, labeled as
Content-Type: application/x-www-form-urlencoded; charset=KOI8-R. (At
that time, Unicode was not as ubiquitous as it is now; browsers
operated in an encoding that best matched the user’s input.)
* The web server passes the form data to the backend script (Perl CGI
or possibly PHP running as a module).
* The backend script disregards the charset= parameter and reinterprets
the string as if it were encoded in ISO-8859-1 (or possibly
windows-1252, which is an extension of ISO-8859-1). The byte
representation ED CF D3 CB D7 C1 decodes into íÏÓË×Á (small i with
acute, capital I with diaeresis, capital O with acute, capital E with
diaeresis, multiplication sign, capital A with acute). This string
then gets stored in the database (which is likely configured to
operate in ISO-8859-1 or windows-1252) and lives happily ever after.
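
The byte-level steps in the list above can be checked with a short
Python sketch. It is only an illustration and not what any real shop
backend ran; the variable names are made up, and it assumes Python's
built-in koi8-r and iso-8859-1 codecs.

```python
# What the browser sends: the text encoded in KOI8-R.
text = "Москва"
raw = text.encode("koi8-r")
hex_bytes = " ".join(f"{b:02X}" for b in raw)
print(hex_bytes)    # ED CF D3 CB D7 C1

# Stripping the high bit of each byte leaves a readable transliteration.
stripped = bytes(b & 0x7F for b in raw).decode("ascii")
print(stripped)     # mOSKWA

# What the backend does: decode the same bytes as ISO-8859-1.
mojibake = raw.decode("iso-8859-1")
print(mojibake)     # íÏÓË×Á

# The damage is reversible as long as the bytes themselves survive:
# re-encode as ISO-8859-1, then decode with the right table.
recovered = mojibake.encode("iso-8859-1").decode("koi8-r")
print(recovered)    # Москва
```

The last step is what somebody who knows both tables would do; it only
works because ISO-8859-1 maps every byte to exactly one character, so
nothing is lost in the wrong decoding.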

> Did they actually think that was a language or some
> transcription of Russian?

Most probably, at the time a human being at the merchant side got
involved, the address was already mangled. They did not have the
knowledge of Russian code pages, and decided to make a best reasonable
effort — “send it as is and let those crazy Russians sort it out”.

> How was it translated to
> Russian at the postal office? I can only make out the
> first line: Russia, Moscow.

The package contains two pieces of information — the country name in
French (RUSSIE) and the postal code 119415 — which get the package to
the postal office 119415 at 14 Udaltsov street in Moscow, near the
customer’s place of residence. (Postal codes are unique within Russia,
the first three digits unambiguously identifying the city.) (pin at the post office building).

The worker at the post office might be familiar with both the KOI8-R
and Windows-1252 encoding tables, but that is highly unlikely.

Alternatively, the worker might regard the mysteriously labeled
package as a peculiar form of a substitution cypher puzzle. [Challenge
Accepted] He takes a red pen and starts scribbling right on the

* First, he notices that the two middle letters in the first word are
identical, and guesses that this word must be Rossi[ya] (Russia).
* This allows him to decode two letters of the next word, which can
then be guessed as Moskva (Moscow) — what else could be
* Substituting the known letters into the customer’s first name gives
“Св***а**” (Sv***a**), which the postal worker recognizes as Svetlana
(a fairly common Russian feminine name, and the most common of those
starting with Sv). (The last letter does not match because of
grammatical case declension.)
* This now gives enough information to decode and guess the street as
pr. Vernadskogo and deliver the package to the Moscow State University
dormitory at Vernadskogo, 37 (other marker at the map linked above),
room 1817-1. Probably also lecture Svetlana that, until all web sites
embrace Unicode, it’s safer to write your address in transliteration.
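
The guessing game above can be sketched in Python as a known-plaintext
attack on a substitution cipher. This is a hypothetical reconstruction:
the helper names (garble, learn, partial_decode) are invented, the
garbled strings are regenerated from the assumed plaintext rather than
copied from the actual package, and letter case is folded because a
human reader would treat upper and lower case of the same garbled
letter as one symbol.

```python
def garble(text: str) -> str:
    """Simulate the damage: KOI8-R bytes read as ISO-8859-1."""
    return text.encode("koi8-r").decode("iso-8859-1")

def learn(mapping: dict, garbled: str, guess: str) -> None:
    """Record garbled-letter -> Russian-letter pairs from a guessed word."""
    for g, p in zip(garbled.lower(), guess.lower()):
        mapping[g] = p

def partial_decode(mapping: dict, garbled: str) -> str:
    """Substitute the known letters, mark unknown ones with '*'."""
    return "".join(mapping.get(g, "*") for g in garbled.lower())

mapping = {}

# Step 1: the first word has a doubled middle letter; guess "Россия".
learn(mapping, garble("Россия"), "Россия")

# Step 2: two letters of the next word are now known.
city_hint = partial_decode(mapping, garble("Москва"))
print(city_hint)    # *ос***
learn(mapping, garble("Москва"), "Москва")

# Step 3: the first name, assumed here to be in the dative case.
name_hint = partial_decode(mapping, garble("Светлане"))
print(name_hint)    # св***а**
```

Each confirmed guess enlarges the mapping, so later words come out with
fewer gaps; the street name would be attacked the same way.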

Now, while this all makes for great war stories, it Should Not be
necessary. Unicode should be used in all stages of Internet shop order
processing, and addresses written in any local language should be
deliverable without post office workers having to solve a challenge.
