/* When calling a cygwin executable we need to explicitly convert the utf-8
arguments (utf-8 is the encoding that Emacs uses internally and in which it
passes args to external commands when coding-system-for-write is nil) to
utf-16 and call the unicode (wide) API function CreateProcessW().
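
The explicit utf-8 -> utf-16 step can be sketched roughly as below.  This is
only an illustration under stated assumptions: utf8_to_utf16 is a
hypothetical helper, not an existing Emacs function, and it is simplified
(it does not reject overlong forms or encoded surrogates).  On Windows the
same conversion would normally be done with MultiByteToWideChar (CP_UTF8,
...), and the resulting wide string passed to CreateProcessW().

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not part of Emacs): decode the LEN bytes of utf-8
// in SRC into utf-16 code units in DST, which has room for CAP units.
// Returns the number of units written, or (size_t) -1 if the input is
// malformed or DST is too small.
size_t
utf8_to_utf16 (const unsigned char *src, size_t len, uint16_t *dst, size_t cap)
{
  size_t i = 0, n = 0;
  while (i < len)
    {
      uint32_t cp;
      size_t extra;
      unsigned char b = src[i++];

      // The lead byte determines how many continuation bytes follow.
      if (b < 0x80) { cp = b; extra = 0; }
      else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
      else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
      else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
      else return (size_t) -1;

      if (len - i < extra)
        return (size_t) -1;
      for (; extra > 0; extra--)
        {
          if ((src[i] & 0xC0) != 0x80)
            return (size_t) -1;
          cp = (cp << 6) | (src[i++] & 0x3F);
        }

      if (cp < 0x10000)
        {
          if (n + 1 > cap)
            return (size_t) -1;
          dst[n++] = (uint16_t) cp;
        }
      else
        {
          // Code points above the BMP become a surrogate pair.
          if (cp > 0x10FFFF || n + 2 > cap)
            return (size_t) -1;
          cp -= 0x10000;
          dst[n++] = (uint16_t) (0xD800 | (cp >> 10));
          dst[n++] = (uint16_t) (0xDC00 | (cp & 0x3FF));
        }
    }
  return n;
}
```

For the utf-8 bytes 0xC5, 0xBE, 0xC4, 0x85 this yields the two code units
U+017E and U+0105.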
This needs to be done because of the following transcoding chain, which
might (and definitely WILL if the args contain unicode, i.e. characters
outside ascii/locale_charset) result in corrupted args:
WINAPI/OS layer:
multibyte string args (utf-8) -> CreateProcessA():
locale_codepage -> unicode (utf-16)
->
CYGWIN layer:
unicode (utf-16) <-> utf-8 ->
cygwin locale env (LC_XXX, LANG; default: C.UTF-8)
Example #1:
utf-8 string 'žą'; 'ž'(0xC5, 0xBE) 'ą'(0xC4, 0x85) transcoding
(to cygwin locale env charset) chain:
converting #1:
locale_codepage (lt, LCID: 1063, ansi/oem cp: cp1257/cp775) -> utf-16;
utf-8 string 'žą' in locale codepage (cp1257) representation: 'Å¾Ä…'
'Å'(0xC5), '¾'(0xBE), 'Ä'(0xC4), '…'(0x85).
string converted to utf-16: 'Å¾Ä…'
U+00C5(Å), U+00BE(¾), U+00C4(Ä), U+2026(…).
utf-16: 'Å¾Ä…': 'Å'(U+00C5), '¾'(U+00BE), 'Ä'(U+00C4), '…'(U+2026).
<->
utf-8 : 'Å¾Ä…': 'Å'(0xC385), '¾'(0xC2BE), 'Ä'(0xC384), '…'(0xE280A6).
converting #2:
utf-16/utf-8 -> cygwin locale env (LANG = lt_LT.cp1257);
utf-8 string 'Å¾Ä…' (0xC3, 0x85, 0xC2, 0xBE, 0xC3, 0x84, 0xE2, 0x80, 0xA6)
converted to cp1257: 'Å¾Ä…' (0xC5, 0xBE, 0xC4, 0x85)
cp1257 string 'Å¾Ä…' in utf-8 representation: 'žą'; 'ž'(0xC5BE), 'ą'(0xC485)
Although the string was (nominally) converted to cp1257 (according to the
cygwin locale env variables), the result is not 'žą' encoded in cp1257, as it
should be: the bytes are those of the original utf-8 string, so the args are
in fact passed (preserved) in utf-8 encoding.
It's important to note that such "original value preservation" happens
only under fortunate circumstances: when we are converting to the windows
locale codepage/charset and the arg string (utf-8), read in the windows locale
representation, doesn't contain some unconvertible character/combination
(e.g. undefined characters), so that it's possible to convert back (from
utf-16/utf-8 to the locale charset). Corruption _always_ occurs if we are
converting to a codepage/charset other than the current windows locale codepage.
Consider an unsuccessful/erroneous conversion example:
utf-8 string/character 'ĥ' (U+0125) passed to cygwin (utf-8):
utf-8 string 'ĥ'(0xC4A5) in locale codepage (cp1257) representation: 'Ä'(0xC4)
followed by byte 0xA5 (0xA5 is undefined in cp1257 and doesn't map to a
unicode character)
converting #1:
locale_codepage (lt, LCID: 1063, ansi/oem cp: cp1257/cp775) -> utf-16;
utf-8 string 'ĥ' in cp1257 representation: 'Ä'(0xC4) + undefined byte 0xA5
string converted to utf-16: (0x00C4, 0xF8FD)
(http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1257.txt)
0xA5 (cp1257) is mapped to 0xF8FD in Unicode (Private Use Area Range:
E000–F8FF)
utf-16: 'Ä'(U+00C4), private-use character (U+F8FD)
<->
utf-8 : 'Ä'(0xC384), private-use character (0xEFA3BD)
converting #2:
utf-16/utf-8 -> cygwin locale env (LANG = C.UTF-8);
utf-16 string: 'Ä'(U+00C4), private-use character (U+F8FD)
converted to utf-8: 'Ä'(0xC384), private-use character (0xEFA3BD)
So, the original string value 'ĥ' is transcoded to an invalid 'Ä' followed by
a private-use character, although that shouldn't happen (as no conversion is
supposed to take place, neither implicitly nor explicitly).
To conclude: erroneous conversion _always_ occurs when we are converting
to a codepage/charset other than the current windows locale codepage, and
corruption might occur even if we are not supposed to convert at all
(just pass utf-8 encoded arguments).
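
The ansi-API chain described above can be imitated with the rough sketch
below.  It is an illustration only, not the actual Windows or cygwin code:
the function names are hypothetical, the cp1257 table covers just the bytes
used in the two examples, and the cygwin side is assumed to run under
C.UTF-8.  Each byte of the utf-8 argument is read as a cp1257 character and
the resulting code point is then encoded as utf-8.

```c
#include <stddef.h>
#include <stdint.h>

// Partial cp1257 -> unicode table: only the bytes from the examples.
// 0xA5 is undefined in cp1257; U+F8FD is its best-fit mapping.
static uint16_t
cp1257_to_unicode (unsigned char b)
{
  switch (b)
    {
    case 0x85: return 0x2026;
    case 0xA5: return 0xF8FD;
    case 0xBE: return 0x00BE;
    case 0xC4: return 0x00C4;
    case 0xC5: return 0x00C5;
    // Bytes not covered by this sketch become U+FFFD.
    default:   return b < 0x80 ? b : 0xFFFD;
    }
}

// Encode one BMP code point as utf-8 into OUT (room for 3 bytes).
// Returns the number of bytes written.
static size_t
encode_utf8 (uint16_t cp, unsigned char *out)
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  out[0] = (unsigned char) (0xE0 | (cp >> 12));
  out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (cp & 0x3F));
  return 3;
}

// Hypothetical simulation: what a cygwin program under C.UTF-8 receives
// when the utf-8 bytes in ARG were decoded as cp1257 on the way.
// OUT must have room for 3 * LEN bytes.  Returns the output length.
size_t
simulate_ansi_chain (const unsigned char *arg, size_t len, unsigned char *out)
{
  size_t n = 0;
  for (size_t i = 0; i < len; i++)
    n += encode_utf8 (cp1257_to_unicode (arg[i]), out + n);
  return n;
}
```

Feeding it 'ĥ' (0xC4, 0xA5) gives 0xC3, 0x84, 0xEF, 0xA3, 0xBD, matching
the second example.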
*/