Foreign file names on MS-Windows


From: Eli Zaretskii
Subject: Foreign file names on MS-Windows
Date: Sat, 22 Mar 2008 14:50:59 +0200

This is a bit longish, but there's an important question near the end,
related to the release of Emacs 22.2, so please bear with me.

In the context of this message, ``foreign file names'' means file
names that cannot be expressed using the current system codepage.  For
example, Cyrillic file names on a system whose codepage is 1252 (which
supports only Latin-1 characters).

Problem description: The Windows filesystem holds file names in UTF-16
encoding, which allows it to support file names outside of the current
locale.  Emacs currently uses the ANSI variants of filesystem APIs, so
the file names returned by system calls on which `readdir' (emulated
in src/w32.c) relies are converted by these system calls to the
current codepage.  When Windows encounters characters that cannot be
converted to the current codepage, it converts them to question marks
`?' instead.  A `?' is a character that cannot appear in a valid file
name on a Windows filesystem, so Emacs primitives that are built on
top of `readdir', such as `directory-files-and-attributes',
`directory-files', and file-name completion primitives, all fail for
these file names in different ways: at best these files are silently
omitted from the output, and at worst you see obscure error
messages.
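
To illustrate the mechanism described above, here is a minimal,
portable sketch.  It is not the Windows conversion code: the function
names are hypothetical, and the conversion is simplified by treating
every UTF-16 code unit below 0x100 as representable (real codepage
1252 differs from Latin-1 in the 0x80-0x9F range).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the narrowing that the ANSI APIs perform:
   code units below 0x100 are copied, everything else becomes '?'.
   WIDE is a zero-terminated array of UTF-16 code units.  */
void
narrow_to_latin1 (const unsigned short *wide, char *out, size_t out_size)
{
  size_t i = 0;

  assert (out_size > 0);
  while (wide[i] != 0 && i + 1 < out_size)
    {
      out[i] = wide[i] < 0x100 ? (char) wide[i] : '?';
      i++;
    }
  out[i] = '\0';
}

/* '?' cannot appear in a valid Windows file name, so finding one in a
   name returned by the system marks a name damaged by the conversion.  */
int
name_was_mangled (const char *name)
{
  return strchr (name, '?') != NULL;
}
```

With a three-letter Cyrillic base name followed by ".txt", the
narrowed result is "???.txt", which no longer identifies any file on
disk; this is why primitives that later pass that string back to the
filesystem fail.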

A case in point is "C-x d", which on Windows uses `ls' emulation in
ls-lisp, which in turn calls `directory-files-and-attributes': a
simple "C-x d" silently omits foreign file names from the directory
listing, while "C-u C-x d -altr RET" complains about something being
nil instead of a number, and fails to sort the file names as
requested.  This is because `file-attributes' fails for a file name
that includes `?' characters, and `directory-files-and-attributes'
then returns such files without attributes.

Eventually, Emacs 23 should switch to using Unicode APIs to the
filesystem, which will resolve this problem (but we will need to
figure out how not to break W9x versions of Windows, where Unicode
support is an add-on that is typically not installed).

A temporary bandaid, and the only solution that is practical for Emacs
22, is to modify `readdir' to return the 8+3 alias of the problematic
file name instead of the long name.  The 8+3 aliases use only 7-bit
ASCII characters; they are ugly and butchered to the point of being
unrecognizable, but are otherwise fully functional.  The change below,
which I installed on the trunk, shows how to do that.  After this
change, at least "C-x d", both with and without "C-u", works for me on
directories with such file names.
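
The fallback can be sketched in isolation as follows.  This is a
simplified illustration, not the installed code: the struct is a
hypothetical stand-in for the two WIN32_FIND_DATA fields involved, the
function name is made up, and plain `strchr' is used for portability
where the patch uses `_mbspbrk'.  The check for an empty alias is an
extra assumption that the patch itself does not make; it covers
volumes that provide no short name.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the two WIN32_FIND_DATA members used by
   the patch; the real structure has more members.  */
struct find_data_sketch
{
  char cFileName[260];
  char cAlternateFileName[14];
};

/* Copy into OUT the name a directory entry should report: the long
   name normally, or the 8+3 alias when the long name contains '?'.
   Return 1 if the alias was used, 0 otherwise.  */
int
pick_dirent_name (const struct find_data_sketch *fd, char *out,
                  size_t out_size)
{
  const char *name = fd->cFileName;
  int used_alias = 0;

  assert (out_size > 0);
  /* Assumption beyond the patch: keep the long name if no alias is
     available, rather than returning an empty string.  */
  if (strchr (name, '?') != NULL && fd->cAlternateFileName[0] != '\0')
    {
      name = fd->cAlternateFileName;
      used_alias = 1;
    }
  strncpy (out, name, out_size - 1);
  out[out_size - 1] = '\0';
  return used_alias;
}
```

A caller could use the return value the way the patch uses its
`downcase' flag, that is, to force downcasing of the all-caps alias.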

Now the important question I promised at the beginning: Should we
install this change on the release branch?  Here are the pros and cons
that I could think of for this decision:

Cons:

  . It is too close to release for such non-trivial changes.

  . The affected primitives are used in lots of places, and this
    change could break them, and the Lisp code that uses them.

  . This problem has existed in Emacs for a long time, so it's not a
    big deal if it continues to exist a while longer (until resolved
    in Emacs 23).

  . The suggested solution is only partial, and the resulting file
    names are UGLY.

Pros:

  . The bug is quite grave: it causes real data loss.

  . Whatever code uses the affected primitives is probably already
    broken.

  . The change is very simple, so the probability of it being buggy
    is very low (but please eyeball the diffs below to make it lower
    still).

Yidong and Stefan, please decide whether the change below should be
installed on the release branch.

2008-03-22  Eli Zaretskii  <address@hidden>

        * w32.c (readdir): If FindFirstFile/FindNextFile return in
        cFileName a file name that includes `?' characters, use the 8+3
        alias in cAlternateFileName instead.

Index: src/w32.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/w32.c,v
retrieving revision 1.130
diff -u -p -r1.130 w32.c
--- src/w32.c   24 Feb 2008 10:09:03 -0000      1.130
+++ src/w32.c   22 Mar 2008 11:51:07 -0000
@@ -1889,6 +1889,8 @@ closedir (DIR *dirp)
 struct direct *
 readdir (DIR *dirp)
 {
+  int downcase = !NILP (Vw32_downcase_file_names);
+
   if (wnet_enum_handle != INVALID_HANDLE_VALUE)
     {
       if (!read_unc_volume (wnet_enum_handle,
@@ -1923,14 +1925,23 @@ readdir (DIR *dirp)
      value returned by stat().  */
   dir_static.d_ino = 1;
 
+  strcpy (dir_static.d_name, dir_find_data.cFileName);
+
+  /* If the file name in cFileName[] includes `?' characters, it means
+     the original file name used characters that cannot be represented
+     by the current ANSI codepage.  To avoid total lossage, retrieve
+     the short 8+3 alias of the long file name.  */
+  if (_mbspbrk (dir_static.d_name, "?"))
+    {
+      strcpy (dir_static.d_name, dir_find_data.cAlternateFileName);
+      downcase = 1;    /* 8+3 aliases are returned in all caps */
+    }
+  dir_static.d_namlen = strlen (dir_static.d_name);
   dir_static.d_reclen = sizeof (struct direct) - MAXNAMLEN + 3 +
     dir_static.d_namlen - dir_static.d_namlen % 4;
-
-  dir_static.d_namlen = strlen (dir_find_data.cFileName);
-  strcpy (dir_static.d_name, dir_find_data.cFileName);
   if (dir_is_fat)
     _strlwr (dir_static.d_name);
-  else if (!NILP (Vw32_downcase_file_names))
+  else if (downcase)
     {
       register char *p;
       for (p = dir_static.d_name; *p; p++)



