
Re: [mtools] Short filenames, codepages and possible mtools/kernel bug


From: Alain Knaff
Subject: Re: [mtools] Short filenames, codepages and possible mtools/kernel bug
Date: Mon, 29 May 2006 12:57:35 +0200
User-agent: Thunderbird 1.5 (X11/20051201)

David C Niemi wrote:

This probably goes back to a shortcut I took in 1994 to get VFAT working on Mtools. As you may know VFAT uses 16-bit Unicode, and I assumed that the high bits would always be zero. So there's no support for special code pages unless Alain has since added it.

However, mounting the floppy with the kernel MSDOS file system support is a totally separate implementation, by different people.

DCN

Nope, I didn't add any full Unicode support since then...

However, the problem is still weird, because for Ç you don't need Unicode. ISO-8859-1, which is supported, should be enough.

VFAT uses a constant 2-byte format for its Unicode (UCS-2?), and in this representation, all ISO-8859-1 characters (which include Ç) have their high byte equal to zero.

The same is not true of the variable-length Unicode encoding (UTF-8), which encodes every character from 0x80 to 0xff as two bytes.
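
To make the difference concrete, here is a small Python sketch (not mtools code) that prints the bytes for Ç in each of the three encodings. It uses Python's "utf-16-le" codec as a stand-in for the on-disk 2-byte format, which is an assumption that holds for characters in this range.

```python
# Compare how three encodings represent the same character.
ch = "Ç"  # U+00C7, LATIN CAPITAL LETTER C WITH CEDILLA

latin1 = ch.encode("iso-8859-1")  # one byte per character
ucs2le = ch.encode("utf-16-le")   # two bytes, low byte first (stand-in for VFAT's format)
utf8 = ch.encode("utf-8")         # variable length

print("ISO-8859-1:", latin1.hex(" "))
print("2-byte LE :", ucs2le.hex(" "))
print("UTF-8     :", utf8.hex(" "))

# Every ISO-8859-1 character has a zero high byte in the 2-byte form.
high_bytes_zero = all(
    bytes([b]).decode("iso-8859-1").encode("utf-16-le")[1] == 0
    for b in range(256)
)
print("high byte always zero for ISO-8859-1:", high_bytes_zero)
```

The low byte of the 2-byte form is identical to the ISO-8859-1 byte, which is why assuming "high bits are zero" still works for Ç.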


I tried reproducing the problem here, but I do get a Ç as I should.

[...]
Then I swap over to Linux, and run "mdir a:". What I now see is:

AB�DE    TXT         0 2006-05-28  16:00  AB�DE.TXT
       1 file                    0 bytes
                         1 457 664 bytes free

It's not necessarily an mtools problem, it could also be a terminal (konsole, gterm, ...) issue.

Try doing mdir a: | hexdump -C

If you see C7 for the Ç, it is ok (and the mess-up only happened on display). If you see something else, then it is indeed an mtools bug.


(the capital C cedilla has been replaced by a tiny white question mark
inside a black diamond/lozenge). Just to check, I mount the filesystem
using the following command:

mount -t msdos -o codepage=850 /dev/fd0 temp

(Try mount -t vfat instead to get long names and extended characters.)


Then, ls shows me a question mark where the capital C cedilla should be.

That's an ls issue (not an msdos/vfat filesystem issue). ls replaces, _on_display_, those characters that it thinks are unprintable with question marks. Depending on your settings (LANG, LC_CTYPE and LC_ALL environment variables), ls may think that the Ç is an unprintable character, and replace it with a question mark. This even happens on native Linux filesystems (reiserfs, etc...). Try it by creating a file with a Ç in its name, and then doing ls.

I've found that with LC_ALL=en_US, the Ç is displayed correctly.

If that doesn't help, try ls -b instead. ls -b substitutes "unprintable" characters with their octal code (should be \307 in the case of Ç).
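
As a rough illustration of where \307 comes from, the following Python sketch imitates the octal escaping for bytes outside printable ASCII. The function name escape_like_ls_b is made up for this example, and it is a simplification: the real ls -b also escapes spaces and backslashes and consults the locale, which this sketch does not.

```python
def escape_like_ls_b(raw: bytes) -> str:
    """Render bytes outside printable ASCII as backslash + three octal digits.

    Hypothetical helper, only a simplified imitation of `ls -b`.
    """
    out = []
    for b in raw:
        if 0x20 <= b < 0x7F:
            out.append(chr(b))        # printable ASCII is passed through
        else:
            out.append(f"\\{b:03o}")  # e.g. 0xC7 -> \307
    return "".join(out)


# File name as stored in ISO-8859-1: Ç is the single byte 0xC7.
name_bytes = "ABÇDE.TXT".encode("iso-8859-1")
escaped = escape_like_ls_b(name_bytes)
print(escaped)
```

If ls -b shows \307 there, the name is stored as ISO-8859-1. If it shows a two-byte sequence such as \303\207, the name is stored as UTF-8.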

[...]
25F0  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00  ................
2600  E5 41 00 42  00 C7 00 44   00 45 00 0F  00 19 2E 00  .A.B...D.E......
                      ^^

The C7 is the proper (unicode, iso-8859-1) representation for Ç, so everything should be ok there...


2610  54 00 58 00  54 00 00 00   FF FF 00 00  FF FF FF FF  T.X.T...........
2620  E5 42 80 44  45 20 20 20   54 58 54 20  00 30 03 80  .B.DE   TXT .0..
2630  BC 34 BC 34  00 00 04 80   BC 34 00 00  00 00 00 00  .4.4.....4......
2640  41 41 00 42  00 C7 00 44   00 45 00 0F  00 19 2E 00  AA.B...D.E......
2650  54 00 58 00  54 00 00 00   FF FF 00 00  FF FF FF FF  T.X.T...........
2660  41 42 80 44  45 20 20 20   54 58 54 20  00 30 03 80  AB.DE   TXT .0..
2670  BC 34 BC 34  00 00 04 80   BC 34 00 00  00 00 00 00  .4.4.....4......
2680  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00  ................
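
For readers who want to check the dump by hand, here is a Python sketch that decodes the long-name entry at offset 2640 and verifies its checksum against the short-name entry at 2660. It assumes the commonly documented VFAT long-name entry layout (name characters at byte offsets 1-10, 14-25 and 28-31, attribute 0x0F at offset 11, checksum at offset 13). The function names are made up for this example, and none of this is taken from the mtools source.

```python
import struct

# The two 32-byte directory entries at offsets 2640 and 2660 of the dump.
lfn_entry = bytes.fromhex(
    "41 41 00 42 00 C7 00 44 00 45 00 0F 00 19 2E 00 "
    "54 00 58 00 54 00 00 00 FF FF 00 00 FF FF FF FF"
)
short_entry = bytes.fromhex(
    "41 42 80 44 45 20 20 20 54 58 54 20 00 30 03 80 "
    "BC 34 BC 34 00 00 04 80 BC 34 00 00 00 00 00 00"
)


def decode_lfn_name(entry: bytes) -> str:
    """Collect the 13 two-byte characters of one long-name entry."""
    raw = entry[1:11] + entry[14:26] + entry[28:32]
    units = struct.unpack("<13H", raw)  # little-endian 16-bit code units
    chars = []
    for u in units:
        if u == 0x0000:  # terminator; 0xFFFF padding follows it
            break
        chars.append(chr(u))
    return "".join(chars)


def short_name_checksum(name11: bytes) -> int:
    """Rotate-right-and-add checksum over the 11 short-name bytes."""
    total = 0
    for c in name11:
        total = (((total & 1) << 7) + (total >> 1) + c) & 0xFF
    return total


long_name = decode_lfn_name(lfn_entry)
checksum = short_name_checksum(short_entry[:11])

print("attribute byte :", hex(lfn_entry[11]))
print("long name      :", long_name)
print("stored checksum:", hex(lfn_entry[13]))
print("computed       :", hex(checksum))
```

The stored and computed checksums agree, so the long-name entry (with C7 00) and the short-name entry (with 80) do describe the same file.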

I assume that the "80"s in between the "42"s and the "44"s are my
missing capital C cedillas (both codepages 437 and 850 list the capital
C cedilla as occupying point 80 hex).

The 80 is indeed very confusing. At first it confused me too, as I assumed it was an "unknown character" placeholder.

However, after further analysis, I noticed that 0x80 is indeed the correct legacy MS-DOS code for Ç, as surprising as it sounds. (MS-DOS didn't use standard ISO-8859-1, but its own proprietary encoding, as specified by the codepage...)

If you use a different example than Ç (such as é), you see a different code there.
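
This can be checked with Python's built-in cp437 and cp850 codecs, used here as a stand-in for the DOS codepage tables. The sketch below looks up both characters and then decodes the 11-byte short name from the dump.

```python
# Look up the same characters in the two DOS codepages mentioned above.
lookups = {}
for codepage in ("cp437", "cp850"):
    c_cedilla = bytes([0x80]).decode(codepage)  # what byte 0x80 means
    e_acute = "é".encode(codepage)              # where é lives
    lookups[codepage] = (c_cedilla, e_acute)
    print(codepage, "0x80 ->", c_cedilla, "| é ->", e_acute.hex())

# The 8.3 short name from the directory entry at offset 2660.
raw_short = bytes.fromhex("41 42 80 44 45 20 20 20 54 58 54")
base = raw_short[:8].decode("cp850").rstrip()  # name part, space padded
ext = raw_short[8:].decode("cp850").rstrip()   # extension part
short_name = f"{base}.{ext}" if ext else base
print(short_name)
```

Both codepages agree for these two characters, so the choice of codepage=850 versus 437 does not matter for this particular file name.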


In case it helps, I've left a truncated binary disk image of the
diskette here:
http://www.carbon.eclipse.co.uk/msdosfs.diskImage

Just tried to do an mdir on it... and indeed, it showed me Ç:

> mdir -i msdosfs.diskImage ::
 Volume in drive : has no label
 Volume Serial Number is 2C2F-5EDB
Directory for ::/

ABÇDE    TXT         0 2006-05-28  16:00  ABÇDE.TXT
        1 file                    0 bytes
                          1 457 664 bytes free


Could anyone please tell me whether this is my error, or is it a bug
(possibly in mtools, possibly in the kernel)?

I suspect the error might be in the terminal program that you are using, which might be set to display UTF-8. Try changing that to ISO-8859-1 (a.k.a. ISO Latin-1).
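
A minimal Python sketch of that suspicion, assuming the terminal behaves like a decoder that substitutes a replacement character for invalid input: the same bytes are interpreted once as UTF-8 and once as ISO-8859-1.

```python
# Bytes as mdir would emit them if the name is output in ISO-8859-1.
output_bytes = "ABÇDE.TXT".encode("iso-8859-1")

# A lone 0xC7 is not valid UTF-8, so a UTF-8 decoder substitutes
# U+FFFD (REPLACEMENT CHARACTER) for it.
seen_as_utf8 = output_bytes.decode("utf-8", errors="replace")

# An ISO-8859-1 decoder maps 0xC7 straight to Ç.
seen_as_latin1 = output_bytes.decode("iso-8859-1")

print(ascii(seen_as_utf8))
print(seen_as_latin1)
```

U+FFFD is usually drawn as a question mark inside a black diamond, which matches the symptom described earlier.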

Regards,

Alain

