
[Monotone-devel] Re: problems with i18n testsuite


From: graydon hoare
Subject: [Monotone-devel] Re: problems with i18n testsuite
Date: Fri, 23 Apr 2004 12:03:56 -0400
User-agent: Mozilla Thunderbird 0.5 (X11/20040208)

Robert Bihlmeyer wrote:

> Basically, I want to make the point: the LC_CTYPE of your shell need
> not match the charset of all your filenames, or the charset of all
> your files' contents. And there is no other way to infer a "local
> charset".

ok. currently monotone treats LC_CTYPE as the charset for file *names*, but by default does no conversion of the file's internal bytes.

> I'm still unclear on what you do with file content. Do you convert
> from whatever you assume as the local charset to UTF-8 for storage and
> hash computation? Wouldn't that fail horribly for non-text content?

no, we don't convert the bytes inside files by default. we provide a place for users to specify a conversion if they want one to happen, but by default that conversion is empty. the only conversion we do by default is manifest pathname <-> filesystem pathname, and that is text content (very regular text content, in fact).
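
a minimal sketch of that manifest pathname <-> filesystem pathname round trip, written in Python for illustration rather than taken from monotone's source. the function names and the explicit `local_charset` parameter are assumptions; in practice the charset would come from LC_CTYPE as described above.

```python
def fs_to_manifest(fs_name: bytes, local_charset: str) -> bytes:
    """Filesystem pathname (bytes in the local charset) -> UTF-8 manifest pathname."""
    # Strict decoding: a name that is not valid in the local charset raises
    # UnicodeDecodeError instead of being stored in a mangled form.
    return fs_name.decode(local_charset, errors="strict").encode("utf-8")


def manifest_to_fs(manifest_name: bytes, local_charset: str) -> bytes:
    """UTF-8 manifest pathname -> filesystem pathname in the local charset."""
    # Raises UnicodeEncodeError if the local charset cannot represent the name.
    text = manifest_name.decode("utf-8", errors="strict")
    return text.encode(local_charset, errors="strict")
```

only pathnames pass through these functions; file contents are never touched by them.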

> I'd really like version control systems to get out of the text
> conversion business. Either your editor handles that, or you hang
> appropriate tools on pre-checkin and post-checkout hooks.

I mostly agree with you here. as I said, we largely punt this issue to hooks and try not to enforce any specific conversions. as of the win32 branch -- where I noticed we were doing it wrong -- files are always opened in binary mode and no conversion is done.
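
to illustrate what "binary mode and no conversion" means for hashing, here is a small Python sketch. it is not monotone's code; the name `content_id` and the choice of SHA-1 as the hash are assumptions of the sketch.

```python
import hashlib


def content_id(path: str) -> str:
    """Hash a file's raw bytes; nothing is decoded or translated on the way."""
    h = hashlib.sha1()  # assumed hash function for this sketch
    # "rb" disables newline translation and charset decoding, so CRLF
    # sequences and non-text bytes are hashed exactly as stored on disk.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

because the file is read as bytes, the same content yields the same id regardless of platform or locale.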

the only thing we need to be sure about wrt. pathnames is that the files whose content monotone itself interprets (MT/manifest, MT/work, .mt-attrs) are in UTF-8. monotone takes those files apart and evaluates them; it reads their contents semantically. it needs to be able to match regexes against the bytes it finds in a manifest. we'd have to go through a lot more contortions if these control files could be in non-UTF-8 charsets.
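
a sketch of why a single known charset keeps this simple. the line layout assumed here (40 hex digits, two spaces, then the pathname) and the names `MANIFEST_LINE` and `parse_manifest` are illustrative assumptions, not a statement of monotone's actual parser.

```python
import re

# Assumed line layout for this sketch: 40 hex digits, two spaces, pathname.
MANIFEST_LINE = re.compile(r"^([0-9a-f]{40})  (.+)$")


def parse_manifest(raw: bytes) -> dict:
    """Map pathname -> hash for every entry in a UTF-8 manifest."""
    # Decoding once, strictly, means the regex below never has to guess a
    # charset; non-UTF-8 input raises UnicodeDecodeError here.
    text = raw.decode("utf-8")
    entries = {}
    for line in text.split("\n"):
        if not line:
            continue  # skip the empty string after the final newline
        m = MANIFEST_LINE.match(line)
        if m is None:
            raise ValueError("malformed manifest line: %r" % line)
        entries[m.group(2)] = m.group(1)
    return entries
```

if the control file could arrive in an arbitrary charset, the decode step would need extra information from somewhere before any matching could start.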

but that's really all we need. the decision to externalize those path names in the LC_CTYPE charset is just a convenience for mapping to and from UTF-8 in an environment which doesn't understand UTF-8. the convention is certainly not cast in stone. if you prefer we can make it overridable by a hook, or even default to a hook which normally returns UTF-8 too. I just want people with non-UTF-8 "legacy" systems to be somewhat comfortable, and was under the impression that LC_CTYPE would usually hold their preferred representation.
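
one way the overridable version could look, sketched in Python. the hook name `get_path_charset` and the `hooks` dictionary are hypothetical, invented for this example; no such hook exists yet.

```python
import codecs
import locale


def default_path_charset() -> str:
    # Python's view of the user's preferred encoding; on POSIX systems this
    # normally follows LC_CTYPE.
    return locale.getpreferredencoding(False)


def path_charset(hooks: dict) -> str:
    """Charset used to externalize pathnames, overridable by a user hook."""
    hook = hooks.get("get_path_charset")  # hypothetical hook name
    name = hook() if hook is not None else default_path_charset()
    # codecs.lookup raises LookupError for an unknown charset and returns a
    # normalized codec name otherwise.
    return codecs.lookup(name).name
```

a user who wants UTF-8 pathnames regardless of locale would install a hook that returns "UTF-8"; with no hook installed the locale's charset is used.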

-graydon



