gnu-arch-users

Re: [Gnu-arch-users] Encoding handling proposal


From: Alexey N. Solofnenko
Subject: Re: [Gnu-arch-users] Encoding handling proposal
Date: Sun, 29 Aug 2004 14:18:48 -0700
User-agent: Mozilla Thunderbird 0.7+ (Windows/20040825)

There can be a much simpler approach:
A) We still need metadata (and, if it is mutable, it should be versioned).
B) Text files can have an optional encoding attribute (or the system can guess the attribute during the first import).
C) All client computers have files exactly as in the repository.
D) "smart" patch/diff/merge use the encoding attribute to correctly calculate differences between two files (of the same encoding!).

As long as content encoding is never changed on the fly, this works without any problems. It may even be a good idea to disallow changing a file's content encoding at all, because a file with a new content encoding is essentially a new file, even though it can still look similar to the previous one.
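
Point D can be sketched roughly as follows. This is only an illustration in Python, not anything arch provides: the function name `encoding_aware_diff` and its parameters are assumptions, and `encoding` stands for the per-file encoding attribute from point B.

```python
import difflib


def encoding_aware_diff(old_bytes, new_bytes, encoding,
                        old_name="old", new_name="new"):
    """Diff two revisions of one file that share a single declared encoding.

    Hypothetical helper: `encoding` is the file's encoding attribute.
    Both revisions are decoded with it, so the comparison runs over
    characters and lines rather than raw bytes.
    """
    # A wrong attribute surfaces here as UnicodeDecodeError.
    old_lines = old_bytes.decode(encoding).splitlines(keepends=True)
    new_lines = new_bytes.decode(encoding).splitlines(keepends=True)
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_name, tofile=new_name))


if __name__ == "__main__":
    old = "caf\u00e9\nbar\n".encode("utf-16-le")
    new = "caf\u00e9\nbaz\n".encode("utf-16-le")
    for line in encoding_aware_diff(old, new, "utf-16-le"):
        print(line, end="")
```

The files on disk are never transcoded here; the attribute is consulted only while computing the difference, which is the whole point of C and D.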

- Alexey.


Marcus Sundman wrote:

Here is my proposal of how *I* think a CM system should handle the "encoding issue" and some related issues. You may have a different opinion, and if you do it'd be nice to hear it, but no trolling, please.
(See the "Notes" section below for comments regarding each point.)

A) There should be support for both mandatory and optional metadata attributes associated with each file in the repository.

B) "Content-Type" should be a mandatory metadata string attribute.

C) "Auto-Filter" should be a mandatory metadata boolean attribute.

D) There should be a filter/plugin architecture to enable transcoding of files on input and output, based on their content-types, user settings and user-provided parameters.

E) Utilities such as "diff", "merge" and "annotate" (aka "blame") should be provided by plugins mapped to content-types.

F) Commit comments and other string attributes should use UTF-8.

G) Filenames and paths should use UTF-8 in the repository, and be transcoded to the proper encoding when a client accesses the local file system.


Notes:

A) There is already some mandatory metadata associated with each file. One such attribute is the name of the file.

B) The MIME Content-Type is defined mainly in RFC 2045 and RFC 2046.
All text/* types may include the "charset" parameter (MIME defines "charset" as "character encoding" and not just as a simple character set), and if absent it is assumed to be "us-ascii" (i.e. "ANSI X3.4-1986 as 8 bits/char with the most significant bit set to 0 (zero)"), as per RFC 2046. This is a very common and established standard used in many different systems including, but not limited to, file managers, http and email.

C) If Auto-Filter is set to "true" then content transcoding will occur between the repository and the local system. If it is set to "false" then no transcoding is done. Each project may have its own default Auto-Filter values for different file types.

D) Since editors and other programmers' tools tend to use whatever the local system encoding happens to be, and a project might include people with different systems, most text files need some transcoding. The contents of files whose "Auto-Filter" attribute is set to "true" will be stored UTF-8 encoded with U+2028 newlines in the repository and transcoded from/to the local encoding and local newlines on input/output. The contents of files whose "Auto-Filter" attribute is set to "false" will not be transcoded on input/output. Often the proper local encoding and line breaks can be detected automatically, but the user should be able to override the auto-detection in his settings and/or by a parameter to the CM client.

E) E.g. if two files with the content-type "application/vnd.sun.xml.writer" are diffed the system should use a diff plugin that knows how to interpret OpenOffice.org Writer documents. If no such plugin is found it defaults to the standard diff which regards the files as byte blobs.

F) UTF-8 should be used for communication between the client and the server. Internally the server might store the strings in any encoding it wants in the repository, but I'd recommend keeping them UTF-8 encoded for simplicity and consistency.

G) Each character in a file name/path that cannot be transcoded to the target file system encoding should be replaced with the character sequence "{uN}" where N is the hexadecimal Unicode code point (e.g. a file named "hello<>world" would be named "hello{u3C}{u3E}world" on Windows). This results in the limitation that filenames must not contain a character sequence matched by the regexp pattern "\{u[0-9A-Fa-f]+\}". Whenever a filename or path is used in a URI the UTF-8 bytes should be properly URI-encoded. Often the proper local encoding can be detected automatically, but the user should be able to override the auto-detection in his settings and/or by a parameter to the CM client. Internally the server might store the strings in any encoding it wants in the repository, but I'd recommend keeping them UTF-8 encoded for simplicity and consistency.
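
To make note G concrete, here is a rough Python sketch of the "{uN}" escaping and of URI-encoding a path. All names (`escape_filename`, `unescape_filename`, `uri_path`, `windows_ok`) are made up for illustration, and `windows_ok` is only an assumed, simplified test for what a Windows file system accepts.

```python
import re
from urllib.parse import quote

# The pattern from note G: names containing it cannot be stored safely.
ESCAPE_PATTERN = re.compile(r"\{u([0-9A-Fa-f]+)\}")


def escape_filename(name, representable):
    """Replace each character the target file system cannot hold with {uN}."""
    if ESCAPE_PATTERN.search(name):
        raise ValueError("file name already contains an escape-like sequence")
    return "".join(ch if representable(ch) else "{u%X}" % ord(ch)
                   for ch in name)


def unescape_filename(name):
    """Turn every {uN} sequence back into the character it stands for."""
    return ESCAPE_PATTERN.sub(lambda m: chr(int(m.group(1), 16)), name)


def uri_path(path):
    """URI-encode the UTF-8 bytes of a repository path, keeping '/' as is."""
    return quote(path, safe="/")


def windows_ok(ch):
    # Assumption: a simplified list of characters Windows rejects in names.
    return ch not in '<>:"/\\|?*' and ord(ch) >= 32


if __name__ == "__main__":
    local = escape_filename("hello<>world", windows_ok)
    print(local)                      # hello{u3C}{u3E}world
    print(unescape_filename(local))   # hello<>world
    print(uri_path("dir/h\u00e9llo.txt"))  # dir/h%C3%A9llo.txt
```

Rejecting names that already match the pattern is what keeps the mapping reversible; without that check, unescaping could not tell a literal "{u3C}" from an escaped "<".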


Notice that there is no distinction between "text files" and "binary files". The same system that converts between different text encodings might just as well be used to convert between different "raw" audio formats. Just add the appropriate plugin/filter and you're set.
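
A rough Python sketch of how such a filter registry might look, combining notes C and D. Everything here (`FILTERS`, `register`, `check_in`, `check_out`, the `local_encoding` and `local_newline` settings) is a hypothetical name chosen for illustration; only the stored form, UTF-8 with U+2028 newlines, comes from the proposal.

```python
FILTERS = {}  # hypothetical registry: content-type -> (to_repository, to_local)


def register(content_type, to_repository, to_local):
    FILTERS[content_type.lower()] = (to_repository, to_local)


def _base_type(content_type):
    # Drop parameters such as "; charset=..." and normalise the case.
    return content_type.split(";")[0].strip().lower()


def text_to_repository(data, local_encoding, local_newline):
    text = data.decode(local_encoding)
    return text.replace(local_newline, "\u2028").encode("utf-8")


def text_to_local(data, local_encoding, local_newline):
    text = data.decode("utf-8")
    return text.replace("\u2028", local_newline).encode(local_encoding)


register("text/plain", text_to_repository, text_to_local)


def check_in(data, content_type, auto_filter, **settings):
    """Transcode local bytes for storage, or pass them through untouched."""
    plugin = FILTERS.get(_base_type(content_type))
    if not auto_filter or plugin is None:
        return data
    return plugin[0](data, **settings)


def check_out(data, content_type, auto_filter, **settings):
    """Transcode stored bytes for the local system, or pass them through."""
    plugin = FILTERS.get(_base_type(content_type))
    if not auto_filter or plugin is None:
        return data
    return plugin[1](data, **settings)
```

The registry itself knows nothing about text: a pair of functions registered for a raw audio type would go through the same `check_in`/`check_out` path, and any type without a plugin is stored byte for byte.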


- Marcus Sundman


_______________________________________________
Gnu-arch-users mailing list
address@hidden
http://lists.gnu.org/mailman/listinfo/gnu-arch-users

GNU arch home page:
http://savannah.gnu.org/projects/gnu-arch/



