[Gnu-arch-users] Encoding handling proposal


From: Marcus Sundman
Subject: [Gnu-arch-users] Encoding handling proposal
Date: Sun, 29 Aug 2004 15:13:28 +0300
User-agent: KMail/1.6.2

Here is my proposal for how *I* think a CM system should handle the "encoding 
issue" and some related issues. You may have a different opinion, and if 
you do it'd be nice to hear it, but no trolling, please.
(See the "Notes" section below for comments regarding each point.)

A) There should be support for both mandatory and optional metadata 
attributes associated with each file in the repository.

B) "Content-Type" should be a mandatory metadata string attribute.

C) "Auto-Filter" should be a mandatory metadata boolean attribute.

D) There should be a filter/plugin architecture to enable transcoding of 
files on input and output based on their content-types, user settings 
and user-provided parameters.

E) Utilities such as "diff", "merge" and "annotate" (aka "blame") should be 
provided by plugins mapped to content-types.

F) Commit comments and other string attributes should use UTF-8.

G) Filenames and paths should use UTF-8 in the repository, and be transcoded 
to the proper encoding when a client accesses the local file system.


Notes:

A) There is already some mandatory metadata associated with each file. One 
such attribute is the name of the file.

B) The MIME Content-Type is defined mainly in RFC 2045 and RFC 2046.
All text/* types may include the "charset" parameter (MIME defines "charset" 
as "character encoding" and not just as a simple character set), and if 
absent it is assumed to be "us-ascii" (i.e. "ANSI X3.4-1986 as 8 bits/char 
with the most significant bit set to 0 (zero)"), as per RFC 2046 (which 
states this default for text/plain).
This is a very common and established standard used in many different 
systems including, but not limited to, file managers, HTTP and email.
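
As a rough illustration of note B, the charset lookup could be sketched in 
Python with the standard library's email package. The function name 
effective_charset is made up for this example and is not part of any 
existing CM tool; the "us-ascii" fallback for every text/* type simply 
follows the note above.

```python
from email.message import Message


def effective_charset(content_type):
    """Return the charset of a text/* Content-Type value, or None otherwise.

    Falls back to "us-ascii" when a text/* type carries no charset
    parameter, which is the default described in note B.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_maintype() != "text":
        return None  # this sketch only looks at charset for text/* types
    # get_content_charset() lowercases the value and returns the fallback
    # when the parameter is missing.
    return msg.get_content_charset("us-ascii")
```

With this sketch, "text/plain; charset=ISO-8859-1" yields "iso-8859-1", a 
bare "text/html" yields "us-ascii", and a non-text type yields None.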

C) If Auto-Filter is set to "true" then content transcoding will occur 
between the repository and the local system. If it is set to "false" then 
no transcoding is done.
Each project may have its own default Auto-Filter values for different file 
types.
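
One possible shape for such per-project defaults, sketched in Python. The 
table PROJECT_AUTO_FILTER_DEFAULTS, its entries and the function name are 
all assumptions made for illustration; the proposal does not specify how 
the defaults would be stored or matched.

```python
from fnmatch import fnmatchcase

# Hypothetical per-project defaults: the first matching pattern wins.
PROJECT_AUTO_FILTER_DEFAULTS = [
    ("text/*", True),
    ("application/xml", True),
    ("*", False),
]


def default_auto_filter(content_type, defaults=PROJECT_AUTO_FILTER_DEFAULTS):
    """Return the project's default Auto-Filter value for a content type."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    base_type = content_type.split(";", 1)[0].strip().lower()
    for pattern, value in defaults:
        if fnmatchcase(base_type, pattern):
            return value
    return False  # nothing matched: leave the file untouched
```

The per-file "Auto-Filter" attribute would still override whatever default 
this lookup returns.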

D) Since editors and other programmers' tools tend to use whatever the local 
system encoding happens to be, and a project might include people with 
different systems, most text files need some transcoding.
The contents of files whose "Auto-Filter" attribute is set to "true" will be 
stored UTF-8 encoded with U+2028 newlines in the repository and transcoded 
from/to the local encoding and local newlines on input/output. The contents 
of files whose "Auto-Filter" attribute is set to "false" will not be 
transcoded on input/output.
Often the proper local encoding and line breaks can be detected 
automatically, but the user should be able to override the auto-detection 
in his settings and/or by a parameter to the cm client.
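
A minimal Python sketch of the transcoding step in note D, assuming the 
local encoding and newline style have already been detected or supplied by 
the user. The function names and the REPO_NEWLINE constant are invented for 
this example.

```python
REPO_NEWLINE = "\u2028"  # LINE SEPARATOR, the repository newline from note D


def to_repository(local_bytes, local_encoding, local_newline):
    """Transcode working-copy bytes to repository form (UTF-8, U+2028)."""
    text = local_bytes.decode(local_encoding)
    return text.replace(local_newline, REPO_NEWLINE).encode("utf-8")


def from_repository(repo_bytes, local_encoding, local_newline):
    """Transcode repository bytes to the local encoding and newlines."""
    text = repo_bytes.decode("utf-8")
    return text.replace(REPO_NEWLINE, local_newline).encode(local_encoding)
```

A file committed from a Latin-1 system with CRLF newlines can then be 
checked out on a UTF-16 system with LF newlines by calling from_repository 
with different arguments. Two caveats of this sketch: a local file that 
already contains a literal U+2028 would not round-trip exactly, and files 
whose "Auto-Filter" attribute is "false" would bypass both functions.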

E) E.g. if two files with the content-type "application/vnd.sun.xml.writer" 
are diffed the system should use a diff plugin that knows how to interpret 
OpenOffice.org Writer documents. If no such plugin is found it defaults to 
the standard diff which regards the files as byte blobs.
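
The lookup-with-fallback in note E could be sketched in Python like this. 
The registry DIFF_PLUGINS, the register_diff decorator and the plugin 
signatures are hypothetical, and blob_diff is a deliberately simplified 
stand-in for a real byte-level diff.

```python
import difflib

DIFF_PLUGINS = {}  # content type -> diff function (hypothetical registry)


def register_diff(content_type):
    """Decorator that registers a diff plugin for one content type."""
    def decorator(func):
        DIFF_PLUGINS[content_type] = func
        return func
    return decorator


def blob_diff(old, new):
    """Fallback: treat both files as byte blobs and report if they differ."""
    return [] if old == new else ["Binary files differ"]


@register_diff("text/plain")
def text_diff(old, new):
    """Line-based unified diff for UTF-8 text."""
    old_lines = old.decode("utf-8").splitlines()
    new_lines = new.decode("utf-8").splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, lineterm=""))


def diff(content_type, old, new):
    """Use the plugin registered for content_type, else fall back."""
    return DIFF_PLUGINS.get(content_type, blob_diff)(old, new)
```

Diffing two "application/vnd.sun.xml.writer" files here falls through to 
blob_diff until someone registers a plugin for that type. "merge" and 
"annotate" could use the same kind of registry.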

F) UTF-8 should be used for communication between the client and the server. 
Internally the server might store the strings in any encoding it wants in 
the repository, but I'd recommend keeping them UTF-8 encoded for simplicity 
and consistency.

G) Each character in a file name/path that cannot be transcoded to the 
target file system encoding should be replaced with the character sequence 
"{uN}" where N is the hexadecimal Unicode code point (e.g. a file named 
"hello<>world" would be named "hello{u3C}{u3E}world" on Windows). This 
results in the limitation that filenames must not contain a character 
sequence matched by the regexp pattern "\{u[0-9A-Fa-f]+\}".
Whenever a filename or path is used in a URI the UTF-8 bytes should be 
properly URI-encoded.
Often the proper local encoding can be detected automatically, but the user 
should be able to override the auto-detection in his settings and/or by a 
parameter to the cm client.
Internally the server might store the strings in any encoding it wants in 
the repository, but I'd recommend keeping them UTF-8 encoded for simplicity 
and consistency.
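
A Python sketch of the "{uN}" escaping and the URI encoding from note G. 
The function names are made up for this example. The "forbidden" parameter 
is also an assumption: the "hello<>world" example replaces characters that 
Windows disallows in names rather than characters that fail to encode, so 
the sketch lets the caller list such characters explicitly.

```python
import re
from urllib.parse import quote

ESCAPE_RE = re.compile(r"\{u([0-9A-Fa-f]+)\}")


def to_local_name(repo_name, fs_encoding, forbidden=""):
    """Escape characters the target file system cannot represent as {uN}."""
    out = []
    for ch in repo_name:
        try:
            ch.encode(fs_encoding)
            representable = ch not in forbidden
        except UnicodeEncodeError:
            representable = False
        out.append(ch if representable else "{u%X}" % ord(ch))
    return "".join(out)


def to_repo_name(local_name):
    """Reverse the escaping: turn each {uN} sequence back into a character."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), local_name)


def to_uri_path(repo_path):
    """Percent-encode the UTF-8 bytes of a repository path for a URI."""
    return quote(repo_path, safe="/")
```

For instance, to_local_name("hello<>world", "utf-8", forbidden='<>') gives 
"hello{u3C}{u3E}world", and to_repo_name turns that back into the original 
name. The reverse mapping is only unambiguous because of the restriction 
that repository filenames never contain a literal "{uN}" sequence.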


Notice that there is no distinction between "text files" and "binary files". 
The same system that converts between different text encodings might just 
as well be used to convert between different "raw" audio formats. Just add 
the appropriate plugin/filter and you're set.


- Marcus Sundman



