Re: [Gnu-arch-users] Encoding handling proposal

gnu-arch-users

From:	Alexey N. Solofnenko
Subject:	Re: [Gnu-arch-users] Encoding handling proposal
Date:	Sun, 29 Aug 2004 14:18:48 -0700
User-agent:	Mozilla Thunderbird 0.7+ (Windows/20040825)

There can be a much simpler approach:

A) We still need metadata (and, if it is mutable, it should be versioned).

B) Text files can have an optional encoding attribute (or, during the first import, the attribute can be guessed by the system).

C) All client computers have files exactly as in the repository.

D) "Smart" patch/diff/merge use the encoding attribute to correctly calculate differences between two files (of the same encoding!).

As long as the content encoding is never changed on the fly, this works without any problems. Maybe it is a good idea not to allow changing the content encoding at all, because a file with a new content encoding is essentially a new file, even though it can still look similar to the previous one.
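
Point D could be sketched roughly as follows. This is only an illustration, assuming the encoding attribute is available as a plain codec name; the function name `encoding_aware_diff` and the sample data are hypothetical and not part of arch or of the proposal.

```python
import difflib


def encoding_aware_diff(old_bytes, new_bytes, encoding):
    """Diff two versions of a file that share one encoding attribute."""
    # Decode with the recorded encoding so the comparison works on
    # characters and lines rather than on raw bytes.
    old_lines = old_bytes.decode(encoding).splitlines(keepends=True)
    new_lines = new_bytes.decode(encoding).splitlines(keepends=True)
    return list(difflib.unified_diff(old_lines, new_lines, "old", "new"))


# Hypothetical sample: both versions are stored as UTF-16 (little endian),
# which a byte-oriented, newline-based diff would not split into lines.
old = "naïve\ncafé\n".encode("utf-16-le")
new = "naïve\ncafe\n".encode("utf-16-le")
diff = encoding_aware_diff(old, new, "utf-16-le")
print("".join(diff))
```

The files on disk are never rewritten here; only the tool decodes them, which matches point C.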


- Alexey.


Marcus Sundman wrote:

Here is my proposal of how *I* think a CM system should handle the "encoding issue" and some related issues. You may have a different opinion, and if you do it'd be nice to hear it, but no trolling, please.
(See the "Notes" section below for comments regarding each point.)
A) There should be support for both mandatory and optional metadata attributes associated with each file in the repository.
B) "Content-Type" should be a mandatory metadata string attribute.

C) "Auto-Filter" should be a mandatory metadata boolean attribute.
D) There should be a filter/plugin architecture to enable transcoding of files on input and output based on their content-types, user settings and user-provided parameters.
E) Utilities such as "diff", "merge" and "annotate" (aka "blame") should be provided by plugins mapped to content-types.
F) Commit comments and other string attributes should use UTF-8.
G) Filenames and paths should use UTF-8 in the repository, and be transcoded to the proper encoding when a client accesses the local file system.
Notes:
A) There is already some mandatory metadata associated with each file. One such attribute is the name of the file.
B) The MIME Content-Type is defined mainly in RFC 2045 and RFC 2046.
All text/* types may include the "charset" parameter (MIME defines "charset" as "character encoding" and not just as a simple character set), and if absent it is assumed to be "us-ascii" (i.e. "ANSI X3.4-1986 as 8 bits/char with the most significant bit set to 0 (zero)"), as per RFC 2046. This is a very common and established standard used in many different systems including, but not limited to, file managers, HTTP and email.
C) If Auto-Filter is set to "true" then content transcoding will occur between the repository and the local system. If it is set to "false" then no transcoding is done. Each project may have its own default Auto-Filter values for different file types.
D) Since editors and other programmers' tools tend to use whatever the local system encoding happens to be, and a project might include people with different systems, there needs to be some transcoding of most text files. The contents of files whose "Auto-Filter" attribute is set to "true" will be stored UTF-8 encoded with U+2028 newlines in the repository and transcoded from/to the local encoding and local newlines on input/output. The contents of files whose "Auto-Filter" attribute is set to "false" will not be transcoded on input/output. Often the proper local encoding and line breaks can be detected automatically, but the user should be able to override the auto-detection in his settings and/or by a parameter to the cm client.
E) For example, if two files with the content-type "application/vnd.sun.xml.writer" are diffed, the system should use a diff plugin that knows how to interpret OpenOffice.org Writer documents. If no such plugin is found, it defaults to the standard diff, which regards the files as byte blobs.
F) UTF-8 should be used for communication between the client and the server. Internally the server might store the strings in any encoding it wants in the repository, but I'd recommend keeping them UTF-8 encoded for simplicity and consistency.
G) Each character in a file name/path that cannot be transcoded to the target file system encoding should be replaced with the character sequence "{uN}" where N is the hexadecimal Unicode code point (e.g. a file named "hello<>world" would be named "hello{u3C}{u3E}world" on Windows). This results in the limitation that filenames must not contain a character sequence matched by the regexp pattern "\{u[0-9A-Fa-f]+\}". Whenever a filename or path is used in a URI the UTF-8 bytes should be properly URI-encoded. Often the proper local encoding can be detected automatically, but the user should be able to override the auto-detection in his settings and/or by a parameter to the cm client. Internally the server might store the strings in any encoding it wants in the repository, but I'd recommend keeping them UTF-8 encoded for simplicity and consistency.
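
The "{uN}" escaping from note G might look roughly like this. It is a minimal sketch: the helper names and the simplified `windows_can_represent` rule are assumptions made for illustration, not part of the proposal, and a real client would ask the target file system what it can store.

```python
import re
from urllib.parse import quote

# The pattern from note G; names matching it cannot be stored unescaped.
ESCAPE_PATTERN = re.compile(r"\{u([0-9A-Fa-f]+)\}")


def escape_filename(name, can_represent):
    """Replace each character the target cannot represent with {uN}."""
    if ESCAPE_PATTERN.search(name):
        raise ValueError("name already contains a {uN} sequence")
    return "".join(
        ch if can_represent(ch) else "{u%X}" % ord(ch) for ch in name
    )


def unescape_filename(name):
    """Turn every {uN} sequence back into the character it stands for."""
    return ESCAPE_PATTERN.sub(lambda m: chr(int(m.group(1), 16)), name)


def windows_can_represent(ch):
    # Hypothetical, simplified rule: reject the characters Windows forbids
    # in file names, plus control characters.
    return ch not in '<>:"/\\|?*' and ord(ch) >= 32


escaped = escape_filename("hello<>world", windows_can_represent)
print(escaped)                     # hello{u3C}{u3E}world
print(unescape_filename(escaped))  # hello<>world

# URI use: quote() percent-encodes the UTF-8 bytes of the name.
print(quote("héllo.txt"))          # h%C3%A9llo.txt
```

Rejecting names that already match the pattern is what keeps the mapping reversible, which is the limitation the note describes.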
Notice that there is no distinction between "text files" and "binary files". The same system that converts between different text encodings might just as well be used to convert between different "raw" audio formats. Just add the appropriate plugin/filter and you're set.
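
A filter registry in this spirit could be sketched as below, assuming filters are looked up by the major part of the content-type and applied only when Auto-Filter is true. The names `FILTERS`, `check_in` and `check_out` are hypothetical; the text filter stores UTF-8 with U+2028 line separators as described in note D.

```python
LINE_SEPARATOR = "\u2028"  # repository newline, per note D


def text_to_repository(data, local_encoding, local_newline):
    """Local bytes -> UTF-8 bytes with U+2028 line separators."""
    text = data.decode(local_encoding)
    return text.replace(local_newline, LINE_SEPARATOR).encode("utf-8")


def text_from_repository(data, local_encoding, local_newline):
    """Repository bytes -> bytes in the local encoding and newlines."""
    text = data.decode("utf-8")
    return text.replace(LINE_SEPARATOR, local_newline).encode(local_encoding)


# Hypothetical registry: major content-type -> (input, output) filters.
# An audio converter could be registered under "audio" in the same way.
FILTERS = {"text": (text_to_repository, text_from_repository)}


def check_in(data, content_type, auto_filter, local_encoding, local_newline):
    major_type = content_type.split("/", 1)[0].strip().lower()
    if not auto_filter or major_type not in FILTERS:
        return data  # stored byte for byte
    to_repository, _ = FILTERS[major_type]
    return to_repository(data, local_encoding, local_newline)


def check_out(data, content_type, auto_filter, local_encoding, local_newline):
    major_type = content_type.split("/", 1)[0].strip().lower()
    if not auto_filter or major_type not in FILTERS:
        return data
    _, from_repository = FILTERS[major_type]
    return from_repository(data, local_encoding, local_newline)


local = "å\r\nb\r\n".encode("latin-1")
stored = check_in(local, "text/plain", True, "latin-1", "\r\n")
restored = check_out(stored, "text/plain", True, "latin-1", "\r\n")
```

With Auto-Filter set to false, or with no filter registered for the type, the bytes pass through unchanged, so "binary" is simply the case where no plugin applies.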
- Marcus Sundman


