Re: [Gnu-arch-users] Encoding handling proposal

gnu-arch-users


From:	Tom Lord
Subject:	Re: [Gnu-arch-users] Encoding handling proposal
Date:	Mon, 30 Aug 2004 13:13:47 -0700 (PDT)

    > From: Marcus Sundman <address@hidden>

    > A) There should be support for both mandatory and optional metadata 
    > attributes associated with each file in the repository.

Agreed.

    > B) "Content-Type" should be a mandatory metadata string attribute.

Quite possibly.   Other alternatives should be explored.   For
example, instead of a content-type, perhaps the name of some region of
the arch namespace?

The arch namespace purports to be a good system for naming
human-constructed artifacts that may evolve over time and relate to
one another (in roughly "branching and merging" type ways).  The set
of valid content-types is one example of such a class of artifacts.

The question is, so it goes, who is to be master?  Who is to "own"
these standard namespaces, such as that of "content-types"?

If the answer is "no one", then what is the alternative to a Tower of
Babel?

Perhaps the answer is, in part, the arch namespace.  When used
cooperatively, it allows anyone to declare themselves a unique
"authority" about some mapping of name to value.  For whatever
community of reference honors that particular archive registration,
that person therefore *is* an authority, in charge of a shared region
of a cooperatively constructed global namespace.

(And: if the arch namespace is used as the space for "cooperative
standards" -- then its (still young and emerging) quasi-algebra for
branching and merging enables the possibility of "dual citizenship"
between otherwise unlinked communities of cooperation.)

Therefore, the arch namespace is an interesting alternative to
IETF-governed namespaces.  It's a political question: which approach
is better?  or, better still, can they be usefully combined?

-t





[I'm in a rush with lots to do, so I'll just say I haven't read the
 rest (sorry -- dropped packet) but liked those first couple of points
 and wanted to get my licks in on this topic.]



    > C) "Auto-Filter" should be a mandatory metadata boolean attribute.
    > 
    > D) There should be a filter/plugin architecture to enable a transcoding
    > of files on input and output based on their content-types and user
    > settings and user-provided parameters.
    > 
    > E) Utilities such as "diff", "merge" and "annotate" (aka "blame") should
    > be provided by plugins mapped to content-types.
    > 
    > F) Commit comments and other string attributes should use UTF-8.
    > 
    > G) Filenames and paths should use UTF-8 in the repository, and be
    > transcoded to the proper encoding when a client accesses the local file
    > system.
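Point E can be sketched roughly as a lookup table from content-type to a diff function, with the byte-blob fallback that note E below describes. This is only an illustration, not anything arch provides: the names DIFF_PLUGINS, diff_for, text_line_diff and byte_blob_diff are hypothetical, and the only plugin registered here is a plain line diff for text/plain.

```python
import difflib
from typing import Callable, Dict, List

# A diff plugin takes two file contents and returns human-readable diff lines.
DiffFn = Callable[[bytes, bytes], List[str]]


def byte_blob_diff(old: bytes, new: bytes) -> List[str]:
    """Fallback: treat both files as opaque byte blobs."""
    if old == new:
        return []
    return [f"Binary contents differ ({len(old)} vs {len(new)} bytes)"]


def text_line_diff(old: bytes, new: bytes) -> List[str]:
    """Line-oriented diff; assumes (for this sketch) UTF-8 text."""
    old_lines = old.decode("utf-8").splitlines()
    new_lines = new.decode("utf-8").splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm=""))


# Hypothetical registry: media type -> diff plugin.
DIFF_PLUGINS: Dict[str, DiffFn] = {"text/plain": text_line_diff}


def media_type(content_type: str) -> str:
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def diff_for(content_type: str) -> DiffFn:
    """Pick the plugin mapped to the content-type, else the byte-blob diff."""
    return DIFF_PLUGINS.get(media_type(content_type), byte_blob_diff)
```

With no plugin registered for "application/vnd.sun.xml.writer", diff_for returns byte_blob_diff; registering one is a single dictionary assignment.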
    > 
    > 
    > Notes:
    > 
    > A) There is already some mandatory metadata associated with each file.
    > One such attribute is the name of the file.
    > 
    > B) The MIME Content-Type is defined mainly in RFC 2045 and RFC 2046.
    > All text/* types may include the "charset" parameter (MIME defines
    > "charset" as "character encoding" and not just as a simple character
    > set), and if absent it is assumed to be "us-ascii" (i.e. "ANSI X3.4-1986
    > as 8 bits/char with the most significant bit set to 0 (zero)"), as per
    > RFC 2046.
    > This is a very common and established standard used in many different
    > systems including, but not limited to, file managers, HTTP and email.
    > 
    > C) If Auto-Filter is set to "true" then content transcoding will occur
    > between the repository and the local system. If it is set to "false" then
    > no transcoding is done.
    > Each project may have its own default Auto-Filter values for different
    > file types.
    > 
    > D) Since editors and other programmers' tools tend to use whatever the
    > local system encoding happens to be and a project might include people
    > with different systems there needs to be some transcoding of most text
    > files.
    > The contents of files whose "Auto-Filter" attribute is set to "true" will
    > be stored UTF-8 encoded with U+2028 newlines in the repository and
    > transcoded from/to the local encoding and local newlines on input/output.
    > The contents of files whose "Auto-Filter" attribute is set to "false"
    > will not be transcoded on input/output.
    > Often the proper local encoding and line breaks can be detected
    > automatically, but the user should be able to override the auto-detection
    > in his settings and/or by a parameter to the cm client.
    > 
    > E) E.g. if two files with the content-type
    > "application/vnd.sun.xml.writer" are diffed the system should use a diff
    > plugin that knows how to interpret OpenOffice.org Writer documents. If no
    > such plugin is found it defaults to the standard diff which regards the
    > files as byte blobs.
    > 
    > F) UTF-8 should be used for communication between the client and the
    > server. Internally the server might store the strings in any encoding it
    > wants in the repository, but I'd recommend keeping them UTF-8 encoded for
    > simplicity and consistency.
    > 
    > G) Each character in a file name/path not possible to transcode to the
    > target file system encoding should be replaced with the character
    > sequence "{uN}" where N is the hexadecimal Unicode code (e.g. a file named
    > "hello<>world" would be named "hello{u3C}{u3E}world" on Windows). This
    > results in the limitation that filenames must not contain a character
    > sequence matched by the regexp pattern "\{u[0-9A-Fa-f]+\}".
    > Whenever a filename or path is used in a URI the UTF-8 bytes should be
    > properly URI-encoded.
    > Often the proper local encoding can be detected automatically, but the
    > user should be able to override the auto-detection in his settings and/or
    > by a parameter to the cm client.
    > Internally the server might store the strings in any encoding it wants in
    > the repository, but I'd recommend keeping them UTF-8 encoded for
    > simplicity and consistency.
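The "{uN}" escaping in note G might look something like the following. This is a sketch under stated assumptions: the function names are made up, and the is_allowed predicate (here a rough stand-in for "characters Windows accepts in a filename") is an assumption standing in for whatever test the client would use to decide that a character cannot be represented on the target file system.

```python
import re

# The pattern from note G; real filenames must not contain a match for it.
ESCAPE_PATTERN = re.compile(r"\{u([0-9A-Fa-f]+)\}")


def escape_filename(name, is_allowed):
    """Replace each character the target cannot hold with {uN} (N = hex code point)."""
    return "".join(ch if is_allowed(ch) else "{u%X}" % ord(ch) for ch in name)


def _decode(match):
    code = int(match.group(1), 16)
    # Leave sequences that are not valid Unicode code points untouched.
    return chr(code) if code <= 0x10FFFF else match.group(0)


def unescape_filename(name):
    """Inverse mapping: turn every {uN} sequence back into its character."""
    return ESCAPE_PATTERN.sub(_decode, name)


# Hypothetical predicate: an approximation of what Windows forbids in filenames.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*')


def windows_allowed(ch):
    return ch not in WINDOWS_FORBIDDEN and ord(ch) >= 32


print(escape_filename("hello<>world", windows_allowed))  # hello{u3C}{u3E}world
```

The round trip only works because of the stated limitation: a repository filename that already contained a literal "{u3C}" would be unescaped into "<", which is why such names have to be ruled out.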
    > 
    > 
    > Notice that there is no distinction between "text files" and "binary
    > files". The same system that converts between different text encodings
    > might just as well be used to convert between different "raw" audio
    > formats. Just add the appropriate plugin/filter and you're set.
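For the text case, the Auto-Filter behaviour from notes C and D could be sketched as a pair of filters, one for each direction. The function names and parameters are hypothetical, and the sketch assumes the local encoding and newline convention have already been detected or supplied by the user.

```python
LINE_SEPARATOR = "\u2028"  # the newline form note D proposes for the repository


def to_repository(data: bytes, auto_filter: bool,
                  local_encoding: str, local_newline: str) -> bytes:
    """Input filter: local encoding and newlines -> UTF-8 with U+2028."""
    if not auto_filter:
        return data  # Auto-Filter "false": store the bytes untouched
    text = data.decode(local_encoding)
    return text.replace(local_newline, LINE_SEPARATOR).encode("utf-8")


def from_repository(data: bytes, auto_filter: bool,
                    local_encoding: str, local_newline: str) -> bytes:
    """Output filter: UTF-8 with U+2028 -> local encoding and newlines."""
    if not auto_filter:
        return data
    text = data.decode("utf-8")
    return text.replace(LINE_SEPARATOR, local_newline).encode(local_encoding)
```

A filter for some other content-type (the "raw" audio example, say) would have the same shape: bytes in, bytes out, with the Auto-Filter flag deciding whether it runs at all.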
    > 
    > 
    > - Marcus Sundman
