Wed, 02 Apr 2003 09:03:31 -0500
Module name: fenfire
Changes by: Benja Fallenstein <address@hidden> 03/04/02 09:03:31
diff -u fenfire/docs/pegboard/canon3_file_format--benja/peg.rst:1.1
--- fenfire/docs/pegboard/canon3_file_format--benja/peg.rst:1.1 Tue Apr 1
+++ fenfire/docs/pegboard/canon3_file_format--benja/peg.rst Wed Apr 2
@@ -4,8 +4,8 @@
:Author: Benja Fallenstein
-:Revision: $Revision: 1.1 $
-:Last-Modified: $Date: 2003/04/01 19:46:07 $
+:Revision: $Revision: 1.2 $
+:Last-Modified: $Date: 2003/04/02 14:03:30 $
@@ -21,6 +21,89 @@
This PEG specifies such a format.
+- Does this also cover bags and sequences? Reification?
+ RESOLVED: Of course. All RDF structures (anything
+ that can be serialized as triples) can be
+ represented as Canon3.
+- Do we really need a new format?
+ RESOLVED: None of the existing formats are canonical.
+- How compatible is this with N3 and NTriples?
+ What are the differences?
+ RESOLVED: NTriples is encoded in US-ASCII and
+ doesn't allow for multi-line literals. N3 cannot
+ refer to anonymous nodes. An N3 processor
+ will be able to read any Canon3 file that does
+ not contain anonymous nodes (except if the
+ Unicode LINE SEPARATOR character is used,
+ which is not allowed by N3).
+ (Anonymous nodes in Canon3 are represented
+ as in NTriples.)
+- Should the encoding be allowed to be different?
+ RESOLVED: No, since that would lose both
+ canonicality and compatibility with N3.
+- Is UTF-8 always sufficient?
+ RESOLVED: UTF-8 can represent all of Unicode and
+ RDF uses Unicode only; therefore, yes.
+- Is quoting with three quotes really what we want?
+ RESOLVED: Multiline literals are really what
+ we want: imagine you have a 1K HTML document
+ as a literal and the encoder puts it all
+ in one line. (Also, with multiline literals,
+ CVS's diffs are more useful.)
+ Multiline literals are enclosed in three quotes in N3.
+- Does the specification need to talk about equal triples
+ occurring in the same graph? Can the same triple
+ occur twice, according to the RDF spec?
+ RESOLVED: There are tools which allow a single triple
+ to occur twice. Therefore, the spec should be clear
+ about the topic.
+- Why `Normalization Form C`_?
+ RESOLVED: Because it's required by N3, and because
+ it's the standard on the Web (http://www.w3.org/TR/charmod/).
+- Does it allow for the different newline conventions?
+ RESOLVED: Yes. (Normalization Form C only specifies that
+ composite characters like umlauts are stored in
+ composed, not decomposed, form. See the spec.)
+- Wouldn't it be easier to produce the serialization format
+ for each triple, and then put those into lexical order?
+ Or if the parts must be compared
+ separately, could we compare serializations of those parts?
+ RESOLVED: We assume that a Canon3 writer usually operates
+ on an in-memory representation of an RDF graph. That
+ makes it easy to sort triples in unencoded form, and hard
+ to sort them in encoded form. It's also more scalable:
+ Sorting on the serializations would mean having to
+ generate the whole serialization in memory first,
+ before writing anything to the disk.
+ (Also note that simply sorting the *lines*
+ wouldn't work anyway, because of multiline literals.)
@@ -39,13 +122,35 @@
URItoken ::= "<" URIref ">"
anonNode ::= "_:" [A-Za-z][A-Za-z0-9]*
literal ::= #x22 #x22 #x22 string #x22 #x22 #x22 qualifiers
- qualifiers ::= ("@" language)? ("^^" URItoken)?
+ qualifiers ::= ("@" language)? ("^^" type)?
+ type ::= URItoken
-The ``NEWLINE`` token may be any of CR, LF, and CRLF.
-(This is necessary for CVS to be useful across platforms.)
+The ``NEWLINE`` token may be any of CR, LF, CRLF, and
+the Unicode LINE SEPARATOR (U+2028).
+This is necessary for CVS to be useful across platforms.
In contexts where the specific form used matters,
the newline character is LF. (In particular, when computing
a content hash-- e.g., when creating a Canon3 Storm block.)
+It would be nicer to use LINE SEPARATOR, but that
+would be an incompatibility with N3.
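As a minimal sketch of the LF rule above (not part of the PEG itself): the hypothetical helper ``canonical_bytes`` rewrites CRLF, CR, and U+2028 to LF and encodes the result as UTF-8, so that a content hash can be computed over identical bytes on every platform. It assumes that every occurrence of these characters counts as a ``NEWLINE``, including inside literals, which the text here does not settle.

```python
def canonical_bytes(text: str) -> bytes:
    """Return the UTF-8 bytes of `text` with every newline form rewritten to LF.

    Hypothetical helper, not defined by the PEG.
    """
    # Replace CRLF first so that it does not turn into two LFs.
    text = text.replace("\r\n", "\n")
    text = text.replace("\r", "\n")
    # U+2028 is the Unicode LINE SEPARATOR.
    text = text.replace("\u2028", "\n")
    return text.encode("utf-8")
```

A hash function of the implementer's choice would then be applied to the returned bytes.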
+A ``string`` is any UTF-8 character sequence
+encoded in the following way:
+- Double any backslash in the string.
+- Insert a backslash before the first of any three
+ consecutive double quotes (#x22) in the string.
+ (This means: In a sequence of three or more
+ double quote characters, insert a backslash
+ before all but the last two double quotes).
+For example, the string ``f\oo"""""ba"r`` becomes
+``f\\oo\"\"\"""ba"r``.
+Strings may contain newlines. Like all of Canon3,
+they are encoded in Normalization Form C.
+They are enclosed in triple double quotes
+(see production ``literal``).
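The two encoding rules can be sketched in Python as follows. This is an illustration only: the function name ``encode_canon3_string`` is hypothetical, and the sketch follows the rules exactly as worded. The example file in this PEG also escapes a double quote that directly precedes the closing delimiter; the sketch does not handle that case.

```python
import re


def encode_canon3_string(s: str) -> str:
    """Escape `s` for use between triple double quotes (hypothetical helper)."""
    # Rule 1: double any backslash.
    s = s.replace("\\", "\\\\")

    # Rule 2: in each run of three or more double quotes, insert a
    # backslash before all but the last two quotes.
    def escape_run(match: re.Match) -> str:
        run_length = len(match.group(0))
        return '\\"' * (run_length - 2) + '""'

    return re.sub(r'"{3,}', escape_run, s)


print(encode_canon3_string(r'f\oo"""""ba"r'))
# prints: f\\oo\"\"\"""ba"r
```

Backslashes are doubled before quotes are escaped, so the backslashes added in the second step are not doubled again.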
The triples must be ordered. Two triples are compared
by comparing their subjects, properties, and objects
@@ -77,37 +182,30 @@
equal triples in the graph to be serialized, this
triple must occur only once in the serialization.
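A rough sketch of ordering and de-duplicating an in-memory graph before writing it out. Triples are modelled here as ``(subject, property, object)`` tuples of strings, and plain Python string comparison stands in for the comparison rules, which are not shown in this excerpt; both choices, and the name ``canonical_order``, are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object), unencoded


def canonical_order(triples: Iterable[Triple]) -> List[Triple]:
    """Return the triples sorted, with equal triples collapsed to one."""
    # set() drops equal triples; sorted() compares tuples element by
    # element, i.e. subject first, then property, then object.
    # ASSUMPTION: Python's default string ordering is only a stand-in
    # for the ordering that the PEG defines.
    return sorted(set(triples))
```

Sorting happens on the unencoded values, so no serialization has to be built in memory before the first triple is written.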
-``URIref`` is a URI reference as defined in [RFC 2396].
-Percent escapes (e.g. ``%2f``) should preferably
-be encoded in lower case. URIref may be either of the following:
-1. An absolute URI (e.g., ``http://example.org/``).
-2. An absolute URI plus a fragment identifier
- (e.g., ``http://example.org/#foo``).
-3. The empty URI reference (which is a relative URI
- referring to the current document).
-4. A standalone fragment identifier (e.g., ``#foo``),
- referring to a fragment of the current document.
+``URIref`` is one of the following:
-``language`` is a Language-Tag as defined by [RFC 3066].
+1. An `RDF URI reference`_ encoded in UTF-8 (Normalization
+ Form C) like the rest of Canon3.
+2. An RDF URI reference with everything before the
+ fragment identifier (if any) omitted. This refers
+ to the current document (in the case of the empty
+ string) or to a fragment of it (e.g., ``#foo``).
-A ``string`` is any UTF-8 character sequence
-encoded in the following way:
-- Double any backslash in the string.
-- Insert a backslash before the first of any three
- consecutive double quotes (#x22) in the string.
- (This means: In a sequence of three or more
- double quote characters, insert a backslash
- before all but the last two double quotes).
+``language`` is a Language-Tag as defined by [RFC 3066].
+If present, ``language`` and ``type`` indicate
+the `language tag and data type`_ of a literal.
-For example, the string ``f\oo"""""ba"r`` becomes
-``f\\oo\"\"\"""ba"r``.
+Here's an example Canon3 file::
-Strings may contain newlines. Like all of Canon3,
-they are encoded in Normalization Form C.
-They are enclosed in triple double quotes
-(see production ``literal``).
+ # Canon3 <http://fenfire.org/2003/Canon3/1.0/>
+ <> <http://example.org/isa> <http://example.org/document>.
+ <> <http://example.org/name> """Foobar
+ An example Canon3 "document\""""@en.
+ <> <http://example.org/name> """Foobar
+ Ein Beispiel eines Canon3-"Dokumentes\""""@de.
+ <#Foo> <http://example.org/name> """Foo fragment identifier""".
+ <http://example.org> <urn:x-foo:related> <urn:x-foo:rittlefricks>.
+ <http://example.org> <urn:x-files:rating>
We will register a MIME type for Canon3.
@@ -117,4 +215,6 @@
.. _Normalization Form C: http://www.unicode.org/unicode/reports/tr15/
.. _NTriples: http://www.w3.org/TR/rdf-testcases/#ntriples
.. _Notation 3: http://www.w3.org/DesignIssues/Notation3.html
+.. _RDF URI reference: http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref
+.. _language tag and data type:
\ No newline at end of file