Re: [Monotone-devel] Re: .mt-attrs formatting

monotone-devel


From:	Brian Campbell
Subject:	Re: [Monotone-devel] Re: .mt-attrs formatting
Date:	Thu, 19 Aug 2004 17:36:17 -0400

I think that what Eric Raymond has to say on the issue is actually fairly relevant. Here is the advice he gives on developing a file format, from <http://catb.org/~esr/writings/taoup/html/ch05s02.html>. Obviously, all of these are suggestions, not rules, but they're well thought out, although definitely very Unix-centric.

---

One record per newline-terminated line, if possible. This makes it easy to extract records with text-stream tools. For data interchange with other operating systems, it's wise to make your file-format parser indifferent to whether the line ending is LF or CR-LF. It's also conventional to ignore trailing whitespace in such formats; this protects against common editor bobbles.
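
As an illustration of the point above (not part of the quoted text), here is a minimal Python sketch of a reader that treats LF and CR-LF alike and ignores trailing whitespace. The function name read_records is hypothetical, and skipping blank lines is an assumption of this sketch, not something the advice requires.

```python
def read_records(text):
    """Split text into one record per line, tolerating LF and CR-LF."""
    records = []
    for line in text.split("\n"):
        # rstrip() drops a trailing "\r" as well as trailing spaces and tabs.
        line = line.rstrip()
        if line:  # assumption of this sketch: blank lines carry no record
            records.append(line)
    return records
```

Because the carriage return is just another trailing whitespace character, one rstrip() call covers both requirements.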

Less than 80 characters per line, if possible. This makes the format browseable in an ordinary-sized terminal window. If many records must be longer than 80 characters, consider a stanza format (see below).

Use # as an introducer for comments. It is good to have a way to embed annotations and comments in data files. It's best if they're actually part of the file structure, and so will be preserved by tools that know its format. For comments that are not preserved during parsing, # is the conventional start character.

Support the backslash convention. The least surprising way to support embedding nonprintable control characters is by parsing C-like backslash escapes: \n for a newline, \r for a carriage return, \t for a tab, \b for backspace, \f for formfeed, \e for ASCII escape (27), \nnn or \onnn or \0nnn for the character with octal value nnn, \xnn for the character with hexadecimal value nn, \dnnn for the character with decimal value nnn, \\ for a literal backslash. A newer convention, but one worth following, is the use of \unnnn for a hexadecimal Unicode literal.
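
A rough Python sketch of such a decoder follows (an illustration, not part of the quoted text). It handles only a subset of the escapes listed above: the single-letter ones, \\, three-digit octal \nnn, \xnn and \unnnn. The name unescape is hypothetical, and mapping an unrecognised escape to the character itself is an assumption of this sketch.

```python
import re

# Single-character escapes from the list above.
_SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "b": "\b",
           "f": "\f", "e": "\x1b", "\\": "\\"}

# Try the longer forms first, then fall back to any single character.
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4}|[0-7]{3}|.)",
                     re.DOTALL)

def unescape(s):
    """Decode a subset of C-like backslash escapes in s."""
    def repl(match):
        body = match.group(1)
        if body[0] in "xu" and len(body) > 1:
            return chr(int(body[1:], 16))   # \xnn or \unnnn
        if len(body) == 3:
            return chr(int(body, 8))        # \nnn (octal)
        # Assumption: an unknown escape such as \: yields the character itself.
        return _SIMPLE.get(body, body)
    return _ESCAPE.sub(repl, s)
```

With that fallback, the same function can also undo an escaped field separator such as \: or an escaped space.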

In one-record-per-line formats, use colon or any run of whitespace as a field separator. The colon convention seems to have originated with the Unix password file. If your fields must contain instances of the separator(s), use a backslash as the prefix to escape them.

Do not allow the distinction between tab and whitespace to be significant. This is a recipe for serious headaches when the tab settings on your users' editors are different; more generally, it's confusing to the eye. Using tab alone as a field separator is especially likely to cause problems; allowing any run of tabs and spaces to be a field separator, on the other hand, works well.
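
To make the last two points concrete (again an illustration, not part of the quoted text), this Python sketch splits a line on any run of tabs and spaces and lets a backslash escape a separator inside a field. The name split_fields is hypothetical, only the whitespace-separated variant is shown, and a lone trailing backslash is silently dropped in this sketch.

```python
import re

# A field is a run of escaped characters (backslash plus any character)
# or of characters that are neither whitespace nor a backslash.
_FIELD = re.compile(r"(?:\\.|[^\s\\])+")

def split_fields(line):
    """Split on runs of whitespace, honouring backslash-escaped characters."""
    fields = []
    for raw in _FIELD.findall(line):
        # Remove the backslash from each escaped character.
        fields.append(re.sub(r"\\(.)", r"\1", raw))
    return fields
```

Under this scheme a file name with a leading or embedded space is written with an escaped space, for example \ name or my\ file.txt, so the number of separator spaces never matters.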

Favor hex over octal. Hex-digit pairs and quads are easier to eyeball-map into bytes and today's 32- and 64-bit words than octal digits of three bits each; also marginally more efficient. This rule needs emphasizing because some older Unix tools such as od(1) violate it; that's a legacy from the instruction field sizes in the machine languages of older PDP minicomputers.

For complex records, use a ‘stanza’ format: multiple lines per record, with a record separator line of %%\n or %\n. The separators make useful visual boundaries for human beings eyeballing the file.

In stanza formats, either have one record field per line or use a record format resembling RFC 822 electronic-mail headers, with colon-terminated field-name keywords leading fields. The second choice is appropriate when fields are often either absent or longer than 80 characters, or when records are sparse (e.g., often with empty fields).

In stanza formats, support line continuation. When interpreting the file, either discard backslash followed by whitespace or interpret newline followed by whitespace equivalently to a single space, so that a long logical line can be folded into short (easily editable!) physical lines. It's also conventional to ignore trailing whitespace in these formats; this convention protects against common editor bobbles.
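
The three stanza suggestions above can be sketched together in Python (an illustration, not part of the quoted text). The sketch assumes %% as the record separator, RFC 822-style "key: value" fields, and continuation lines that begin with whitespace and are folded in as a single space. The name parse_stanzas and the field names used in the usage note are hypothetical.

```python
def parse_stanzas(text):
    """Parse %%-separated stanzas of 'key: value' lines into dicts."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        line = line.rstrip()                  # ignore trailing whitespace
        if line == "%%":                      # record separator
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[:1] in (" ", "\t") and last_key is not None:
            # Continuation: newline plus leading whitespace acts as one space.
            current[last_key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        records.append(current)
    return records
```

For example, a stanza whose "file" field is folded over two physical lines comes back as one dict with the two halves joined by a single space.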

Either include a version number or design the format as self-describing chunks independent of each other. If there is even the faintest possibility that the format will have to be changed or extended, include a version number so your code can conditionally do the right thing on all versions. Alternatively, design the format as self-describing chunks so that you can add new chunk types without instantly breaking old code.

Beware of floating-point round-off problems. Conversion of floating-point numbers from binary to text format and back can lose precision, depending on the quality of the conversion library you are using. If the structure you are marshaling/unmarshaling contains floating point, you should test the conversion in both directions. If it looks like conversion in either direction is subject to roundoff errors, be prepared to dump the floating-point field as raw binary instead, or a string encoding thereof. If you're coding in C or some language that has access to C printf/scanf, the C99 %a specifier may solve this problem.
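
For comparison (an illustration, not part of the quoted text), Python exposes a hexadecimal float notation similar to C's %a through float.hex() and float.fromhex(). The wrapper names dump_float and load_float below are hypothetical.

```python
def dump_float(x):
    """Encode a float as a hexadecimal string, similar to C99 %a output."""
    return x.hex()

def load_float(s):
    """Decode a string produced by dump_float back into the same float."""
    return float.fromhex(s)
```

The hexadecimal form spells out the binary mantissa and exponent directly, so the round trip involves no decimal conversion and therefore no round-off.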

Don't bother compressing or binary-encoding just part of the file. See below...

---

On Aug 19, 2004, at 5:14 PM, Richard Levitte - VMS Whacker wrote:

In message <address@hidden> on Thu, 19 Aug 2004 15:14:14 -0400, "graydon hoare" <address@hidden> said:


graydon> eh, actually, I would prefer to:
graydon>
graydon>   - stick with one space

graydon>   - change the regex to handle files with leading whitespace in names


So let's see if I get you right, you're saying that you want one
insignificant space followed by a number (0 or more) of spaces,
potentially looking exactly the same as that first insignificant one.

I don't know how else to say this: from a technical point of view,
fine.  From a usability and user interface point of view, I find that
to be a horrible idea, riddled with confusion and error prone as a
result.  Just look at the confusion it has created already.  Let's not
forget that as much hackers as we may be, we're also human, and there
will always be a number of confused people who will get the space
count wrong, thus increasing the number of user support calls/emails,
potentially becoming a FAQ that very gently says "you didn't count
your spaces correctly, dummy!"

The existence of tabs will not make matters better, trust me.  I
noticed that you do span over [^[:space:]] to find non-space stuff,
while you use an actual space to match the delimiter itself.  This
means that while a tab will end a run of non-space stuff just as a
space does, you will then refuse to accept the tab as a delimiter.
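
To illustrate both concerns, here is a Python sketch. The patterns are hypothetical stand-ins, not the actual monotone regex, which is not shown in this thread; \S plays the role of [^[:space:]], and the "attribute, then file name" layout is an assumption.

```python
import re

# Hypothetical pattern: non-space token, exactly one literal space, the rest.
STRICT = re.compile(r"(\S+) (.*)")

# Hypothetical alternative: any run of spaces and tabs is the delimiter.
LENIENT = re.compile(r"(\S+)[ \t]+(.*)")

def parse(pattern, line):
    """Return the (token, rest) pair, or None if the line does not match."""
    match = pattern.fullmatch(line)
    return match.groups() if match else None
```

With STRICT, a line using a tab as the delimiter is rejected outright, and a line with two spaces silently yields a file name that starts with a space. With LENIENT, both lines give the same two fields, at the cost of no longer being able to express a leading space without quoting or escaping.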

I suggest that it should be possible to enclose a file name with
quotes (single or double, I don't care), for the sake of clarity.
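
As a sketch of what quoting could look like, Python's standard shlex module splits a line using POSIX shell quoting rules, which accept both single and double quotes. This is only one possible convention, and the "attribute, then file name" layout and the name split_quoted are assumptions.

```python
import shlex

def split_quoted(line):
    """Split a line into fields, keeping quoted strings together."""
    # shlex.split treats runs of spaces and tabs as separators and
    # removes the surrounding quotes from quoted fields.
    return shlex.split(line)
```

A file name with a leading space is then written as " name.sh" inside quotes, so the space is visibly part of the name rather than something to be counted.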

Or have a completely different delimiter than space.  I've no idea
what syntax you're imagining that basic_io.cc would handle...

Cheers,
Richard

-----
Please consider sponsoring my work on free software.
See http://www.free.lp.se/sponsoring.html for details.

--
Richard Levitte                         address@hidden
                                        http://richard.levitte.org/


