
Re: [Monotone-devel] Re: .mt-attrs formatting


From: Brian Campbell
Subject: Re: [Monotone-devel] Re: .mt-attrs formatting
Date: Thu, 19 Aug 2004 17:36:17 -0400

I think that what Eric Raymond has to say on the issue is actually fairly relevant. Here is the advice he gives on designing a file format, from <http://catb.org/~esr/writings/taoup/html/ch05s02.html>. Obviously, all of these are suggestions, not rules, but they're well thought out, although definitely very Unix-centric.

---
One record per newline-terminated line, if possible. This makes it easy to extract records with text-stream tools. For data interchange with other operating systems, it's wise to make your file-format parser indifferent to whether the line ending is LF or CR-LF. It's also conventional to ignore trailing whitespace in such formats; this protects against common editor bobbles.

Less than 80 characters per line, if possible. This makes the format browseable in an ordinary-sized terminal window. If many records must be longer than 80 characters, consider a stanza format (see below).

Use # as an introducer for comments. It is good to have a way to embed annotations and comments in data files. It's best if they're actually part of the file structure, and so will be preserved by tools that know its format. For comments that are not preserved during parsing, # is the conventional start character.

Support the backslash convention. The least surprising way to support embedding nonprintable control characters is by parsing C-like backslash escapes — \n for a newline, \r for a carriage return, \t for a tab, \b for backspace, \f for formfeed, \e for ASCII escape (27), \nnn or \onnn or \0nnn for the character with octal value nnn, \xnn for the character with hexadecimal value nn, \dnnn for the character with decimal value nnn, \\ for a literal backslash. A newer convention, but one worth following, is the use of \unnnn for a hexadecimal Unicode literal.

In one-record-per-line formats, use colon or any run of whitespace as a field separator. The colon convention seems to have originated with the Unix password file. If your fields must contain instances of the separator(s), use a backslash as the prefix to escape them.

Do not allow the distinction between tab and whitespace to be significant. This is a recipe for serious headaches when the tab settings on your users' editors are different; more generally, it's confusing to the eye. Using tab alone as a field separator is especially likely to cause problems; allowing any run of tabs and spaces to be a field separator, on the other hand, works well.

Favor hex over octal. Hex-digit pairs and quads are easier to eyeball-map into bytes and today's 32- and 64-bit words than octal digits of three bits each; also marginally more efficient. This rule needs emphasizing because some older Unix tools such as od(1) violate it; that's a legacy from the instruction field sizes in the machine languages of older PDP minicomputers.

For complex records, use a ‘stanza’ format: multiple lines per record, with a record separator line of %%\n or %\n. The separators make useful visual boundaries for human beings eyeballing the file.

In stanza formats, either have one record field per line or use a record format resembling RFC 822 electronic-mail headers, with colon-terminated field-name keywords leading fields. The second choice is appropriate when fields are often either absent or longer than 80 characters, or when records are sparse (e.g., often with empty fields).

In stanza formats, support line continuation. When interpreting the file, either discard backslash followed by whitespace or interpret newline followed by whitespace equivalently to a single space, so that a long logical line can be folded into short (easily editable!) physical lines. It's also conventional to ignore trailing whitespace in these formats; this convention protects against common editor bobbles.

Either include a version number or design the format as self-describing chunks independent of each other. If there is even the faintest possibility that the format will have to be changed or extended, include a version number so your code can conditionally do the right thing on all versions. Alternatively, design the format as self-describing chunks so that you can add new chunk types without instantly breaking old code.

Beware of floating-point round-off problems. Conversion of floating-point numbers from binary to text format and back can lose precision, depending on the quality of the conversion library you are using. If the structure you are marshaling/unmarshaling contains floating point, you should test the conversion in both directions. If it looks like conversion in either direction is subject to roundoff errors, be prepared to dump the floating-point field as raw binary instead, or a string encoding thereof. If you're coding in C or some language that has access to C printf/scanf, the C99 %a specifier may solve this problem.

Don't bother compressing or binary-encoding just part of the file. See below...
---
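
As a rough illustration of several of the suggestions above (one record per line, LF or CR-LF accepted, trailing whitespace ignored, # comments, any run of spaces and tabs as the field separator, backslash escapes), here is a minimal Python sketch. The names parse_line and parse_text are made up for this example; this is not monotone code and not a proposal for the actual .mt-attrs syntax, just one way the conventions could fit together.

```python
# Hypothetical sketch: parse_line / parse_text are illustrative names only.

# A small subset of the C-like escapes listed in the quoted advice.
SIMPLE_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\"}


def parse_line(line):
    """Split one record into fields; blank and comment lines give []."""
    fields, current, in_field = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            if nxt == "x" and i + 3 < len(line):
                # \xnn: character with hexadecimal value nn
                current.append(chr(int(line[i + 2:i + 4], 16)))
                i += 4
            else:
                # Known escape, or an escaped separator / '#' taken literally
                current.append(SIMPLE_ESCAPES.get(nxt, nxt))
                i += 2
            in_field = True
            continue
        if ch in " \t\r\n":
            # Any run of whitespace (including a stray CR) ends the field,
            # so trailing whitespace and CR-LF endings are harmless.
            if in_field:
                fields.append("".join(current))
                current, in_field = [], False
        elif ch == "#" and not in_field:
            break  # unescaped '#' at the start of a field begins a comment
        else:
            current.append(ch)
            in_field = True
        i += 1
    if in_field:
        fields.append("".join(current))
    return fields


def parse_text(text):
    """Parse a whole file, dropping blank and comment-only lines."""
    return [f for f in map(parse_line, text.split("\n")) if f]


if __name__ == "__main__":
    sample = "# attributes\r\n\\ leading\\ space.txt \t execute   true  \r\n"
    print(parse_text(sample))
```

With this scheme a file name that starts with a space is written with an explicit backslash before the space, so nobody has to count invisible characters.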

On Aug 19, 2004, at 5:14 PM, Richard Levitte - VMS Whacker wrote:

In message <address@hidden> on Thu, 19 Aug 2004 15:14:14 -0400, "graydon hoare" <address@hidden> said:

graydon> eh, actually, I would prefer to:
graydon>
graydon>   - stick with one space
graydon>   - change the regex to handle files with leading whitespace in names

So let's see if I get you right: you're saying that you want one
insignificant space followed by some number (0 or more) of spaces,
potentially looking exactly the same as that first insignificant one.

I don't know how else to say this: from a technical point of view,
fine.  From a usability and user interface point of view, I find that
to be a horrible idea, riddled with confusion and error-prone as a
result.  Just look at the confusion it has created already.  Let's not
forget that, hackers though we may be, we're also human, and there
will always be a number of confused people who will get the space
count wrong, thus increasing the number of user support calls/emails,
potentially becoming a FAQ that very gently says "you didn't count
your spaces correctly, dummy!"

The existence of tabs will not make matters better, trust me.  I
noticed that you span over [^[:space:]] to find non-space stuff,
while you use an actual space to match the delimiter itself.  This
means that while a tab ends a field just as a space does, you will
then refuse to accept the tab as a delimiter.
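
To make the tab problem concrete, here is a small Python sketch. The two patterns are hypothetical stand-ins, not the regex monotone actually uses, and since Python's re module has no POSIX [[:space:]] class, \S plays the role of [^[:space:]].

```python
import re

# Hypothetical patterns for illustration only (not monotone's real regex).
# STRICT: fields are runs of non-whitespace, delimiter is one literal space.
STRICT = re.compile(r"^(\S+) (\S+)$")
# LENIENT: same fields, but any run of spaces and tabs is a delimiter.
LENIENT = re.compile(r"^(\S+)[ \t]+(\S+)$")


def split_two(pattern, line):
    """Return the two captured fields, or None if the line does not match."""
    m = pattern.match(line)
    return m.groups() if m else None


if __name__ == "__main__":
    for line in ("execute true", "execute\ttrue", "execute  true"):
        print(repr(line), split_two(STRICT, line), split_two(LENIENT, line))
```

In the strict pattern a tab can be neither part of a field nor the delimiter, so a line that looks fine in an editor is simply rejected.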

I suggest that it should be possible to enclose a file name with
quotes (single or double, I don't care), for the sake of clarity.
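
A quick way to see what quoting would buy us is Python's standard shlex module, which splits on whitespace but keeps quoted strings intact. This is only a sketch of the idea under shell-like quoting rules; the function name split_quoted is made up, and the real syntax would be whatever we agree on.

```python
import shlex


def split_quoted(line):
    """Split a line on whitespace, honouring single and double quotes.

    Uses shell-like rules from the standard library, which is an
    assumption for this sketch, not a proposed .mt-attrs grammar.
    """
    return shlex.split(line)


if __name__ == "__main__":
    # The leading spaces are visibly part of the name, not of the delimiter.
    print(split_quoted('"  odd name.txt" execute true'))
    print(split_quoted("'  odd name.txt'\texecute\ttrue"))
```

Both lines yield the same three fields, with the leading spaces preserved inside the file name.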

Or have a completely different delimiter than space.  I've no idea
what syntax you're imagining that basic_io.cc would handle...

Cheers,
Richard

-----
Please consider sponsoring my work on free software.
See http://www.free.lp.se/sponsoring.html for details.

--
Richard Levitte                         address@hidden
                                        http://richard.levitte.org/


_______________________________________________
Monotone-devel mailing list
address@hidden
http://lists.nongnu.org/mailman/listinfo/monotone-devel





