Re: [lmi] Unknown fields in table input text files


From: Vadim Zeitlin
Subject: Re: [lmi] Unknown fields in table input text files
Date: Sat, 20 Feb 2016 16:12:59 +0100

On Sat, 20 Feb 2016 14:33:01 +0000 Greg Chicares <address@hidden> wrote:

GC> Yes, to the naive reader, it would appear that this file has a novel
GC> "Editor" record; but in fact it does not. Therefore...

 Sorry, I'm afraid I didn't explain the problem well at all and wasted your
time with all these investigations. Let me try to explain it once again:

 There are two formats for the tables. One of them is the binary format used
in the .dat files. The other one is the text format produced by table_tool's
--extract option and consumed by its --merge option. There is no problem
whatsoever with the binary format; the question of this thread is how to
handle unknown words followed by colons in the text files *only*.

 E.g. the round-trip test done by the table_tool --verify option reads all
tables from the binary database, converts each of them to text, then parses
this text as a table and checks that it obtains exactly the same table as
the original one. The code doing the parsing tries to be rather strict, as
previously discussed, so it complains if it sees something that looks like a
field at the beginning of a line but isn't actually a known field. We would,
presumably, like to prevent this from happening.
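
 Schematically, the check amounts to something like the following sketch;
the names here are placeholders, not lmi's actual classes or functions, and
the real --verify code is organized differently:

#include <vector>

template<typename Table, typename ToText, typename FromText>
int count_round_trip_failures
    (std::vector<Table> const& tables    // all tables read from the binary .dat file
    ,ToText                    to_text   // the same conversion --extract performs
    ,FromText                  from_text // the same strict parser --merge uses
    )
{
    int failures = 0;
    for(auto const& original : tables)
        {
        Table const reparsed = from_text(to_text(original));
        if(!(reparsed == original))
            {
            ++failures; // text output did not parse back to an identical table
            }
        }
    return failures;
}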

 And the question is whether we should:

1. Just silently ignore all unknown fields ("ignore" means treating them as
   part of the preceding field's value).
2. Give an error about them (as the code used to behave).
3. Give an error about them except if they're in a (hopefully short) list
   of known non-fields (as the latest version of the code does).
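
 To make the three options concrete, here is a rough sketch of how such a
classification could look. The known field names below are illustrative,
"Editor" and "WARNING" are the whitelist entries discussed in this thread,
and none of this is the actual lmi parser:

#include <cctype>
#include <set>
#include <string>

enum class line_kind {field_start, continuation, error};

// Does the line begin with alphanumeric words (separated by spaces) that
// are immediately followed by a colon?
bool looks_like_field(std::string const& line, std::string& name)
{
    auto const colon = line.find(':');
    if(std::string::npos == colon || 0 == colon)
        {
        return false;
        }
    for(char c : line.substr(0, colon))
        {
        if(!(std::isalnum(static_cast<unsigned char>(c)) || ' ' == c))
            {
            return false;
            }
        }
    name = line.substr(0, colon);
    return true;
}

line_kind classify_line(std::string const& line)
{
    static std::set<std::string> const known_fields
        {"Table number", "Table type", "Contributor", "Source of data"};
    static std::set<std::string> const known_non_fields {"Editor", "WARNING"};

    std::string name;
    if(!looks_like_field(line, name))
        {
        return line_kind::continuation; // ordinary value or continuation line
        }
    if(known_fields.count(name))
        {
        return line_kind::field_start;  // begins a new field
        }
    if(known_non_fields.count(name))
        {
        return line_kind::continuation; // option 3: whitelisted, kept in the value
        }
    return line_kind::error;            // option 2: complain about an unknown field
}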

 Again, the question is only about this, but I have the impression that I
hadn't expressed this clearly and so you've been answering something else.


GC> > GC> (2) Use a regex like /[A-Za-z0-9]* *[A-Za-z0-9]*:/ on the assumption that
GC> > GC> header names consist of one or two words followed by a colon. Deem any
GC> > GC> colon that occurs later in the line to be content rather than markup.
GC> 
GC> This cannot work. A "Contributor" specified as
GC>   "\nSource of data:\Table number:\nContributor:"
GC> cannot be parsed this way.

 Sorry, I don't understand this at all. If we have a line starting with
"Contributor:" in the text input, it will be parsed as the contributor
field. Notice that if the colon is followed immediately by "\n", an error
will be given about a missing contributor value.

GC> >  Yes, I definitely need to do this to avoid at least the obvious false
GC> > positives. The trouble with "Editor:" and "WARNING:" is that they're not
GC> > really obvious, are they.
GC> 
GC> Actually, we must not do this. And "Editor:" and "WARNING:" are not
GC> record titles and do not begin new records. Records are indicated
GC> by prefixed bytes like EOT and VT.

 Yes, in the binary format. I'm only speaking about _reading_ (not writing)
text files.

GC> (Therefore, record content must not include those bytes.)

 Sorry for being pedantic, but this is not really correct: the fields in
these files are prefixed by their type and length, i.e. the strings can
contain any bytes, including NUL.
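
 For illustration, reading one such type- and length-prefixed field could
look roughly like this; the 16-bit widths and the lack of any endianness
handling are simplifying assumptions, not a statement about the actual .dat
layout:

#include <cstdint>
#include <istream>
#include <string>

struct field
{
    std::uint16_t type;   // record type tag
    std::string   value;  // raw bytes; may legitimately contain NUL
};

field read_field(std::istream& is)
{
    std::uint16_t type   = 0;
    std::uint16_t length = 0;
    is.read(reinterpret_cast<char*>(&type  ), sizeof type  );
    is.read(reinterpret_cast<char*>(&length), sizeof length);
    std::string value(length, '\0');
    if(0 != length)
        {
        is.read(&value[0], length); // exactly 'length' bytes; no terminator is sought
        }
    return {type, value};
}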

GC> >  Would we include "WARNING" in this whitelist?
GC> 
GC> No. It's not a record type.

 Again, I think this answers a different question from the one I had asked,
because there are no records in the text format.

 FWIW I did include "WARNING" in the list of known non-fields together with
"Editor" for now, just to let qx_ann validate successfully. I can remove it
from there, of course, or even drop the idea of such a whitelist entirely.
But we'd need some other solution then, and this one seems the best to me so
far.

 Please let me know if I'm missing something here...
VZ
