Re: [lmi] Unknown fields in table input text files


From: Vadim Zeitlin
Subject: Re: [lmi] Unknown fields in table input text files
Date: Sun, 21 Feb 2016 14:01:39 +0100

On Sun, 21 Feb 2016 12:38:51 +0000 Greg Chicares <address@hidden> wrote:

GC> On 2016-02-20 15:12, Vadim Zeitlin wrote:
GC> > On Sat, 20 Feb 2016 14:33:01 +0000 Greg Chicares <address@hidden> wrote:
GC> > 
GC> > GC> Yes, to the naive reader, it would appear that this file has a novel
GC> > GC> "Editor" record; but in fact it does not. Therefore...
GC> > 
GC> >  Sorry, I'm afraid I didn't explain the problem well at all and wasted
GC> > your time with all these investigations. Let me try to explain it once
GC> > again:
GC> > 
GC> >  There are 2 formats for the tables. One of them is the binary format used
GC> > in the .dat files. The other one is the text format used for the output of
GC> > --extract table_tool option and the input of its --merge option. There is
GC> > no problem whatsoever with the binary format, the question of this thread
GC> > is about how to handle unknown words followed by colons in the text files
GC> > only.
GC> > 
GC> >  E.g. the round-trip test done by the table_tool --verify option reads
GC> > all tables from the binary database, converts each of them to text and
GC> > then parses this text as a table, verifying that it obtains exactly the
GC> > same table as the original one.
GC> 
GC> Where you say it "parses this text as a table", do you mean exactly the
GC> same thing as converting the text back into binary format...so that we
GC> have only two concepts, text and binary?
GC> 
GC> Or are you thinking of the "table" that you compare as the platonic
GC> ideal, which has two incidental projections, one onto a text format, and
GC> the other onto a binary format...neither of which is the real "table"?

 The second view better describes how things really work: the C++
"table" object can indeed be seen as an abstract representation of a table
which just happens to be convertible to and from its text and binary
representations.
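
 To illustrate the idea (with invented names; this is only a sketch, not
lmi's actual interface), the abstraction could be pictured as:

        // Hypothetical sketch only; invented names, not lmi's real class.
        #include <iosfwd>

        class table
        {
          public:
            // Construct the abstract table from either projection.
            static table read_from_binary(std::istream& is);
            static table read_from_text(std::istream& is);

            // Project the abstract table onto either format.
            void write_binary(std::ostream& os) const;
            void write_text(std::ostream& os) const;

            bool operator==(table const& rhs) const;
        };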

GC> I was thinking of it the first way, so that '--verify' converts
GC>   binary format --> text format --> binary format
GC> and the round-trip test is satisfied if the binary input and output
GC> are bit-for-bit identical.

 It could work like this, and I do have such checks in the unit tests too,
but what really happens there is

        binary --> platonic --> text --> platonic --> binary

The table_tool --verify option omits the last step because it didn't seem
very useful and would require creating a temporary file (currently there is
no support for creating the binary representation in memory, and I don't
see any real need to add it), so it does just

        binary --> platonic --> text --> platonic

and compares the two C++ objects.
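
 In terms of the hypothetical interface sketched above (again, invented
names, not the real code), the round trip amounts to:

        // Illustrative sketch of the --verify round trip, reusing the
        // invented interface above.
        #include <sstream>
        #include <stdexcept>

        void verify_one_table(std::istream& dat_stream)
        {
            table const original = table::read_from_binary(dat_stream);

            std::stringstream text;
            original.write_text(text);

            table const reparsed = table::read_from_text(text);
            if(!(original == reparsed))
                throw std::runtime_error("round-trip mismatch");
        }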

GC> If that's the condition tested by '--verify',
GC> then the content of a binary "Comment" (VT) record like
GC>   <VT> [bytecount] "\nContributor: foo\nNonexistent: bar"
GC> must not be parsed as markup for a "Contributor" (EOT) record and
GC> a "Nonexistent" record.

 Yes, of course, and it is not parsed like this, or indeed "parsed" in any
way at all. As I said, there is no problem with reading the data in binary
format; the only questions are in the "text --> platonic" step.

GC> Yes, and I think the only way to prevent unrecognized record types is
GC> to accept only recognized record types. In the motivating case:
GC> 
GC> Comments: These are supposed to represent the expected mortality of
GC> pensioners from the generation born in 1950, updated through 1990-92
GC> census results.
GC> This is from the diskette available with
GC> "The Second Actuarial Study of Mortality in Europe"
GC> Editor: A.S.MacDonald
GC> 
GC> "Comments:" tags a record; "Editor:" does not, so it must be mere
GC> content in a "Comments" record.

 This is easy to do, of course, but I'd just like to note that we could
also have

        Comments: Whatever
        Contributer: A.U.Thor

and this would still be parsed as a single "Comments" field because
"Contributer" is a misspelling of the "Contributor" header. I check for
valid field names only to avoid problems like this. Do you think this has
no value? My understanding was that these text files were created manually,
so checking for typos in them seemed like a good idea.
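
 Concretely, the check amounts to something like this (a sketch with an
invented helper name, and only an abbreviated list of the real field names):

        #include <set>
        #include <string>

        // Is 'name' one of the recognized field names? Only a few of
        // the real field names are listed here, for brevity.
        bool is_known_field(std::string const& name)
        {
            static std::set<std::string> const known_fields
                {"Table name", "Table number", "Contributor", "Comments"};
            return known_fields.count(name) != 0;
        }

so that a line like "Contributer: A.U.Thor" fails the lookup and can be
reported as a probable typo instead of being silently folded into the value
of the preceding "Comments" field.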

GC> >  And the question is whether we should:
GC> > 
GC> > 1. Just silently ignore all unknown fields ("ignore" means considering
GC> >    them to be part of the value of the previous line).
GC> > 2. Give an error about them (as the code used to behave).
GC> > 3. Give an error about them except if they're in a (hopefully short) list
GC> >    of known non-fields (as the latest version of the code does).
GC> 
GC> You're asking what to do about "unknown fields" like "Editor:" above.
GC> I'm saying they aren't fields, which implies (1).

 Formally you're right, of course. I just thought we could be slightly more
helpful.

GC> However, there's a problem. We don't necessarily know the full set of
GC> record-types, i.e., the true "fields", because the SOA may have expanded
GC> the set after publishing that code. The full set can be found only by
GC> examining each binary-format record in every file.

 OK, this is a completely different problem. Right now unknown records in
the binary format result in an error as well. I don't think we can
reasonably do anything else with them because ignoring them and losing data
doesn't seem appealing.

GC> The whitelist is indispensable. Without a full whitelist, when we
GC> encounter "\nfoo: " while parsing the text format, we cannot say
GC> whether it's the text image of a binary record, or merely part of
GC> the content of some record.

 Sorry, I'm very confused by this: now you seem to be saying that we should
do my (3) above, while previously you wrote that it should be (1).

 To summarize: currently the code does (3), using a very small whitelist
which will probably need to be extended if we keep doing this (but I'll
have to rely on you to run --verify on the tables you use to check this). I
can easily change the code to do (1) instead, and in fact it would be
simpler than the current version, but then we'd lose the ability to detect
typos in the input text files.
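
 In code terms, the difference between (1) and (3) is confined to a single
decision point in the parser, roughly (invented identifiers again):

        #include <stdexcept>
        #include <string>

        // Invented helper: is 'name' on the short whitelist of known
        // non-fields, such as "Editor"?
        bool is_whitelisted_non_field(std::string const& name);

        // Called when the word before the colon on the current line is
        // not one of the recognized field names.
        void handle_unknown_header(std::string const& header)
        {
            // Option (1): always treat the line as a continuation of
            // the previous field's value: silently do nothing here.
            //
            // Option (3): do that only for whitelisted non-fields;
            // otherwise report a probable typo:
            if(!is_whitelisted_non_field(header))
                throw std::runtime_error("unknown field '" + header + "'");
        }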

 Also, I am still not speaking at all about binary files here. There may
be a problem with unknown record types in them too, but it's a different
problem and I'd like to avoid discussing it in this thread because IMO it's
quite separate.

 But the urgent question right now is the choice between (1) and (3), and I
still don't know which one you prefer; could you please let me know?

 Thanks in advance,
VZ
