Re: [lmi] Unknown fields in table input text files


From: Greg Chicares
Subject: Re: [lmi] Unknown fields in table input text files
Date: Sun, 21 Feb 2016 15:09:04 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.5.0

On 2016-02-21 13:01, Vadim Zeitlin wrote:
> On Sun, 21 Feb 2016 12:38:51 +0000 Greg Chicares <address@hidden> wrote:
> 
> GC> On 2016-02-20 15:12, Vadim Zeitlin wrote:
> GC> > On Sat, 20 Feb 2016 14:33:01 +0000 Greg Chicares <address@hidden> wrote:
> GC> > 
> GC> > GC> Yes, to the naive reader, it would appear that this file has a novel
> GC> > GC> "Editor" record; but in fact it does not. Therefore...
> GC> > 
> GC> >  Sorry, I'm afraid I didn't explain the problem well at all and wasted your
> GC> > time with all these investigations. Let me try to explain it once again:
> GC> > 
> GC> >  There are 2 formats for the tables. One of them is the binary format used
> GC> > in the .dat files. The other one is the text format used for the output of
> GC> > --extract table_tool option and the input of its --merge option. There is
> GC> > no problem whatsoever with the binary format, the question of this thread
> GC> > is about how to handle unknown words followed by colons in the text files
> GC> > only.
> GC> > 
> GC> >  E.g. the round-trip test done by table_tool --verify option reads all
> GC> > tables from the binary database, converts each of them to text and then
> GC> > parses this text as a table and compares that it obtains exactly the same
> GC> > table as the original one.
> GC> 
> GC> Where you say it "parses this text as a table", do you mean exactly the
> GC> same thing as converting the text back into binary format...so that we
> GC> have only two concepts, text and binary?
> GC> 
> GC> Or are you thinking of the "table" that you compare as the platonic
> GC> ideal, which has two incidental projections, one onto a text format, and
> GC> the other onto a binary format...neither of which is the real "table"?
> 
>  The second view describes better how things really work because the C++
> "table" object can indeed be seen as an abstract representation of a table
> which just happens to be convertible to and from its text and binary
> representation.

Okay, but...

> GC> I was thinking of it the first way, so that '--verify' converts
> GC>   binary format --> text format --> binary format
> GC> and the round-trip test is satisfied if the binary input and output
> GC> are bit-for-bit identical.
> 
>  It could work like this and I do have such checks in the unit tests too,
> but what really happens for them is
> 
>       binary --> platonic --> text --> platonic --> binary

That's what we need...but...

> The table_tool --verify option avoids the last step because it didn't seem
> very useful and would require creating a temporary file (currently there is
> no support for creating binary representation in memory and I don't see any
> real need to add it), so it does just
> 
>       binary --> platonic --> text --> platonic
> 
> and compares the two C++ objects.

Demonstrating commutativity requires validating the whole chain:
  binary --> platonic --> text --> platonic --> binary
That end-to-end process is a typical and important "workflow":
we extract a table that's known to be defective, repair its
defects in the text format, and merge the modified text back
into the binary database.

The points I keep making that seem confusing relate to the
conversions toward the end of the chain. I'm giving testcases
that may not translate identically (and therefore losslessly)
back to the same binary file we started with.
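
A minimal sketch of the full chain, assuming toy formats rather than
the real ones: the "platonic" table is a plain dict, the binary image
is a length-prefixed key/value pair per field, and KNOWN is a
hypothetical whitelist. None of these names come from table_tool.

```python
import struct

# Hypothetical whitelist of field names recognized in the text format.
KNOWN = ("Name", "Comments", "Contributor")

def from_binary(blob):
    """binary --> platonic (toy layout: two uint16 lengths, key, value)."""
    table, pos = {}, 0
    while pos < len(blob):
        klen, vlen = struct.unpack_from("<HH", blob, pos)
        pos += 4
        key = blob[pos:pos + klen].decode("utf-8")
        pos += klen
        table[key] = blob[pos:pos + vlen].decode("utf-8")
        pos += vlen
    return table

def to_binary(table):
    """platonic --> binary."""
    out = b""
    for key, val in table.items():
        k, v = key.encode("utf-8"), val.encode("utf-8")
        out += struct.pack("<HH", len(k), len(v)) + k + v
    return out

def to_text(table):
    """platonic --> text, one 'Header: value' entry per field."""
    return "".join(f"{k}: {v}\n" for k, v in table.items())

def from_text(text):
    """text --> platonic; only whitelisted headers start a new field."""
    table, current = {}, None
    for line in text.splitlines():
        head, sep, rest = line.partition(": ")
        if sep and head in KNOWN:
            current = head
            table[current] = rest
        elif current is not None:
            table[current] += "\n" + line  # continuation of previous field
    return table

def round_trip(blob):
    """binary --> platonic --> text --> platonic --> binary."""
    return to_binary(from_text(to_text(from_binary(blob))))
```

In this toy model a "Comments" value containing "\nEditor: ..." comes
back bit-for-bit identical, while one containing "\nContributor: foo"
does not, because the text image of that content is indistinguishable
from a real "Contributor" header.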

> GC> If that's the condition tested by '--verify',
> GC> then the content of a binary "Comment" (VT) record like
> GC>   <VT> [bytecount] "\nContributor: foo\nNonexistent: bar"
> GC> must not be parsed as markup for a "Contributor" (EOT) record and
> GC> a "Nonexistent" record.
> 
>  Yes, of course, and it is not parsed like this or even "parsed" in any way
> at all. As I said, there is no problem with reading the data in binary
> format, the only questions are in "text --> platonic" step.

Yes, or in the subsequent "[text -->] platonic --> binary" step.

> GC> Yes, and I think the only way to prevent unrecognized record types is
> GC> to accept only recognized record types. In the motivating case:
> GC> 
> GC> Comments: These are supposed to represent the expected mortality of pensioners from
> GC> the generation born in 1950, updated through 1990-92 census results.
> GC> This is from the diskette available with
> GC> "The Second Actuarial Study of Mortality in Europe"
> GC> Editor: A.S.MacDonald
> GC> 
> GC> "Comments:" tags a record; "Editor:" does not, so it must be mere
> GC> content in a "Comments" record.
> 
>  This is easy to do, of course, but I'd just like to note that we could
> also have
> 
>       Comments: Whatever
>       Contributer: A.U.Thor
> 
> and this would still be parsed as a single "Comments" field because of the
> typo in the "Contributor" header. I only check for the valid field names to
> avoid problems like this. Do you think this has no value? My understanding
> was that these text files were created manually, so checking for typos in
> them seemed like a good idea.

I agree. Introducing a typo in a header name when editing the text
file means that the modified text file cannot be merged back into
the binary file as intended. In this introduced-typo case, what
was originally a "Contributor" record would become part of the
content of a "Comments" record. The program you're writing cannot,
in principle, diagnose this as an error: it must merge the text as
a "Comments" record. It might issue a warning, though, which would
catch this user error and would also flag the "Editor:" contents as
a possible mistake.
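
To illustrate the behavior described here, a rough sketch follows. It
assumes a hypothetical two-entry whitelist and a hypothetical
parse_fields() helper (neither is the actual table_tool code): an
unrecognized "Word:" line is kept as content of the preceding field,
and a purely advisory warning is collected, with a spelling hint when
a known header is close.

```python
import difflib

# Hypothetical whitelist; the real set of headers may well be larger.
KNOWN_FIELDS = ("Contributor", "Comments")

def parse_fields(text):
    """Return ([name, value] pairs in file order, advisory warnings)."""
    fields = []
    warnings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        head, sep, rest = line.partition(":")
        if sep and head in KNOWN_FIELDS:
            fields.append([head, rest.strip()])
            continue
        if sep and head.isalpha():
            # Looks like a header but is not whitelisted: warn, don't stop.
            msg = f"line {lineno}: '{head}:' is not a known field; kept as content"
            hint = difflib.get_close_matches(head, KNOWN_FIELDS, n=1)
            if hint:
                msg += f" (did you mean '{hint[0]}:'?)"
            warnings.append(msg)
        if fields:
            # Everything else belongs to the value of the preceding field.
            fields[-1][1] += "\n" + line
    return fields, warnings
```

With the "Contributer:" and "Editor:" examples from this thread, both
lines end up inside the "Comments" value and each produces one
warning; processing is never halted.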

> GC> >  And the question is whether we should:
> GC> > 
> GC> > 1. Just silently ignore all unknown fields ("ignore" means considering them
> GC> >    to be part of the value of the previous line).
> GC> > 2. Give an error about them (as the code used to behave).
> GC> > 3. Give an error about them except if they're in a (hopefully short) list
> GC> >    of known non-fields (as the latest version of the code does).
> GC> 
> GC> You're asking what to do about "unknown fields" like "Editor:" above.
> GC> I'm saying they aren't fields, which implies (1).
> 
>  Formally you're right, of course. I just thought we could be slightly more
> helpful.

Yes, this is a QoI issue, and it is better to give a purely advisory
warning before proceeding. It would be wrong to halt when "Editor:"
is found.

> GC> However, there's a problem. We don't necessarily know the full set of
> GC> record-types, i.e., the true "fields", because the SOA may have expanded
> GC> the set after publishing that code. The full set can be found only by
> GC> examining each binary-format record in every file.
> 
>  OK, this is a completely different problem. Right now unknown records in
> the binary format result in an error as well. I don't think we can
> reasonably do anything else with them because ignoring them and losing data
> doesn't seem appealing.

That's why we need to run the commutativity test against all the
files we have, and inspect the failures. Only that way can we be
sure we have the full whitelist.
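
A sketch of the kind of survey that could establish the whitelist,
assuming (this layout is an assumption, not taken from the published
code) that each binary record is a little-endian uint16 type code, a
uint16 byte count, and then that many bytes of payload. KNOWN_TYPES
and its numeric codes are made up for illustration.

```python
import struct
from collections import Counter

# Hypothetical whitelist of record type codes.
KNOWN_TYPES = {1, 2, 3}

def record_types(blob):
    """Yield the type code of each record, skipping over its payload."""
    pos = 0
    while pos < len(blob):
        rtype, length = struct.unpack_from("<HH", blob, pos)
        yield rtype
        pos += 4 + length

def unknown_types(blobs):
    """Count occurrences of non-whitelisted type codes across all inputs."""
    seen = Counter()
    for blob in blobs:
        seen.update(t for t in record_types(blob) if t not in KNOWN_TYPES)
    return seen
```

Running something like unknown_types() over every file we have, and
inspecting whatever it reports, is one way to find record types that
were added after the code was published.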

> GC> The whitelist is indispensable. Without a full whitelist, when we
> GC> encounter "\nfoo: " while parsing the text format, we cannot say
> GC> whether it's the text image of a binary record, or merely part of
> GC> the content of some record.
> 
>  Sorry, I'm very confused by this as now you seem to be saying that we
> should be doing my (3) above while previously you wrote it should be (1).

Phase I: The program prints a warning, and the round-trip conversion
may fail. Upon manual inspection of warnings and failures, we may
discover undocumented fields. I want to emphasize that I have in fact
discovered at least one undocumented record type.

Phase II: The program is revised to reflect any newly-discovered
field. Round-trip conversion always succeeds. There may still be
warnings about (e.g.) "Editor:".

>  To summarize: currently the code does (3) using a very small whitelist
> which will probably need to be extended if we keep doing this (but I'll
> have to rely on you to run --verify on the tables you use to check this). I

Yes, that's exactly what I have in mind.

> can easily change the code to do (1) instead and, in fact, it would be
> simpler than the current version, but then we'll lose the possibility of
> detecting typos in the input text files.

It is never required to inhibit reasonable warnings.

It is never permissible to terminate on lawful input.

Where you write "Give an error" I assume, perhaps incorrectly, that
you mean "...and terminate the program". Thus, I read (2) and (3)
as leading to termination, which is not wanted. A "warning" is
desirable, after which the program proceeds to do the best it can.

>  Also, I am still not speaking at all about binary files here. There may
> be a problem with unknown record types in them too,

That is a problem that I have actually encountered in the past.
Unknown record types must cause a commutativity test to fail.

> but it's a different
> problem and I'd like to avoid discussing it in this thread because IMO it's
> quite separate.
> 
>  But the urgent question right now is the choice between (1) and (3) and I
> still don't know which one do you prefer, could you please let me know?

Print an informational message and continue processing.



