Re: [lmi] Unknown fields in table input text files


From: Greg Chicares
Subject: Re: [lmi] Unknown fields in table input text files
Date: Sun, 21 Feb 2016 17:43:12 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.5.0

On 2016-02-21 15:42, Vadim Zeitlin wrote:
> On Sun, 21 Feb 2016 15:09:04 +0000 Greg Chicares <address@hidden> wrote:
> 
> GC> On 2016-02-21 13:01, Vadim Zeitlin wrote:
> GC> > 
> GC> >  It could work like this and I do have such checks in the unit tests too,
> GC> > but what really happens for them is
> GC> > 
> GC> >         binary --> platonic --> text --> platonic --> binary
> GC> 
> GC> That's what we need...but...
> GC> 
> GC> > The table_tool --verify option avoids the last step because it didn't seem
> GC> > very useful and would require creating a temporary file (currently there is
> GC> > no support for creating binary representation in memory and I don't see any
> GC> > real need to add it), so it does just
> GC> > 
> GC> >         binary --> platonic --> text --> platonic
> GC> > 
> GC> > and compares the two C++ objects.
> GC> 
> GC> Demonstrating commutativity requires validating all five steps:
> GC>   binary --> platonic --> text --> platonic --> binary
> 
>  Yes, but the last step has unit tests and I'm relatively confident that
> it's not going to fail, i.e. if we successfully read a table from binary
> format representation, we will serialize it to exactly the same
> representation. I don't think adding the extra tests for this last step is
> worth the trouble (especially because, as mentioned before, we don't have a
> way to serialize to an in-memory buffer right now, so we'd have to add it
> just for this).

But we could do all of this by combining normal operations, e.g.:

  for z in {1..5}; do \
    table_tool --file=old_database --extract=$z; \
    table_tool --file=new_database --merge=$z.txt; \
  done

and then 'cmp old_database new_database'.

If it's not easy to build that in, then I plan to do it as above.
And probably it's not easy to build it in: at a minimum, instead
of the current '-f' to specify the binary file, we'd need '--if'
and '--of' for the same reason the 'dd' utility needs both.
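
Schematically, in C++, the built-in check I'm imagining would look
something like this. This is only a sketch with invented names:
'table' and the four load/save functions merely stand for whatever
the real code calls its platonic object and its serialization
routines.

  #include <fstream>
  #include <iterator>
  #include <stdexcept>
  #include <string>

  struct table {}; // Placeholder for the platonic C++ object.

  // Hypothetical serialization routines--not the actual lmi names.
  table       load_table_from_binary(std::string const& path);
  std::string save_table_to_text    (table const& t);
  table       load_table_from_text  (std::string const& text);
  void        save_table_to_binary  (table const& t, std::string const& path);

  // Read a whole file into a string, byte for byte.
  std::string slurp(std::string const& path)
  {
      std::ifstream is(path, std::ios::binary);
      return std::string
          (std::istreambuf_iterator<char>(is)
          ,std::istreambuf_iterator<char>()
          );
  }

  // binary --> platonic --> text --> platonic --> binary,
  // then compare the two endpoints.
  void verify_commutativity(std::string const& binary_in, std::string const& binary_out)
  {
      table t0 = load_table_from_binary(binary_in);
      table t1 = load_table_from_text(save_table_to_text(t0));
      save_table_to_binary(t1, binary_out);
      if(slurp(binary_in) != slurp(binary_out))
          {
          throw std::runtime_error("round trip is not the identity");
          }
  }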

> GC> The points I keep making that seem confusing relate to the
> GC> conversions toward the end of the chain. I'm giving testcases
> GC> that may not translate identically (and therefore losslessly)
> GC> back to the same binary file we started with.
> 
>  What I want to make clear is that the step at which such losses are
> most likely to happen is the "text --> platonic" one and the rest of them
> (e.g. due to unknown record types) will happen in the "binary --> platonic"
> step. It seems extremely unlikely that the conversion in the other direction
> doesn't give the expected result.

You're closer to that than I am, so maybe this is all obvious to you.
I'm farther away, so I'm thinking only of binary --> text --> binary
and comparing the endpoints. Thus:

> GC> > GC> If that's the condition tested by '--verify',
> GC> > GC> then the content of a binary "Comment" (VT) record like
> GC> > GC>   <VT> [bytecount] "\nContributor: foo\nNonexistent: bar"
> GC> > GC> must not be parsed as markup for a "Contributor" (EOT) record and
> GC> > GC> a "Nonexistent" record.
> GC> > 
> GC> >  Yes, of course, and it is not parsed like this or even "parsed" in any way
> GC> > at all. As I said, there is no problem with reading the data in binary
> GC> > format, the only questions are in the "text --> platonic" step.
> GC> 
> GC> Yes, or in the subsequent "[text -->] platonic --> binary" step.
> 
>  Sorry, either I misunderstand the above or I disagree. But, to be clear,
> no, I don't have any _questions_ about the "platonic --> binary" step,
> there is no possible ambiguity in it and while bugs are always possible, it
> is covered by the full round trip tests on qx_cso and qx_idx in the unit
> test, so I'm relatively confident in it.

You are deep in this code you're writing, and you possess a lemma
that says no error is possible in the "platonic --> binary" step.
I don't have a proof of that lemma, and I want to be as rigorous
as I can be, so I allow for the possibility that it's not true. I'm
not actually asserting that it's false; I'm just saying I don't
personally have sufficient reason to know that it's true.

> GC> I agree. Introducing a typo in a header name when editing the text
> GC> file means that the modified text file cannot be merged back into
> GC> the binary file as intended. In this introduced-typo case, what
> GC> was originally a "Contributor" record would become part of the
> GC> content of a "Comments" record. The program you're writing in
> GC> principle cannot diagnose this error. It must merge the text as a
> GC> "Comments" record. It might issue a warning, which would catch this
> GC> user error and would also flag the "Editor:" contents as a possible
> GC> mistake.
> 
>  OK, so I'll use warning() for such warnings. But the question remains:
> should I have a whitelist to avoid warnings for known non-typos? I.e. right
> now you would be getting warnings for "Editor:" which is not as bad as
> getting an error for it, but still annoying. Is it worth keeping the
> (already implemented) whitelist to avoid these warnings?

First we have to agree on an operative definition of "whitelist"...

> GC> That's why we need to run the commutativity test against all the
> GC> files we have, and inspect the failures. Only that way can we be
> GC> sure we have the full whitelist.
> 
>  I still don't understand this :-( Whitelist applies to the conversions
> from text only. There is no need for the full commutativity test, including
> conversion to binary, to test for it. It doesn't come into play at all when
> converting to binary format.

There's the misunderstanding (or one misunderstanding). I've been using
"whitelist" to mean
  {Table name, Contributor, Published reference, Comments...}
i.e., the set of all valid record types: in the text format, a marker
consisting of a string on that list, preceded by either nothing or by
a newline, and followed by a colon and a space. Thus, "Table name" is
on the whitelist, so /\n*Table name: / is a regex that the text-format
parser should recognize. Maybe there's an unknown "New field" that we
might have to recognize, though we don't know that yet. And we know we
have at least one instance of /\n*Editor: / which does not signify a
new record; I would say that "Editor" is not on the whitelist. The
whole point of my "whitelist" is to *exclude* it.
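
In code, I mean roughly this (a sketch only; the container and the
function name are invented here, and the set is abbreviated):

  #include <set>
  #include <string>

  // The "whitelist" as I mean it: the set of all valid record types.
  std::set<std::string> const record_markers =
      {"Table name", "Contributor", "Published reference", "Comments"};

  // Does this line of the text format begin a new record?  Only if the
  // part before the first ": " is on that list--so "Editor: ..." is
  // content, not markup, precisely because "Editor" is excluded.
  bool starts_new_record(std::string const& line)
  {
      std::string::size_type const p = line.find(": ");
      return std::string::npos != p && 0 != record_markers.count(line.substr(0, p));
  }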

OTOH, your "whitelist" consists exactly of things like "Editor:" that
might appear to be record-inception-markers in the text format but are
in fact merely content and not markup. You plan to use this to avoid
noisy warnings about things like "Editor:" that occur frequently enough
to be a nuisance, but are known not to be real record-inception-markers.
The whole point of your "whitelist" is to *include* "Editor:".

Is it worth keeping that list of warnings that need not be given? Yes.
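
Something along these lines, I imagine--the names here are illustrative
only, and I'm assuming the usual warning() from 'alert.hpp' since you
mentioned using it:

  #include "alert.hpp"                   // warning()

  #include <set>
  #include <string>

  // Your "whitelist": strings that look like record-inception markers
  // in the text format but are known to be mere content.  Keeping it
  // means we don't warn on every table containing "Editor: ...".
  std::set<std::string> const known_non_markers = {"Editor", "WARNING"};

  // Called for a "Something: " line whose prefix is not a valid record type.
  void note_possible_typo(std::string const& word)
  {
      if(0 == known_non_markers.count(word))
          {
          warning()
              << "'" << word << ":' resembles a record marker but is not"
              << " a known record type; treating it as record contents."
              << std::flush
              ;
          }
  }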

> GC> Phase I: The program prints a warning, and the round-trip conversion
> GC> may fail. Upon manual inspection of warnings and failures, we may
> GC> discover undocumented fields. I want to emphasize that I have in fact
> GC> discovered at least one undocumented record type.
> 
>  In the binary format?

Yes--for example, an unknown record type 19, outside the enumeration
{1..18,9999} given in the old SOA code.
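
Schematically, the check at that very first step amounts to this
(hypothetical names again; the point is only that a type such as 19 is
outside the documented set and must be caught while reading the binary
records):

  #include <cstdint>
  #include <sstream>
  #include <stdexcept>

  // Record types documented in the old SOA code: 1..18, plus 9999.
  bool is_documented_record_type(std::uint16_t t)
  {
      return (1 <= t && t <= 18) || 9999 == t;
  }

  void check_record_type(std::uint16_t t)
  {
      if(!is_documented_record_type(t))
          {
          std::ostringstream oss;
          oss << "unknown record type " << t << " in binary table";
          throw std::runtime_error(oss.str());
          }
  }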

> This is a very important new fact. But it still
> doesn't help with my confusion above because such an undocumented record type
> must have resulted in an error when reading the database, i.e. at the very
> first step.

Yes, it did result in an unrecoverable error, which required changing
the program.

> And to fix this we need to add support for this record type,
> changing the whitelist is neither necessary nor sufficient for this.

We'd need to add it to the list of legitimate record types, which I was
calling a "whitelist".

We certainly wouldn't add it to the list of known strings that look like
record markers but are known not to be--which is what you've been calling
a "whitelist".

Now, inadvertently expanding the confusion by trying to constrain it,
I said:

> GC> It is never required to inhibit reasonable warnings.

We were talking past each other, and here I was trying to step back and
declare the absolute minimum requirements for the program to be usable at
all. This is kind of like rationalizing that gcc is a "conforming
implementation" when it triggers a "pure virtual function called" message
at runtime, upon encountering the inheritance-namespace oddity we recently
discussed. There, gcc might have given a reasonable warning, which would
have saved us a lot of effort, and that certainly is a QoI issue, even
though we might be forced to admit that it's not an absolute requirement
imposed by the language standard.

If table_tool were to inhibit reasonable warnings, that would be a QoI
issue; yet it would perform correctly for correct input.

I do not want to decrease QoI by removing reasonable code that you have
already written to suppress known-bogus warnings.

>  This seems to answer my question above: if you don't consider warnings,
> even about things such as "WARNING:" in the comments, a problem, then we
> don't need a whitelist at all. I am not sure if it's a good idea because
> the warnings risk being common and hence annoying, but so be it.

I hope I've explained that better now.

I was trying to explain what my "whitelist" couldn't possibly do. But
I want to focus on this, which is what my "whitelist" is intended for:

> GC> It is never permissible to terminate with lawful input.
> 
>  The question here is, of course, about what exactly constitutes lawful
> input.

That is indeed the question. SOA has given us a program and documentation,
and they have published tables that conform to unpublished documentation
that extends the cognizable record types. How can we discover their
extensions? By running all the tables that we actually use through a
commutativity tester. There is no other way than exhaustive testing.

> E.g. the checks on the input file format are rather strict right
> now, for example the age must be specified as "  1" (always 3 positions)
> and there must be exactly the given number of digits (as specified by the
> "Number of decimal places") in all the values. We could easily be more
> forgiving and only warn about using "1" for the age and "0.12" for the
> value instead of "0.12000" when the number of decimal places is 5. I had
> understood that we didn't want to do this, but now I suspect I could have
> been wrong about this. Was I?

"  1" for age: okay, that's acceptable, and arguably good. It wouldn't
be catastrophic if we accepted any number of leading spaces, but there's
no good reason why we should.

"0.12000" with all five decimal places specified, whenever the number
of decimals is given as five: yes, that's good; it's arguably wrong to
accept "0.12"; and it's certainly wrong to accept "0.123456789".
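
In code, the checks I consider appropriate look roughly like this (a
sketch of the canonical-format rules only, not the actual parser):

  #include <cstddef>
  #include <string>

  // An age occupies exactly three positions: "  1", " 10", "100".
  bool is_canonical_age(std::string const& s)
  {
      if(3 != s.size()) return false;
      std::string::size_type i = 0;
      while(i < s.size() && ' ' == s[i]) ++i;              // leading spaces
      if(i == s.size()) return false;                      // spaces only
      for(; i < s.size(); ++i)
          if(s[i] < '0' || '9' < s[i]) return false;       // then digits
      return true;
  }

  // A value has exactly 'decimals' characters after the point, so
  // "0.12000" is accepted when decimals is 5, and "0.12" is not.
  // (Digit-by-digit checking is omitted in this sketch.)
  bool is_canonical_value(std::string const& s, std::size_t decimals)
  {
      std::string::size_type const p = s.find('.');
      if(std::string::npos == p) return 0 == decimals;
      return s.size() - p - 1 == decimals;
  }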

Rather than discuss every field and decide what deviations from the
canonical might reasonably be accepted, I prefer to abide by a simple
rule: text input must be in canonical format. Thus,

>  If we really need to try to do our best with the input we're given,
> instead of only accepting the input in the strictly defined format, then
> quite a few changes to the code will be needed...

, such changes would not be worthwhile.

> GC> Where you write "Give an error" I assume, perhaps incorrectly, that
> GC> you mean "...and terminate the program". Thus, I read (2) and (3)
> GC> as leading to termination, which is not wanted. A "warning" is
> GC> desirable, after which the program proceeds to do the best it can.
> 
>  An error means an exception and an exception can be caught and handled, of
> course. In table_tool this is not currently done and any exception indeed
> results in program termination.

Consider the "Editor: " example.

Wrong behavior: terminate the program.

Minimally acceptable behavior: print a diagnostic and continue processing.

Ideal behavior: suppress the potential diagnostic because "Editor: " is
known to be on our list of strings that look like record markers but aren't.

> GC> >  Also, I am still not speaking at all about binary files here. There may
> GC> > be a problem with unknown record types in them too,
> GC> 
> GC> That is a problem that I have actually encountered in the past.
> GC> Unknown record types must cause a commutativity test to fail.
> 
>  They will, at the very first step. And here, again, "fail" means failing
> to create the table entirely.

Or, if we start from the binary format, we'd "fail", meaning that we
couldn't read the data into our platonic ideal, much less emit that
ideal as text.

> GC> Print an informational message and continue processing.
> 
>  I'll do this as soon as you can confirm that you really don't want to have
> a whitelist to avoid warnings about known non-typos.

No, I do want to have that "whitelist". It improves QoI.
