From: Vadim Zeitlin
Subject: Re: [lmi] Unknown fields in table input text files
Date: Sun, 21 Feb 2016 19:12:19 +0100

On Sun, 21 Feb 2016 17:43:12 +0000 Greg Chicares <address@hidden> wrote:

GC> On 2016-02-21 15:42, Vadim Zeitlin wrote:
GC> > On Sun, 21 Feb 2016 15:09:04 +0000 Greg Chicares <address@hidden> wrote:
GC> > 
GC> > GC> Demonstrating commutativity requires validating all five steps:
GC> > GC>   binary --> platonic --> text --> platonic --> binary
GC> > 
GC> >  Yes, but the last step has unit tests and I'm relatively confident that
GC> > it's not going to fail, i.e. if we successfully read a table from binary
GC> > format representation, we will serialize it to exactly the same
GC> > representation. I don't think adding the extra tests for this last step are
GC> > worth the trouble (especially because, as mentioned before, we don't have a
GC> > way to serialize to an in-memory buffer right now, so we'd have to add it
GC> > just for this).
GC> 
GC> But we could do all of this by combining normal operations, e.g.:
GC> 
GC>   for z in 1..5; do \
GC>     table_tool --file=old_database --extract=$z; \
GC>     table_tool --file=new_database --merge=$z.txt; \
GC>   done
GC> 
GC> and then 'cmp old_database new_database'.

 This won't work, actually. The reason is that the order of the tables in
the original .dat disk file is lost when doing this, and I see no benefit
in preserving it (not that preserving it would even be possible when merging
table by table anyhow).

 To avoid any future misunderstandings, please note that the order of the
tables in the .dat file is, generally speaking, different from the order of
the tables in the .ndx index file. The latter order *is* preserved when the
database is written back to disk (although it's still different from the
numerical order of the tables, so the naive loop above wouldn't work even if
the .dat and .ndx orders were the same). But the former order is not
preserved: if you have tables "1" and "2" in the index, you could have "2"
and "1" in the original .dat file, but you will have "1" and "2" in the new
.dat file, and comparing them byte by byte won't succeed.

 So to properly implement such a round-trip test, we need to create the new
database, load back all the tables from it, and check that they are identical
to the original ones. This is what the existing unit test does for qx_cso and
qx_ins, and I could extend the table_tool --verify option to do it as well.
It's just that this hardly looks to me like the most important thing to do.
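
 For concreteness, here is a minimal C++ sketch of the difference between
the byte-for-byte comparison and the table-by-table one. The 'table' and
'dat_file' types below are illustrative stand-ins, not the real lmi classes:

    // Sketch only: hypothetical stand-in types, not lmi's own classes.
    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    struct table
    {
        std::uint32_t number;
        std::string   contents; // canonical contents, e.g. the text form
    };

    bool operator==(table const& a, table const& b)
    {
        return a.number == b.number && a.contents == b.contents;
    }

    // A .dat file modeled as a sequence of tables in on-disk order.
    using dat_file = std::vector<table>;

    // Logical equality: the same tables, regardless of on-disk order.
    bool logically_equal(dat_file a, dat_file b)
    {
        auto by_number = [](table const& x, table const& y)
            {return x.number < y.number;};
        std::sort(a.begin(), a.end(), by_number);
        std::sort(b.begin(), b.end(), by_number);
        return a == b;
    }

    int main()
    {
        dat_file const original  = {{2, "table two"}, {1, "table one"}};
        dat_file const rewritten = {{1, "table one"}, {2, "table two"}};
        // 'cmp' on the raw files corresponds to this, and fails:
        std::cout << (original == rewritten) << '\n';              // 0
        // The round-trip test we actually need, which succeeds:
        std::cout << logically_equal(original, rewritten) << '\n'; // 1
    }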

GC> If it's not easy to build that in, then I plan to do it as above.

 It's slightly bothersome to build this in, but it will have to be done,
because doing it from outside would be much more bothersome still.


GC> You are deep in this code you're writing, and you possess a lemma
GC> that says no error is possible in the "platonic --> binary" step.
GC> I don't have a proof of that lemma, and I want to be as rigorous
GC> as I can be, so I allow for the possibility that it's not true. I'm
GC> not actually asserting that it's false; I'm just saying I don't
GC> personally have sufficient reason to know that it's true.

 I don't disagree with this, and I don't categorically assert that no error
is possible in this step either. I'm just saying that this is a completely
different question from the one I had asked originally (which, I think, was
a big part of my confusion, although the impedance mismatch over the
"whitelist" was another part) and that addressing it doesn't help with
answering the original question at all.


GC> There's the misunderstanding (or one misunderstanding). I've been using
GC> "whitelist" to mean
GC>   {Table name, Contributor, Published reference, Comments...}

 OK, I can confirm the misunderstanding: the whitelist never meant this to
me. To me, this list is rather the list of the known record types.

GC> OTOH, your "whitelist" consists exactly of things like "Editor:" that
GC> might appear to be record-inception-markers in the text format but are
GC> in fact merely content and not markup. You plan to use this to avoid
GC> noisy warnings about things like "Editor:" that occur frequently enough
GC> to be a nuisance, but are known not to be real record-inception-markers.
GC> The whole point of your "whitelist" is to include "Editor:".

 Yes, exactly.
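
 In other words, the check I have in mind is roughly the following sketch;
the field names and whitelist entries here are illustrative only, not the
real lists:

    #include <iostream>
    #include <set>
    #include <string>

    // Field names that really do start a record in the text format
    // (illustrative subset only).
    std::set<std::string> const known_fields
        {"Table name", "Contributor", "Published reference", "Comments"};

    // Strings that look like field markers but are known to be mere
    // content, so no warning is given for them (illustrative entry).
    std::set<std::string> const whitelist
        {"Editor"};

    // Decide how to treat a line of the form "Name: ...".
    void classify(std::string const& line)
    {
        auto const colon = line.find(':');
        if(colon == std::string::npos)
            return;                          // not field-like at all
        std::string const name = line.substr(0, colon);
        if(known_fields.count(name))
            std::cout << "record starts: " << name << '\n';
        else if(whitelist.count(name))
            ;                                // known false positive: content
        else
            std::cerr << "warning: possible unknown field '" << name << "'\n";
    }

    int main()
    {
        classify("Contributor: J. Doe");   // a real field
        classify("Editor: R. Roe");        // whitelisted, no warning
        classify("Frobnicator: oops");     // warns
    }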

GC> Is it worth keeping that list of warnings that need not be given? Yes.

 Very good, thanks.

GC> > GC> Phase I: The program prints a warning, and the round-trip conversion
GC> > GC> may fail. Upon manual inspection of warnings and failures, we may
GC> > GC> discover undocumented fields. I want to emphasize that I have in fact
GC> > GC> discovered at least one undocumented record type.
GC> > 
GC> >  In the binary format?
GC> 
GC> Yes--for example, an unknown record type 19, outside the enumeration
GC> {1..18,9999} given in the old SOA code.

 This presumably needs to be addressed before table_tool can be used, so
how should the records of this type be handled? What is the corresponding
text format representation?
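
 One possibility, shown purely as a sketch, would be to keep such records
as opaque byte blobs and warn about them instead of failing. The layout
assumed below (a little-endian 16-bit record type followed by a 16-bit
length) is only my reading of the binary format, so please correct me if
it's wrong:

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <stdexcept>
    #include <vector>

    struct record
    {
        std::uint16_t type;
        std::vector<unsigned char> payload;
    };

    // Read all records from a raw table buffer, preserving unknown types
    // (such as 19) verbatim instead of failing outright.
    std::vector<record> read_records(std::vector<unsigned char> const& buf)
    {
        std::vector<record> records;
        std::size_t pos = 0;
        while(pos + 4 <= buf.size())
        {
            std::uint16_t const type = buf[pos] | (buf[pos + 1] << 8);
            std::uint16_t const len  = buf[pos + 2] | (buf[pos + 3] << 8);
            pos += 4;
            if(9999 == type)
                break;                       // end-of-table marker
            if(pos + len > buf.size())
                throw std::runtime_error("truncated record");
            if(!(1 <= type && type <= 18))
                std::cerr << "warning: unknown record type " << type
                          << " (" << len << " bytes) kept verbatim\n";
            records.push_back
                ({type, {buf.begin() + pos, buf.begin() + pos + len}});
            pos += len;
        }
        return records;
    }

    int main()
    {
        // A toy buffer: one type-19 record of 2 bytes, then the 9999 marker.
        std::vector<unsigned char> const buf
            {19, 0, 2, 0, 0xAB, 0xCD, 0x0F, 0x27, 0, 0};
        std::cout << read_records(buf).size() << '\n'; // 1, after a warning
    }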


 I'm going to make a new version of table_tool tomorrow which will include
the change to give a warning instead of an error for things that look like
fields in the text format but are not recognized as such. I'd also like to
add the handling of this mysterious type 19 to it, so I'll wait until you
let me know how it should be handled -- or tell me not to do it.

 Thanks again,
VZ
