From: Greg Chicares
Subject: Re: [lmi] Unknown fields in table input text files
Date: Sun, 21 Feb 2016 19:15:43 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.5.0

On 2016-02-21 18:12, Vadim Zeitlin wrote:
> On Sun, 21 Feb 2016 17:43:12 +0000 Greg Chicares <address@hidden> wrote:
> 
> GC> On 2016-02-21 15:42, Vadim Zeitlin wrote:
> GC> > On Sun, 21 Feb 2016 15:09:04 +0000 Greg Chicares <address@hidden> wrote:
> GC> > 
> GC> > GC> Demonstrating commutativity requires validating all five steps:
> GC> > GC>   binary --> platonic --> text --> platonic --> binary
> GC> > 
> GC> >  Yes, but the last step has unit tests and I'm relatively confident that
> GC> > it's not going to fail, i.e. if we successfully read a table from binary
> GC> > format representation, we will serialize it to exactly the same
> GC> > representation. I don't think adding the extra tests for this last step is
> GC> > worth the trouble (especially because, as mentioned before, we don't have a
> GC> > way to serialize to an in-memory buffer right now, so we'd have to add it
> GC> > just for this).
> GC> 
> GC> But we could do all of this by combining normal operations, e.g.:
> GC> 
> GC>   for z in $(seq 1 5); do \
> GC>     table_tool --file=old_database --extract=$z; \
> GC>     table_tool --file=new_database --merge=$z.txt; \
> GC>   done
> GC> 
> GC> and then 'cmp old_database new_database'.
> 
>  This won't work, actually. The reason is that the order of the tables in
> the original .dat disk file is lost when doing this, and I see no benefit
> in preserving it (nor would preserving it even be possible when going
> table by table anyhow).

Was reordering part of the old '--compress'?

Even if it wasn't, would it be easier to implement a sort-tables-by-number
option (which would make the shell code above work as desired) than to
extend '--verify' to do binary --> text --> binary?
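
For concreteness, here is how the byte-for-byte comparison might go if such
an option existed. The '--sort-tables' name below is purely hypothetical;
only '--extract', '--merge', and the '--file' convention are taken from the
commands above:

  # Rebuild new_database table by table, then normalize both databases
  # to table-number order before comparing them byte for byte.
  for z in $(seq 1 5); do
    table_tool --file=old_database --extract=$z
    table_tool --file=new_database --merge=$z.txt
  done
  table_tool --file=old_database --sort-tables    # hypothetical option
  table_tool --file=new_database --sort-tables    # hypothetical option
  cmp old_database.dat new_database.dat
  cmp old_database.ndx new_database.ndx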

>  To avoid any future misunderstandings please notice that the order of the
> tables in the .dat file is, generally speaking, different from the order of
> the tables in the index file .ndx. This latter order *is* preserved when
> the database is written back to disk (although it's still different from
> the table-number order, so the naive loop above still wouldn't work even
> if the .dat and .ndx orders were the same). But the former order is not, so
> if you have tables "1" and "2" in the index, you could have "2" and "1" in
> the original .dat file, but you will have "1" and "2" in the new .dat file
> and comparing them byte by byte won't succeed.
> 
>  So to properly implement such round trip test we need to create the new
> database, load back all tables in it and check that they are identical to
> the original ones. This is what the existing unit test does for qx_cso and
> qx_ins and I could extend table_tool --verify option to do it as well. It's
> just that this hardly looks like the most important thing to do to me.

To me, making the round trip with every file (including proprietary
tables that you don't have, need, or IMO want) is an obvious and
powerful high-level test that may uncover "unknown unknowns". That's
why I think it's quite important.
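
Meanwhile, the ordering problem can be sidestepped from the outside by
comparing each table's text form rather than the binary files. A sketch,
assuming only what the loop above already assumes ('--extract=N' writes
N.txt in the current directory):

  # binary --> text --> binary --> text, per table: if the final text
  # matches the first, the on-disk order of the .dat files is irrelevant.
  for z in $(seq 1 5); do
    table_tool --file=old_database --extract=$z
    table_tool --file=new_database --merge=$z.txt
    mv $z.txt $z.old.txt
    table_tool --file=new_database --extract=$z
    diff $z.old.txt $z.txt || echo "table $z changed in the round trip"
  done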

> GC> If it's not easy to build that in, then I plan to do it as above.
> 
>  It's slightly bothersome to build this in, but it will have to be done
> because it's much more so to do it from outside.

Unless a sort-tables-by-number operation is easier. It would be like
interposing *nix 'sort' in a pipeline; maybe we'd find other uses for it.

> GC> > GC> Phase I: The program prints a warning, and the round-trip conversion
> GC> > GC> may fail. Upon manual inspection of warnings and failures, we may
> GC> > GC> discover undocumented fields. I want to emphasize that I have in fact
> GC> > GC> discovered at least one undocumented record type.
> GC> > 
> GC> >  In the binary format?
> GC> 
> GC> Yes--for example, an unknown record type 19, outside the enumeration
> GC> {1..18,9999} given in the old SOA code.
> 
>  This presumably needs to be addressed before table_tool can be used, so
> how should the records of this type be handled? What is the corresponding
> text format representation?

This is in the third category of Rumsfeld's trichotomy: "unknown unknowns",
about which we cannot even speculate until we move them into the second
category, "known unknowns", so that we can reduce them to "knowns". That is
one use case for the envisioned round-trip tester.
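
Concretely, once '--verify' makes the full trip, a sweep like the following
would surface any such surprises across a whole directory of databases (the
*.dat layout and the destination of the warning messages are assumptions
here):

  # Verify every database and collect the Phase I warnings in one place,
  # so that an undocumented record type cannot pass unnoticed.
  for db in *.dat; do
    table_tool --file=${db%.dat} --verify
  done >verify.log 2>&1
  grep -i warning verify.log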

>  I'm going to make a new version of table_tool tomorrow which will include
> the change to give a warning instead of an error for things that look like
> fields in the text format but are not recognized as such. I'd also like to
> add handling for this mysterious type 19, so I'll wait until you let me
> know how it should be handled -- or tell me not to do it.

But maybe there is no type nineteen, though we might discover a type thirty.
I know I once encountered a record type that was unknown to me. I don't
remember what its record number was; "19" was just an example.



