Re: [lmi] Unknown fields in table input text files


From: Greg Chicares
Subject: Re: [lmi] Unknown fields in table input text files
Date: Sat, 20 Feb 2016 14:33:01 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.5.0

On 2016-02-20 12:57, Vadim Zeitlin wrote:
> On Sat, 20 Feb 2016 04:12:25 +0000 Greg Chicares <address@hidden> wrote:
> 
> GC> On 2016-02-20 03:16, Vadim Zeitlin wrote:
> GC> > 
> GC> >  I decided to extend my tests checking that all tables in qx_ins and qx_cso
> GC> > databases survive the round trip through the new table code to also do the
> GC> > same for the tables in qx_ann and got several failures due to the presence
> GC> > of unknown "fields" in some of the tables here.
[...]
> GC> >  One of them looks like a real field as it's present in several files: it's
> GC> > the "Editor: " one. I don't know at all what to do about it as there is no
> GC> > corresponding field in the binary format, so there doesn't seem to be any
> GC> > way to store the value of this field in it.
> GC> 
> GC> Please tell me the number of a 'qx_ann' table that has this field so that
> GC> I can examine it. I don't remember ever seeing "Editor:" in these files.
> 
>  It occurs in the following tables:
> 
> 893 894 895 896 897 898 952 953 954 955 956 957 958 959 960 961 962 963 964 
> 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 
> 984

Let's examine table 893. Text output:

Table number: 893
Table type: Aggregate
Contributor: Roger Scott Lumsden address@hidden
Source of data: Population mortality (projected and modified for Pensioners)
Unit of observation: Number of lives
...
Comments: These are supposed to represent the expected mortality of pensioners from
the generation born in 1950, updated through 1990-92 census results.
This is from the diskette available with
"The Second Actuarial Study of Mortality in Europe"
Editor: A.S.MacDonald

In the raw data,

0574740 stx nul eot nul   } etx nul nul etx nul soh nul   A eot nul   -
0574760 nul   R   o   g   e   r  sp   S   c   o   t   t  sp   L   u   m
0575000   s   d   e   n  sp   7   5   1   4   7   .   2   6   2   0   @
0575020   c   o   m   p   u   s   e   r   v   e   .   c   o   m enq nul
0575040   < nul   P   o   p   u   l   a   t   i   o   n  sp   m   o   r
0575060   t   a   l   i   t   y  sp   (   p   r   o   j   e   c   t   e
0575100   d  sp   a   n   d  sp   m   o   d   i   f   i   e   d  sp   f
0575120   o   r  sp   P   e   n   s   i   o   n   e   r   s   )  bs nul
0575140  si nul   N   u   m   b   e   r  sp   o   f  sp   l   i   v   e
0575160   s  ht nul   C soh   B   a   s   e   l   i   n   e  sp   c   a

we can pick out the records:

[Contributor] eot nul - nul R o g e r sp ...
[Source of data] enq nul < nul P o p u l a t i o n ...
[Unit of observation] bs nul si nul N u m b e r sp o f ...

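Record "titles" like "Contributor" aren't spelled out; presumably the
four bytes preceding the raw record contents encode the record type
(the first two bytes) and, judging by the values above, the content
length (the next two: '-' = 45, '<' = 60, SI = 15).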
The source code says:

#define DT_contributor 4
#define DT_dataSource  5
#define DT_unitOfObs   8
#define DT_comments     11

and indeed 4 = EOT, 5 = ENQ, and 8 = BS. Record 11 = \013 = VT is:

0576260  vt nul soh soh   T   h   e   s   e  sp   a   r   e  sp   s   u
0576300   p   p   o   s   e   d  sp   t   o  sp   r   e   p   r   e   s
0576320   e   n   t  sp   t   h   e  sp   e   x   p   e   c   t   e   d
0576340  sp   m   o   r   t   a   l   i   t   y  sp   o   f  sp   p   e
0576360   n   s   i   o   n   e   r   s  sp   f   r   o   m  nl   t   h
0576400   e  sp   g   e   n   e   r   a   t   i   o   n  sp   b   o   r
0576420   n  sp   i   n  sp   1   9   5   0   ,  sp   u   p   d   a   t
0576440   e   d  sp   t   h   r   o   u   g   h  sp   1   9   9   0   -
0576460   9   2  sp   c   e   n   s   u   s  sp   r   e   s   u   l   t
0576500   s   .  nl   T   h   i   s  sp   i   s  sp   f   r   o   m  sp
0576520   t   h   e  sp   d   i   s   k   e   t   t   e  sp   a   v   a
0576540   i   l   a   b   l   e  sp   w   i   t   h  nl   "   T   h   e
0576560  sp   S   e   c   o   n   d  sp   A   c   t   u   a   r   i   a
0576600   l  sp   S   t   u   d   y  sp   o   f  sp   M   o   r   t   a
0576620   l   i   t   y  sp   i   n  sp   E   u   r   o   p   e   "  nl
0576640   E   d   i   t   o   r   :  sp   A   .   S   .   M   a   c   D
0576660   o   n   a   l   d dle nul stx nul ack nul  ff nul stx nul soh

So "Comments:" corresponds to VT, and is not spelled out in the file,
while "Editor:", beginning at 0576640, is just a text string contained
in the "Comments" record.
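
To make the layout concrete, here is a minimal sketch of walking such records. It assumes, from the dumps above only, that each record is a 16-bit little-endian type, a 16-bit little-endian length, and then that many content bytes; the names `Record`, `read_u16_le`, and `split_records` are hypothetical and are not taken from the 1990s code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record representation, for illustration only.
struct Record {
    std::uint16_t type;   // e.g. 4 = contributor, 11 = comments
    std::string content;  // raw bytes; may contain newlines and colons
};

// Read a 16-bit little-endian value starting at offset pos.
static std::uint16_t read_u16_le(const std::vector<unsigned char>& buf, std::size_t pos)
{
    assert(pos + 1 < buf.size());
    return static_cast<std::uint16_t>(buf[pos] | (buf[pos + 1] << 8));
}

// Split a buffer laid out as [type:2][length:2][content:length]...
// (an assumed layout inferred from the dumps). Stops at a truncated record.
std::vector<Record> split_records(const std::vector<unsigned char>& buf)
{
    std::vector<Record> out;
    std::size_t pos = 0;
    while (pos + 4 <= buf.size()) {
        const std::uint16_t type = read_u16_le(buf, pos);
        const std::uint16_t len = read_u16_le(buf, pos + 2);
        pos += 4;
        if (buf.size() - pos < len) break;  // truncated content
        out.push_back({type, std::string(buf.begin() + pos, buf.begin() + pos + len)});
        pos += len;
    }
    return out;
}
```

Under that assumption, a string such as "Editor: ..." inside a type-11 record simply stays part of that record's content, because boundaries come from the length prefix rather than from anything in the text.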

>  It is actually part of "Comments:" in the binary files, but it surely
> looks like just another header (similar to e.g. "Contributor") in the text
> format.

Yes, to the naive reader, it would appear that this file has a novel
"Editor" record; but in fact it does not. Therefore...

> GC> I have two suggestions:

[...transposing them...]

> GC> (2) Use a regex like /[A-Za-z0-9]* *[A-Za-z0-9]*:/ on the assumption that
> GC> header names consist of one or two words followed by a colon. Deem any
> GC> colon that occurs later in the line to be content rather than markup.

This cannot work. A "Contributor" specified as
  "\nSource of data:\nTable number:\nContributor:"
cannot be parsed this way.
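
A small sketch of the failure, assuming a line-oriented parser that applies the quoted regex at the start of each line (the `^` anchor and the function name `apparent_headers` are my assumptions, not anything in the existing code):

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Return the header-like prefixes a line-oriented parser would find
// when it applies the suggested regex at the start of every line.
std::vector<std::string> apparent_headers(const std::string& text)
{
    static const std::regex header_re("^[A-Za-z0-9]* *[A-Za-z0-9]*:");
    std::vector<std::string> found;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::smatch m;
        if (std::regex_search(line, m, header_re)) {
            found.push_back(m.str());
        }
    }
    return found;
}
```

Given the text form of a single Contributor record with the content above, i.e. "Contributor: \nSource of data:\nTable number:\nContributor:", this reports three apparent headers although there is only one record. Note also that the pattern allows at most two words, so it does not match the three-word "Source of data:" at all.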

>  Yes, I definitely need to do this to avoid at least the obvious false
> positives. The trouble with "Editor:" and "WARNING:" is that they're not
> really obvious, are they.

Actually, we must not do this. And "Editor:" and "WARNING:" are not
record titles and do not begin new records. Records are indicated
by prefixed bytes like EOT and VT. (Therefore, record content must
not include those bytes.)

> GC> (1) Build a whitelist of header names, and reject anything not on the list.
> GC> I imagine that this list will be short; I thought they were enumerated
> GC> in the 1990s code, and perhaps also in the HLP or GID documentation.

The place to start is the macro pseudo-enum in 'dectable.cpp':

#define DT_tableName   1
#define DT_tableNumber 2
...
#define DT_hashValue    18

IIRC, after they published that code, SOA added some new record types,
which are documented nowhere and can be discovered only by the sort of
testing you're doing.
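
As a rough sketch of such a whitelist, limited to the titles and codes that appear in this message: the pairing of each text title with a DT_ code follows the dumps above, except "Table number" = 2, which is assumed from the macro name. The elided entries of the pseudo-enum are deliberately left out, and the names `known_headers` and `header_type` are hypothetical.

```cpp
#include <map>
#include <string>

// Partial whitelist: only titles and codes quoted in this message.
// Entries elided from the pseudo-enum are not guessed here.
static const std::map<std::string, int> known_headers = {
    {"Table number", 2},         // assumed to correspond to DT_tableNumber
    {"Contributor", 4},          // DT_contributor
    {"Source of data", 5},       // DT_dataSource
    {"Unit of observation", 8},  // DT_unitOfObs
    {"Comments", 11},            // DT_comments
};

// Return the record type if the text before the first colon is a
// whitelisted title, else 0 (meaning: treat the line as content).
int header_type(const std::string& line)
{
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) return 0;
    const auto it = known_headers.find(line.substr(0, colon));
    return it == known_headers.end() ? 0 : it->second;
}
```

With this table, "Editor: A.S.MacDonald" and any "WARNING: ..." line yield 0 and so remain content. A whitelist alone still cannot tell a real title from identical text inside record content, which is the same limitation as the "Contributor" example above.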

>  Would we include "WARNING" in this whitelist?

No. It's not a record type.



