Re: [lmi] Calculation summary XML resources structure (with some examples)


From: Greg Chicares
Subject: Re: [lmi] Calculation summary XML resources structure (with some examples)
Date: Fri, 22 Sep 2006 12:36:36 +0000
User-agent: Thunderbird 1.5.0.4 (Windows/20060516)

On 2006-9-21 13:14 UTC, Vadim Zeitlin wrote:
> On Thu, 21 Sep 2006 14:32:08 +0200 Evgeniy Tarassov <address@hidden> wrote:
> 
> ET> schema.xsd - a Schema file describing illustration data (data.xml) and
> ET> column traits file (format.xml)
> ET> 
> ET> data.xml - an example of illustration data
> 
>  Just to make it simpler to understand the proposed XML format, without
> having to read the entire schema (which is, IMHO, quite readable but an
> example is still clearer), here is an extract from data.xml:

Thanks, both the schema and this extract are helpful.

> <?xml version="1.0" ?>
> <illustration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>                   xsi:noNamespaceSchemaLocation="file:schema.xsd">
>     <!-- string scalars -->
>     <double_scalar name="Age">45</double_scalar>
>     <double_scalar name="GuarMaxMandE">0.09</double_scalar>

I'm comparing some of these lines to the xml file we use today for
'fop'. In that other file, it's something like
    <GuarMaxMandE>0.90%</GuarMaxMandE>
Ultimately, we might want to format that as a percentage: "9%" or
"9.00%" or whatever. The schema says

  <xs:complexType name="double_scalar_type">
  ...
    A node value has to be a double in the scientific notation.

so this seems to be a deliberate design decision. I'd like to
ask about that. I am no xslt expert, so maybe I just need to be
educated. Let me explain why I'd thought it might be better to
pass formatted numbers instead.

I had always assumed that C++ would do formatting faster than
any xsl processor, and that we could be absolutely assured of
uniform formatting by using C++. I'm thinking about edge cases
like the real number 1.07, which has no finite binary expansion
and can easily turn into
  1.070000000000001
if we aren't careful. I've taken that care in C++. For some
reports (that are currently done in C++ without any xml or xsl)
we might even want to use 80-bit long doubles someday. And we
already have to format numbers to write them into an xml file
(binary would be neither portable nor human-readable, and the
schema calls for scientific notation anyway), so why not do all
the formatting we want in C++?
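
A minimal sketch of the kind of C++ formatting meant here, assuming
nothing about lmi's actual code (the helper name 'format_fixed' is
hypothetical): a stream in fixed notation with an explicit precision
rounds away the binary representation error.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not lmi's real code): format 'value' in fixed
// notation with exactly 'decimals' digits after the decimal point.
std::string format_fixed(double value, int decimals)
{
    std::ostringstream oss;
    oss << std::fixed << std::setprecision(decimals) << value;
    return oss.str();
}
```

With this helper, format_fixed(1.07, 2) and
format_fixed(1.069999999999999, 2) both yield "1.07", whereas asking
for 17 decimals exposes the representation error in the double
nearest to 1.07.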

I know very little about xsl, so maybe those arguments really
aren't compelling; please let me know what you think. I do see
  http://www.w3.org/TR/xslt20/#function-format-number
mentioned in the schema. (BTW, thanks for documenting this
carefully: it's very useful for me.) Following that link, I
see things like 'infinity' and 'NaN', which answer other
questions I would have raised. And I see 'percent-sign' and
'per-mille-sign', so it looks like they put a lot of thought
into this. OTOH, we don't use 'per mille', but we might like
to use 'basis points', signified by 'bp', and meaning one
one-hundredth of a percent: xsl doesn't seem to provide that.

So I wonder whether it might be prohibitively difficult to
extend xsl to do things like 'bp' that we might want, but
that the w3c didn't think of. Obviously they can't provide
every type of formatting that every specialized application
domain may require.
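
For comparison, here is a rough sketch of how little C++ a 'bp'
format would take. The function name 'format_bp' and the choice to
round to whole basis points are assumptions made for illustration.

```cpp
#include <cmath>
#include <string>

// Hypothetical helper: 'rate' is a fraction (0.0025 means 0.25%).
// One basis point is one one-hundredth of a percent, i.e. 0.0001,
// so multiply by 10000 and round to the nearest whole number.
std::string format_bp(double rate)
{
    long const bp = std::lround(rate * 10000.0);
    return std::to_string(bp) + " bp";
}
```

Rounding with std::lround() rather than truncating guards against a
product that lands just below an integer because of binary
representation error.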

Another issue is 'scaling' numbers so that they all fit on
the page. This is a common problem for life-insurance
illustration systems. It's been my experience that other
systems keep applying "quick fixes" but never seem to get it
really right. The most difficult situation is large sales to
corporations, which might have thousands of insurance policies,
each with millions of dollars, growing at interest over many
years. Another system might accommodate numbers up to 1e9,
then extend that to 1e10 when 1e9 becomes insufficient; then
someday 1e10 becomes inadequate, so it's broken again. We
addressed this by dividing all columns by a scale factor
first, and then printing a note like
  "Values are in billions of dollars."
Even though much of our work is quite US-specific because
our industry is intensely regulated, this method already
does the right thing for countries with much smaller
currency units, too.
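
The scaling idea can be sketched roughly as follows. This is an
illustration under stated assumptions, not lmi's implementation: the
names 'scaling' and 'choose_scale' are hypothetical, the scale factor
is taken to be a power of 1000, and the sketch stops at billions.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical result type: the divisor to apply to every column,
// and the explanatory note to print (empty if no scaling is needed).
struct scaling
{
    double      divisor;
    std::string note;
};

// Pick the smallest power of 1000 (up to 1e9 in this sketch) such
// that the largest absolute value, once divided, is below 'limit'.
scaling choose_scale(std::vector<double> const& values, double limit)
{
    static char const* const units[] =
        {"", "thousands", "millions", "billions"};
    double max_abs = 0.0;
    for(double v : values)
        max_abs = std::max(max_abs, std::fabs(v));
    double divisor = 1.0;
    int k = 0;
    while(limit <= max_abs / divisor && k < 3)
        {
        divisor *= 1000.0;
        ++k;
        }
    std::string note;
    if(0 < k)
        note = std::string("Values are in ") + units[k] + " of dollars.";
    return scaling{divisor, note};
}
```

Because the divisor is derived from the data rather than fixed in
advance, no particular magnitude has to be anticipated, within the
range of units the sketch knows about.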

I also fear losing control, as in the '1.07' example above.
In C++, if we get '1.069999999999999', then I know I can
figure out how to handle that. And I can automatically test
any number of compilers. If we use xslt for formatting, then
doesn't that add dependencies on the xslt implementation, and
the compiler used to build it, and that compiler's runtime
library? You might fairly observe that we're using apache's
'fop' for some output, and that uses java, which has its own
notion of floating-point numbers; but in practice we haven't
found any problem there, because we pass only formatted
numbers to 'fop'.

>     ...
>     <double_scalar name="InitTgtPrem">55394.15</double_scalar>
> 
>     <string_scalar name="AllowDbo3">1</string_scalar>

Here, I start to wonder whether I overspecified the problem.
The 'double_scalar' versus 'string_scalar' distinction is
important only upstream, before formatting is done--if we
do all numeric formatting in C++. By the time any xml output
is written, that distinction no longer really means anything.

IOW, every 'scalar' name is unique, so this is impossible:
    <double_scalar name="UniqueName">55394.15</double_scalar>
    <string_scalar name="UniqueName">1</string_scalar>
and so is this (even with identical values):
    <double_scalar name="UniqueName">1</double_scalar>
    <string_scalar name="UniqueName">1</string_scalar>

I am pretty confident that names are globally unique, so that a
'scalar' and a 'vector' can't share the same name. I'm not sure
that the program enforces that global uniqueness--maybe it's been
just a matter of programming discipline to avoid it--but it would
probably be a good idea to enforce it, eventually, just to make
sure it's a perfectly reliable invariant.
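
Enforcing that invariant could be as simple as the following sketch,
in which the class name 'name_registry' is hypothetical and scalars
and vectors are assumed to register their names in one shared place.

```cpp
#include <set>
#include <string>

// Hypothetical registry shared by scalars and vectors alike, so
// that a name used by one kind cannot be reused by the other.
class name_registry
{
  public:
    // Returns true if 'name' was new, false if it was already taken.
    bool add(std::string const& name)
    {
        return names_.insert(name).second;
    }

  private:
    std::set<std::string> names_;
};
```

A caller would treat a false return as an error, for example by
throwing an exception when the output is assembled.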

I'm not sure whether that uniqueness would suggest any further
simplification. I mention it just in case it gives you any ideas.

>     <string_scalar name="InitAnnGenAcctInt" basis="run_curr_basis">0.06</string_scalar>

This naming convention seems much nicer than the comparable
    <InitAnnGenAcctInt_Current>6.00%</InitAnnGenAcctInt_Current>
in the file we pass to 'fop' today. Agglutinating attributes into
a single unwieldy variable name was probably a bad design choice
on my part.

>     <variant name="Outlay">
>         <duration>20000</duration>
>       ...
>         <duration>20000</duration>
>     </variant>

The comparable part of the file used with 'fop' today is:

    <newcolumn>
      <column name="Outlay">
        <duration number="0" column_value="20,000"/>
        <duration number="1" column_value="20,000"/>
...
        <duration number="54" column_value="20,000"/>
      </column>
    </newcolumn>

We designed that layout in a great hurry. The code says:
  // TODO ?? Is <newcolumn> really useful?
and your layout:
    <variant name="Outlay">
        ...
    </variant>
seems better than our old layout:
    <newcolumn>
      <column name="Outlay">
        ...
      </column>
    </newcolumn>
in that respect.

The other difference I see here is
>         <duration>20000</duration>
>       ...
>         <duration>20000</duration>
instead of
        <duration number="0" column_value="20,000"/>
...
        <duration number="54" column_value="20,000"/>
and I'd just like to mention that, for 'fop' at least, we really
do need to select certain durations sometimes. For instance,
normally we print every column in its entirety, but there's a law
that requires us to show something like the tenth value, the
twentieth, and the one corresponding to age seventy. Does that
remain possible if we omit 'number='? Here, again, the schema
provides the answer (thanks):

  <xs:complexType name="duration_type">
    As "duration" nodes are always ordered inside a "variant"
    parent node we will use "duration"'s node position()
    to get its index in the parent container.
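
One detail to watch: XPath's position() counts from one, while the
'number=' attribute in the file used with 'fop' counts from zero.
The sketch below shows the index arithmetic for the selection
described above. All names are hypothetical, and it assumes that the
value "corresponding to age seventy" is the one whose zero-based
duration equals seventy minus the issue age.

```cpp
#include <vector>

// Hypothetical helper: convert a zero-based duration number (as in
// number="0" ... number="54") to the one-based value that XPath's
// position() reports for the same node.
int to_xpath_position(int duration_number)
{
    return duration_number + 1;
}

// Zero-based durations to print: the tenth value, the twentieth,
// and (by assumption) the one at attained age seventy. Durations
// beyond the end of the column are skipped.
std::vector<int> selected_durations(int issue_age, int column_length)
{
    std::vector<int> const wanted = {9, 19, 70 - issue_age};
    std::vector<int> result;
    for(int d : wanted)
        {
        if(0 <= d && d < column_length)
            result.push_back(d);
        }
    return result;
}
```

For issue age 45 and a 55-element column this selects durations 9,
19, and 25, which position() would report as 10, 20, and 26.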

>     <variant name="AcctVal" basis="run_curr_basis">

BTW, would it be okay to say 'vector' instead of 'variant' (as
above) or 'column' and 'newcolumn' (as in the xml for 'fop')?
That would conform to the terminology used in the C++ code.
IIRC, 'variant' here is not prescribed by xml, so if it's just
an arbitrary name, then let's pick one we all like. I guess I'd
think of a 'vector' as meaning an ordered collection of scalars
(like C++'s std::vector), and maybe a 'column' as what in C++
we might describe this way:

  template<class T>
  class column
  {
  ...
    private:
      std::vector<T> v_;
      std::string title_; // or 'heading_', etc.
      format_struct format_;
  };

>     <supplementalreport>
>         <column name="InitAnnGenAcctInt" basis="run_curr_basis" />
>         <column name="InitAnnGenAcctInt" basis="run_guar_basis" />
>     </supplementalreport>
> </illustration>

This might not matter today, because supplemental reports are
distinct from the calculation summary, but let me say a little
about it now anyway....

Here's part of an xml output file used with 'fop' today:

<illustration>
  <scalar>
    <SupplementalReport>0</SupplementalReport>
    <SupplementalReportColumn00>PolicyYear</SupplementalReportColumn00>
    <SupplementalReportColumn01>[none]</SupplementalReportColumn01>
    <SupplementalReportColumn02>Outlay</SupplementalReportColumn02>
...
  </scalar>
...
  </data>
  <supplementalreport>
    <title>Supplemental Report</title>
    <columns>
      <name>PolicyYear</name>
      <title> _____________ _____________ Policy __Year</title>
    </columns>
...
  </supplementalreport>
</illustration>

As in other circumstances discussed above, your hierarchical
structure is probably better than what we've been using, so I'll
comment only on one point. Here:
    <SupplementalReportColumn00>PolicyYear</SupplementalReportColumn00>
    <SupplementalReportColumn01>[none]</SupplementalReportColumn01>
    <SupplementalReportColumn02>Outlay</SupplementalReportColumn02>
the middle '01' column is significant as a spacer. The serial
numbers '00' through '11' do correspond exactly to variable
names in the program; I don't know whether that has any real
benefit, and I imagine that the same effect could be achieved
by making an empty 'spacer' available.

As with other work you showed above, I really like this:
        <column name="InitAnnGenAcctInt" basis="run_curr_basis" />
        <column name="InitAnnGenAcctInt" basis="run_guar_basis" />
instead of 'InitAnnGenAcctInt_Current' etc.

>  And here is the (fragment of) HTML generated from it using html.xsl:

[snipped] That looks good, though I didn't try rendering it in
a browser. But the real point is that I don't need to, because
it can always be modified, without any C++ change: as you say...

>  Of course, many additional customizations are possible, either via
> formats.xml file or by changing/extending html.xsl.

To summarize, this looks like a solid improvement over what we
were able to do in the past without the sort of xslt expertise
you provide. Already I see opportunities to simplify other parts
of 'lmi' in the future by building upon this work. The main
question for now is where we should format numbers.




