
From: Evgeniy Tarassov
Subject: Re: [lmi] Calculation summary XML resources structure (with some examples)
Date: Mon, 25 Sep 2006 18:16:52 +0200

Hello,

> ... so why not do all
> the formatting we want in C++?

Let me try to compile a quick summary of pros and cons for C++ and
XSLT number formatting:

0. Scheme
Report generation involving apache.fop would be done in several steps:

data --(c++,libxml2)--> data.xml --(XSLT,libxslt,fo.xsl)--> report.fo
--(apache-fop)--> report.pdf

Any number formatting should be done either during the first step (in
C++ code) or during the second step (XSLT), so that apache.fop (which
is Java-based) performs no significant data manipulation -- only page
layout. That way we don't have to worry about any implementation
details of whichever FO processor lmi ends up using.

1. Speed
C++ is faster than XSLT, for at least one reason: libxslt uses almost
the same C routines to implement number formatting, but it has to
interpret the compiled template at runtime. However, that difference
(assuming the XSLT implementation is smart enough) should not be
noticeable compared to the template loading and compilation time. Of
course any such statement is based on intuition, and libxslt needs to
be carefully tested and profiled. But one could also say that
premature optimisation is evil.

2. Precision
While we can already assure precision in C++ code, formatting done in
XSLT will require some testing -- perhaps an additional testcase for
libxslt that could be built and run on different platforms and
compilers to assure the exact formatting we need.
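
For instance, such a testcase could be as simple as a stylesheet that
formats a handful of boundary values with format-number(), whose output
is diffed against a stored reference file on every platform. A minimal
sketch (the test values and the F2-style pattern are illustrative only):

<?xml version="1.0"?>
<!-- Sketch of a formatting testcase: run it against any XML input
     (e.g. with xsltproc) and compare the output with a reference
     file.  The test values below are illustrative only. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- F2-style pattern: two decimals, thousands separators. -->
    <xsl:value-of select="format-number(1.069999999999999, '#,##0.00')"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:value-of select="format-number(0.005, '#,##0.00')"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:value-of select="format-number(1234567.891, '#,##0.00')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>

Any platform-dependent rounding (the 0.005 case is the interesting one)
would then show up immediately in the diff.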

3. Design
From the design point of view C++ code would be taking care of our
"Model" (algorithms, data) and XSLT would be the "View" part of the
application.

4. Scaling
On the XSLT side this could be done as a preprocessing step that
determines a scaling factor to apply to double scalars, plus an
explanatory message to be put somewhere inside the generated document
(the previously mentioned "Values are in billions of dollars").

> I know very little about xsl, so maybe those arguments really
> aren't compelling; please let me know what you think. I do see
>   http://www.w3.org/TR/xslt20/#function-format-number

This function was used in the example XSL because it is a common
denominator for our formatting rules (F0 .. F5). Introducing a new
rule like 'bp' will need some additional XSL coding, but it is surely
feasible. I think any number format we have mentioned could be
implemented using standard XPath functions and some basic arithmetic
(http://www.w3.org/TR/2003/WD-xpath-functions-20031112/#d1e1613,
especially fn:round-half-to-even
http://www.w3.org/TR/2003/WD-xpath-functions-20031112/#func-round-half-to-even).
That will surely need a testcase to ensure that those functions behave
uniformly across platforms.
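
As an illustration, a 'bp' rule could be a small named template that
scales the value by 10000 and reuses format-number() for rounding and
grouping. A minimal, self-contained sketch (the template and parameter
names are made up; only the arithmetic and format-number() are standard):

<?xml version="1.0"?>
<!-- Sketch of a 'bp' (basis points) rule: 0.0123 -> "123 bp".
     It scales by 10000 and reuses format-number for rounding and
     grouping; the template name is illustrative only. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template name="format-bp">
    <xsl:param name="value"/>
    <xsl:value-of select="format-number($value * 10000, '#,##0')"/>
    <xsl:text> bp</xsl:text>
  </xsl:template>

  <!-- Demonstration: format a literal value. -->
  <xsl:template match="/">
    <xsl:call-template name="format-bp">
      <xsl:with-param name="value" select="0.0123"/>
    </xsl:call-template>
  </xsl:template>
</xsl:stylesheet>

In fo.xsl such a template could simply sit alongside the other
formatting rules.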

> mentioned in the schema. (BTW, thanks for documenting this
> carefully: it's very useful for me.) Following that link, I
> see things like 'infinity' and 'NaN', which answer other
> questions I would have raised. And I see 'percent-sign' and
> 'per-mille-sign', so it looks like they put a lot of thought
> into this. OTOH, we don't use 'per mille', but we might like
> to use 'basis points', signified by 'bp', and meaning one
> one-hundredth of a percent: xsl doesn't seem to provide that.
>
> So I wonder whether it might be prohibitively difficult to
> extend xsl to do things like 'bp' that we might want, but

I have tried to find a suitable API entry point in libxml2, but it
does not seem to provide any XPath customisation/extension hook (any
function operating on a node is part of the XPath expression syntax).

> Another issue is 'scaling' numbers so that they all fit on
> the page. This is a common problem for life-insurance
> illustration systems. It's been my experience that other
> systems keep applying "quick fixes" but never seem to get it
> really right. The most difficult situation is large sales to
> corporations, which might have thousands of insurance policies,
> each with millions of dollars, growing at interest over many
> years. Another system might accommodate numbers up to 1e9,

This "feature" could be implemented on the XSLT using a preparsing
phase (one template called at the beginning) and another special node
in format.xml file that would specify a possible scaling plus an
explanation message to inject somewhere into the report:
format.xml:
<scalings>
  <scale factor="1E+2">Values are in hundreds of dollars</scale>
  <scale factor="1E+3">Values are in thousands of dollars</scale>
  <scale factor="1E+6">Values are in millions of dollars</scale>
  <scale factor="1E+9">Values are in billions of dollars</scale>
  <scale factor="1E+12">Values are in zillions of dollars</scale>
</scalings>

The XSLT template would have to calculate the maximum of all values in
data.xml, find a suitable scale factor in format.xml, and then apply
that scaling to every value being formatted, as sketched below.
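
A minimal sketch of that preprocessing step follows. It makes a few
assumptions purely for illustration: the raw numbers live in
/illustration/vector/duration/@column_value, format.xml sits next to
the stylesheet, and its factors are written as plain decimals
(e.g. "1000000" instead of "1E+6"), because XPath 1.0's number() is
not guaranteed to parse exponential notation.

<?xml version="1.0"?>
<!-- Sketch of the scaling preprocessing step (illustrative names). -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Largest value anywhere in data.xml, found via a descending sort. -->
  <xsl:variable name="max-value">
    <xsl:for-each select="/illustration/vector/duration/@column_value">
      <xsl:sort select="." data-type="number" order="descending"/>
      <xsl:if test="position() = 1">
        <xsl:value-of select="."/>
      </xsl:if>
    </xsl:for-each>
  </xsl:variable>

  <!-- First (smallest) factor that brings the maximum under seven
       digits; falls back to 1 when no scaling is needed. -->
  <xsl:variable name="scale-node"
    select="document('format.xml')/scalings/scale
            [number($max-value) div number(@factor) &lt; 1000000][1]"/>
  <xsl:variable name="factor">
    <xsl:choose>
      <xsl:when test="$scale-node">
        <xsl:value-of select="$scale-node/@factor"/>
      </xsl:when>
      <xsl:otherwise>1</xsl:otherwise>
    </xsl:choose>
  </xsl:variable>

  <xsl:template match="/">
    <!-- The explanatory message to inject into the report ... -->
    <xsl:value-of select="$scale-node"/>
    <xsl:text>&#10;</xsl:text>
    <!-- ... and one example of applying the factor while formatting. -->
    <xsl:value-of
      select="format-number(/illustration/vector[1]/duration[1]
              /@column_value div $factor, '#,##0')"/>
  </xsl:template>
</xsl:stylesheet>

Run with xsltproc against data.xml, this would print the chosen message
and one scaled, formatted value; inside fo.xsl the same $factor would
simply be applied wherever values are formatted.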

> I also fear losing control, as in the '1.07' example above.
> In C++, if we get '1.069999999999999', then I know I can
> figure out how to handle that. And I can automatically test
> any number of compilers. If we use xslt for formatting, then
> doesn't that add dependencies on the xslt implementation, and
> the compiler used to build it, and that compiler's runtime
Yes, that will depend on the compiler in exactly the same way the lmi
C++ code depends on it. If we choose to use XSLT formatting, it will
surely need a testcase too.

> library? You might fairly observe that we're using apache's
> 'fop' for some output, and that uses java, which has its own
> notion of floating-point numbers; but in practice we haven't
> found any problem there, because we pass only formatted
> numbers to 'fop'.

This part would stay unchanged, since every number would be formatted
by means of libxslt/libxml2 while we are generating data.fo for the
apache-fop processing.

> I am pretty confident that names are globally unique, so that a
> 'scalar' and a 'vector' can't share the same name. I'm not sure
> that the program enforces that global uniqueness--maybe it's been
> just a matter of programming discipline to avoid it--but it would
> probably be a good idea to enforce it, eventually, just to make
> sure it's a perfectly reliable invariant.

It would be really nice to be able to check that double scalars hold
numeric data, hence the distinction between double_scalar and
string_scalar. Our schema would certainly include uniqueness
constraints; we have put a couple of TODOs mentioning the name
uniqueness problem. Using XML Schema we can enforce that a group of
attribute values (name and basis in our case) is unique across a
specified nodeset (all the double_scalar, string_scalar and vector
nodes at the same time), as sketched below.
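
A minimal sketch of such a constraint, assuming the root element is
named 'illustration' (the constraint name and the permissive content
model are illustrative only):

<?xml version="1.0"?>
<!-- Sketch of the uniqueness constraint: no two value nodes under the
     root may share the same (name, basis) pair. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="illustration">
    <xs:complexType>
      <xs:sequence>
        <xs:any processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
    <xs:unique name="UniqueNameAndBasis">
      <xs:selector xpath="double_scalar|string_scalar|vector"/>
      <xs:field xpath="@name"/>
      <xs:field xpath="@basis"/>
    </xs:unique>
  </xs:element>
</xs:schema>

Any schema-validating parser that supports identity constraints would
then reject a document in which two value nodes share the same
name/basis pair.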

>         <duration number="54" column_value="20,000"/>
> and I'd just like to mention that, for 'fop' at least, we really
> do need to select certain durations sometimes. For instance,
> normally we print every column in its entirety, but there's a law
> that requires us to show something like the tenth value, the
> twentieth, and the one corresponding to age seventy. Does that
> remain possible if we omit 'number='? Here, again, the schema
Say we want to reference the 17th element of the 'AcctVal' vector.
The corresponding XPath expressions would be:
With the current XML structure, using the "number" attribute:
/illustration/vector[@name='AcctVal']/duration[@number=17]
With the proposed XML structure:
/illustration/vector[@name='AcctVal']/duration[position()=17]

The expressions are fairly similar (given, of course, that vector
values are always kept in the same order and the index is 1-based).
If at any point during the fo-transformation we need that index, we
can always reinject the position() value into data.fo via the fo.xsl
transformation, or simply obtain it with the XPath function
position(). The expression syntax does not suffer at all, while the
data.xml file becomes slightly simpler (smaller in size and more
human-friendly).
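
For instance, to print only the tenth and twentieth values and (say)
the one at position 55 for the age-seventy requirement, an fo.xsl
fragment could look like this sketch (the vector name and the positions
are hard-coded purely for illustration; in practice they would be
computed or passed as parameters):

<?xml version="1.0"?>
<!-- Sketch: emit only selected durations of one vector. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="/illustration/vector[@name='AcctVal']
                          /duration[position()=10 or position()=20
                                    or position()=55]">
      <xsl:value-of select="@column_value"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

The same approach works for the current structure by testing @number
instead of position().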

>     <variant name="AcctVal" basis="run_curr_basis">

> BTW, would it be okay to say 'vector' instead of 'variant' (as

I'm sorry for the confusion -- it seems I mislabeled the 'vector' values.

> names in the program; I don't know whether that has any real
> benefit, and I imagine that the same effect could be achieved
> by making an empty 'spacer' available.

Sorry, I had missed that feature. Either specifying an empty element
in the supplementalreport section or introducing a new 'spacer' node
will do the same thing. Maybe your 'spacer' idea is better than an
empty <column /> element, because it is more explicit than an "empty
column means a spacer" rule. Let us know which you think is better
aligned with the lmi philosophy and we will modify the schema and the
*.xsl files accordingly:

<supplementalreport>
    <column name="InitAnnGenAcctInt" basis="run_curr_basis" />
    <column />            <!-- or: <spacer /> -->
    <column name="InitAnnGenAcctInt" basis="run_guar_basis" />
</supplementalreport>

> you provide. Already I see opportunities to simplify other parts
> of 'lmi' in the future by building upon this work. The main
> question for now is where we should format numbers.

As you have mentioned above, the pros of using C++ code for number
formatting are speed and precision, while the main pro of using XSLT
is conceptual. IMHO, it is worth the effort to try to implement the
needed number formatting on the XSLT side. While it needs a bit more
work, it offers several long-term pluses:

1. Logic and presentation are truly separated, giving a better design
and easier maintenance.

2. Report-formatting changes will not require the main binary to be
recompiled, and could be supplied to clients as 'theme' packages.

3. Report generation is ready for possible future localisation (and
maybe even i18n?).

--
Eugene



