From: Greg Chicares
Subject: Re: [lmi] Calculation summary XML resources structure (with some examples)
Date: Tue, 26 Sep 2006 05:24:37 +0000
User-agent: Thunderbird 1.5.0.4 (Windows/20060516)

On 2006-9-25 16:16 UTC, Evgeniy Tarassov wrote:
>
>> ... so why not do all
>> the formatting we want in C++?
>
> Let me try to compile a quick summary of pros and cons for C++ and
> XSLT number formatting:

This is an important issue, so let me try putting the pros and cons
in my own words. Given a datum like
  double i = 0.05;
and a report that needs to format that number, e.g., as
  5.000%
the question is where the formatting should be done.
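For concreteness, here is a minimal sketch of what doing that in C++
might look like. The name format_percent() is hypothetical, made up
for this message and not taken from lmi; the classic locale is
imbued only so that the decimal point doesn't depend on the global
locale.

```cpp
#include <cassert>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper, not part of lmi: format a rate such as 0.05
// as a percentage with a fixed number of decimals.
std::string format_percent(double rate, int decimals)
{
    assert(0 <= decimals);
    std::ostringstream oss;
    // Classic locale: '.' as the decimal point, no digit grouping.
    oss.imbue(std::locale::classic());
    oss << std::fixed << std::setprecision(decimals) << 100.0 * rate << '%';
    return oss.str();
}
```

With this sketch, format_percent(0.05, 3) yields "5.000%" and
format_percent(0.05, 0) yields "5%".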

Restating your description of the data flow a little differently:

(0) numeric data from C++ calculations...
(1) --(C++, libxml2)-->          data.xml
...numbers changed to strings in step (1) or (2)?
(2) --(XSLT, libxslt, fo.xsl)--> report.fo
(3) --(java, apache-fop)-->      report.pdf

we agree that (3) is not the answer--it's java, after all. And (0)
can't be the answer by definition. So it's either (1) or (2). BTW,
that's the most general data flow; but, for the calculation summary,
I think we'd omit the last step, and do everything in RAM without
writing any file.

Here are the considerations, in no particular order:

(A) l10n and i18n. These aren't extremely important, and maybe they
never will be, because 'lmi' is necessarily bound to US regulation.
However, we shouldn't sacrifice them needlessly. Here, (2) seems to
have an advantage if we can do l10n and i18n all in one place.

(B) Separation of concerns: content vs. representation, or Model vs.
View, for instance. The question here is: where is the separation?
We could say
  content = strings (all numbers are first formatted as strings)
  representation = layout
or alternatively
  content = strings and unformatted numbers
  representation = layout + numeric formatting
and both ways separate responsibilities clearly, so I'd say both are
valid designs.

I have the impression that you see (2) as a better way of factoring
the code than (1), though--and I just can't seem to express it
convincingly myself, so I'll invite you to expand on this point if
it is indeed important to you.

(C) Theming. The example above is oversimplified: some reports would
show "5.000%" while others would need "5%" or "0.05". A single xml
file could be used for all reports. This seems to favor (2), but I'm
not really sure about that.

That "single xml file" probably is not a physical file at all, but
rather an image in RAM (except in the case of apache-fop, where a
physical file must exist). If generating a new report requires the
C++ system to generate a new RAM image, so what?

OTOH, a "single xml file" might be saved and used to generate other
reports later. Some systems like 'lmi' do that, though I think most
users ignore that feature. It would be a clear advantage for (2) if
that were useful, but I don't think our users would want it.

(D) Speed: (1) must be faster. For one thing, xslt is necessarily
interpreted. For another, consider the alternative, for 0.05:
somewhere in RAM is
  00111111101011001100110011001100110011001100110011001100110011010
and in C++ we convert it to
  "5.0000000000000000e-2"
and pass that string to xslt, which converts it back to
  00111111101011001100110011001100110011001100110011001100110011010
(accurately, we hope--see (F) below) and then converts that to
  "5.0%"
or whatever we want. That's three conversions, but formatting in C++
would require only one, while avoiding some interpretive overhead.

That said, the speed penalty might not be noticeable. We don't know
unless we measure it.
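The C++ half of that round trip can be sketched as follows. This is
only an illustration under assumptions: to_text(), from_text(), and
round_trips() are hypothetical names, and the sketch says nothing
about how any xslt processor parses the string, which is the part
we'd actually have to measure. It writes max_digits10 significant
digits, which is enough for an IEEE 754 double to survive a
double-to-string-to-double round trip when both conversions round
correctly.

```cpp
#include <cassert>
#include <cstdlib>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper: write a double in scientific notation with
// max_digits10 significant digits (one before the point, the rest after).
std::string to_text(double d)
{
    std::ostringstream oss;
    oss << std::scientific
        << std::setprecision(std::numeric_limits<double>::max_digits10 - 1)
        << d;
    return oss.str();
}

// Hypothetical helper: read the string back, as a consumer of the
// xml data would have to do before formatting the number.
double from_text(std::string const& s)
{
    char* end = nullptr;
    double const d = std::strtod(s.c_str(), &end);
    assert(end != s.c_str()); // at least one character was parsed
    return d;
}

// True if double -> string -> double reproduces the original value.
bool round_trips(double d)
{
    return from_text(to_text(d)) == d;
}
```

A timing loop around round_trips() versus a single formatting call
would give a rough idea of the cost of the extra conversions in C++
alone, without the interpretive overhead.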

(E) Scaling, as described earlier in this thread. This is already
done in C++, but could be accomplished in xslt as well. It would be
slower to do this calculation in xslt, but speed is item (D).

(F) Accuracy and precision: I hope these wouldn't be issues with (2),
but we are already highly confident that they aren't issues with (1).
Yet (2) might need great care to keep entropy from seeping in through
the triple conversion described in (D), if that problem can be avoided
at all.

Losing precision in the last place or two rarely matters, but when
it does matter, it can be troublesome. If the binary approximation
for 0.05
  00111111101011001100110011001100110011001100110011001100110011010
gets changed to
  00111111101011001100110011001100110011001100110011001100110011000
                                        only this bit is changed ^
and then formatted to one decimal place, the result will be zero
instead of "0.1". If that problem can arise, then it will; and users
will find it and complain about it. Their complaints are difficult
to answer: they may have entered "0.05" explicitly, and any
schoolchild knows how to round that correctly.
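That one-bit scenario can be reproduced with a short sketch. The
function names are hypothetical, and it assumes IEEE 754 binary64
doubles and a printf family that rounds correctly; it clears the
second-lowest bit of the stored value, which is the bit marked
above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

static_assert(sizeof(double) == sizeof(std::uint64_t), "binary64 assumed");

// Hypothetical helper: clear the second-lowest bit of the stored
// representation, simulating a small loss of precision.
double clear_second_lowest_bit(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    bits &= ~std::uint64_t(2);
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Hypothetical helper: format with exactly one decimal place.
std::string one_decimal(double d)
{
    char buf[32];
    int const n = std::snprintf(buf, sizeof buf, "%.1f", d);
    assert(0 < n && n < static_cast<int>(sizeof buf));
    return std::string(buf);
}
```

Under those assumptions, one_decimal(0.05) gives "0.1", while
one_decimal(clear_second_lowest_bit(0.05)) gives "0.0", because the
perturbed value lies slightly below 0.05.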

(G) IIABDFI ( http://www.jargon.ru/slova.php?cat=227&id=712 ). We've
already done (1). It is known to work.

(H) OAOO ( http://c2.com/xp/OnceAndOnlyOnce.html ). It would seem
wrong to do some numeric formatting in xslt and some in C++, and
better to do it all in one or the other. There is some that is
already done in C++ and probably should remain there: for example,
PrintFormSpecial() in 'custom_io_0.cpp', and the RegressionTest*
functions.

(I) Testability. I'm sure automated regression tests could be
written for (2), but we already have a facility in place for (1),
and it's probably easier to do numerical testing in C++.

In summary, both (1) and (2) have advantages and disadvantages.
Weighing them carefully would require a lot of investigation: we
don't know whether precision is a problem, or whether the speed
difference is noticeable, unless we measure them. However, I think
we have enough information already without spending a lot of time
on such research. I see the arguments in favor of (1) as stronger,
especially because it's already in place and has less risk. Do you
strongly disagree? Have I missed anything?



