[lmi] Toward a 7702A testing strategy


From:	Greg Chicares
Subject:	[lmi] Toward a 7702A testing strategy
Date:	Fri, 06 Jan 2006 14:13:27 +0000
Later this year, we'll rewrite 'ihs_irc7702a.?pp' to reflect new
specifications and address known inaccuracies in its implementation
of section 7702A of the US tax code. It would be unwise to commence
that work without a testing strategy. The statute prescribes a
bright-line test, and an error of one cent in the wrong direction
can lead to tens of thousands of dollars in fees and penalties.

Here's an advocacy piece I wrote last year, slightly retouched.

I. Validate 7702A calculations with a unit test, not a system test.

Let's use these definitions from the IEEE Standard Computer Dictionary:

http://www.sei.cmu.edu/str/indexes/glossary/index_body.html
|
| System testing: testing conducted on a complete, integrated system
| to evaluate the system's compliance with its specified requirements
|
| Unit testing: testing of individual hardware or software units or
| groups of related units

7702A calculations are one "unit" among many in a life-insurance admin
or illustration system. One unit test might look like this:

- original state
  non-MEC = status
        3 = 7702A "contract year"
   100000 = account value
  1000000 = specified amount
  1000000 = 7702A "lowest death benefit"

- transaction
   500000 = specified-amount decrease

- resulting state (what the test validates)
  non-MEC = status
        3 = 7702A "contract year"
   100000 = account value
   500000 = specified amount
        0 = seven-pay premium

That's simplified (I omitted age, NSP, and many other things) only
for presentation of the concept. A real-world unit test is as simple
as it can be (but no simpler). It isolates the inputs and outputs we
care about. What it isolates is small enough to calculate by hand.
If you repeat a unit test and find a discrepancy, you immediately
know exactly which calculation must be broken, and writing a defect
report is trivial.
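
The comparison step of such a test can be sketched as follows. This is only an illustration: 'State', 'differing_fields', and the other names are hypothetical, not taken from lmi, and no 7702A calculation is performed here. The expected values are those of the "resulting state" above, held in integer cents so that a one-cent error is exact.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical snapshot of the few fields the example isolates.
// Currency amounts are integer cents, so comparisons are exact.
struct State
{
    bool      is_mec;
    int       contract_year;
    long long account_value;     // cents
    long long specified_amount;  // cents
    long long seven_pay_premium; // cents
};

// Name every field whose actual value differs from its expected value.
std::vector<std::string> differing_fields(State const& expected, State const& actual)
{
    std::vector<std::string> names;
    if(expected.is_mec            != actual.is_mec           ) names.push_back("status");
    if(expected.contract_year     != actual.contract_year    ) names.push_back("contract year");
    if(expected.account_value     != actual.account_value    ) names.push_back("account value");
    if(expected.specified_amount  != actual.specified_amount ) names.push_back("specified amount");
    if(expected.seven_pay_premium != actual.seven_pay_premium) names.push_back("seven-pay premium");
    return names;
}

// The "resulting state" from the example: non-MEC, year 3,
// 100000 account value, 500000 specified amount, 0 seven-pay premium.
State expected_after_decrease()
{
    return State{false, 3, 10000000LL, 50000000LL, 0LL};
}

// Stop (in a debug build) unless the unit under test reproduced
// the expected state exactly.
void check(State const& expected, State const& actual)
{
    assert(differing_fields(expected, actual).empty());
}
```

Whatever the unit under test returns would be passed as 'actual'. If it were off by one cent in the seven-pay premium, 'differing_fields' would return just that one name, which is most of the defect report.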

A system test, on the other hand, might look like this:

- setup
  1000000 = specified amount
       45 = issue age
    40000 = annual premium to be paid every year
  context: one particular product

- outcome that we really care about
    50000 = original seven-pay premium
        9 = year it becomes a MEC

- what we actually would wind up testing
    zillions of values

That's extremely simplified, of course. The output is all daily or
monthly values--product mechanics are commingled with what we really
care about. We can extract some subset like ninth-year 7702A values,
but we can hardly know that the benchmark results we originally stored
for comparison are correct--we'd have to calculate all values by hand,
and that's infeasibly laborious. If some detail of monthiversary
processing is later changed, the benchmark becomes invalid. Therefore,
if we find a discrepancy when we repeat the tests later, we don't know
whether it has anything to do with 7702A. System testing isn't the
appropriate way to test a unit. Unit testing is.

II. Automate all tests. Use all tests as regression tests.

http://www.sei.cmu.edu/str/indexes/glossary/index_body.html
|
| Regression testing: selective retesting of a system or component to
| verify that modifications have not caused unintended effects and
| that the system or component still complies with its specified
| requirements

Unit tests can be regression tests. System tests can be regression
tests. Any test you run repeatedly is a regression test. (Originally,
the term meant only tests added as a result of fixing a defect; it
ensured that behavior didn't "regress" to its previous defective
state. But it's evolved to encompass any test that guards against
deviation from an original correct behavior.)

"Automate all tests" may be seen as a polemical statement:

http://en.wikipedia.org/wiki/Software_testing
|
| Manual vs. Automated
|
| Some writers believe that test automation is so expensive relative
| to its value that it should be used sparingly. Others, such as
| advocates of agile development, recommend automating 100% of all
| tests. A challenge with automation is that automated testing
| requires automated test oracles (an oracle is a mechanism or
| principle by which a problem in the software can be recognized).

"Agile development" advocates respond: tests must run quickly, so
design them with speed in mind; and manual is the opposite of fast, so
design an automated testing framework up front. Tests that take a day
to run are tests a vendor would ask us not to run often. Manual tests
suffer from a resource constraint on our end: "it'd take three person-
days to run the whole test suite, and we can spare only half a person,
but we have to release this tomorrow".

"Agile" developers would also quarrel with the word "selective" in the
definition above. If the tests are fast enough, then even thinking
about being "selective" takes more time than running the whole suite
repeatedly.

http://xp.c2.com/ContinuousIntegration.html
| we have UnitTests that check whether (a) class X works, and (b) all
| other classes work in the context of class X. The UnitTests run in
| under 5 minutes, checking everything. (We are only testing around
| 1000 classes, with only about 20,000 individual checks, but, well, we
| all know Smalltalk is slow. C++ would probably be a lot faster. ;-> )

Rounding is crucial in lmi. The rounding unit has 8216 tests, and they
all run in thirteen thousandths of a second. Maybe we could do with
fewer tests,
but why spend any thought on that? More complicated calculations would
take a little more time; we test about forty other units, and the
whole suite runs in less than ten seconds. Our system tests take eight
minutes for 1327 tests, of which 295 are intended specifically to test
7702A--though none of the 295 has been verified. It's interesting to
compare these two approaches (numbers rounded):

  approach:                   all system tests   rounding unit tests
  number of tests:            1300               8000
  time:                       500 s              0.013 s
  time per test:              0.38 s             0.0000016 s
  number of values:           13000000           8000
  time per value:             0.000038 s         0.0000016 s
  % of values verified:       about 0%           100%
  is it right?                uh, dunno          yes, assuredly

Spending a fraction of a second to test a few thousand values, each
of which has been carefully validated, is sane. Spending several
minutes to test over ten million values, few or none of which we
really know to be correct, is a different matter. My main regrets are
that we don't have enough unit tests yet, and that we tried to write
system tests where unit tests would have been more powerful as well
as faster--notably, for 7702 and 7702A. I intend not to repeat that
mistake.
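
The per-test and per-value figures in the comparison above are simple quotients of the rounded counts and times. A small sketch that reproduces them, with hypothetical names ('Approach', 'seconds_per_test', and so on) chosen only for this illustration:

```cpp
#include <cassert>

// Rounded figures from the comparison: counts and total run time.
struct Approach
{
    double tests;
    double seconds;
    double values;
};

// Total time divided by the number of tests.
double seconds_per_test(Approach const& a)
{
    assert(0.0 < a.tests);
    return a.seconds / a.tests;
}

// Total time divided by the number of values compared.
double seconds_per_value(Approach const& a)
{
    assert(0.0 < a.values);
    return a.seconds / a.values;
}

Approach const system_tests  {1300.0, 500.0, 13000000.0};
Approach const rounding_tests{8000.0, 0.013,     8000.0};
```

With these inputs, 'seconds_per_test(system_tests)' is about 0.38 and 'seconds_per_value(system_tests)' about 0.000038, while both quotients for the rounding tests are about 0.0000016, because each rounding test checks exactly one value.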

Unit tests are the fastest tests, so prefer them for testing units
like 7702A. System tests are also necessary, but don't try to use
them where unit tests are more appropriate. Some testing is going to
remain manual ("they changed the webpage to blue on green, so it's
inaccessible to the colorblind"), but prefer automation wherever it's
feasible. Automation, of course, gives the lowest lifetime costs,
because computers check output more cheaply than people can.

III. 7702A testing

Consider the above points in the 7702A context. The lmi system tests
for 7702A are documented like this (small sample):

  ben incr [8] with unnec prem [8] (opt A) nonmec [defect?]
  ben incr [8] with unnec prem [8] (opt B) nonmec [defect?]
  ben incr [8] with unnec prem [8] (opt ROP) nonmec [defect?]
  opt B, w/d, but not below init DB - nonmec, cvat
  opt ROP, w/d, not below init DB - non-mec, cvat
  -50% SA rate, nec prem pmt [8] (AV < DCV), cvat
  12% cred rate, nec prem pmt [8] (DCV < AV), cvat

It is interesting to inquire what happens when, I guess, the full
necessary premium is paid in year eight, and to consider how that
might differ depending on whether the account value is higher or lower
than the deemed cash value. And it's clever to test the latter
condition by manipulating the separate- and general-account rates.
Unfortunately, forcing 7702A into system testing requires that sort
of cleverness. So much cleverness was expended frivolously that not
enough remained to look into the suspected defects.

Each of the 295 7702A system tests produces about ten thousand
year-end values, few of which are relevant to 7702A. Monthly detail
would be about twelve times as long and take twelve times as much
time, so we skipped it to keep the tests fast--even though 7702A
events can occur on any monthiversary in an illustration, and even
though tracking down a discrepancy in regression testing probably
requires generating and studying monthly detail.

Since this was originally written six months ago, we've found that
off-anniversary MEC testing is incorrect, in a very general way that
could easily have been detected by a simple unit test. The scope of
the system test was too vast (295 * ~10000 = about 3000000 values),
yet at the same time too narrow (zero valid off-anniversary tests).
And there's another grave problem: none of those roughly 3000000
values is known to be correct. Some are thought to have been validated
to some degree by hand, but which, and how, are questions with no
documented answers. These are mistakes to learn from, not to repeat.

We haven't yet made the time to do 7702A unit tests the right way,
but there's a sketch here:

http://savannah.nongnu.org/cgi-bin/viewcvs/*checkout*/lmi/lmi/irc7702a_test.cpp

Let me translate the 'test02' function you'll find there:

  first month of first year
    100000 specified amount
      1000 payment
  * test: it shouldn't be a MEC

  second month of first year
     99999 payment
  * test: it should be a MEC

There are good reasons why the code is much more verbose than that,
but that's all it really says.

This is independent of any product. It just uses dummy values, like
  NSP: .1, .2, .3 for the first three years
which have the great virtue of simplifying hand calculations.
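
To show how little the translated 'test02' really says, here is a toy version. It is a sketch under loud assumptions, not lmi's code or API: 'ToySevenPayTest' is a hypothetical class that knows only the basic seven-pay rule (a contract becomes a MEC if cumulative payments in the first seven years exceed the cumulative seven-pay premiums), it ignores months, material changes, and everything else, and the seven-pay premium passed to it is a dummy value. Amounts are integer cents.

```cpp
#include <cassert>

// Toy model for illustration only: not lmi's implementation.
class ToySevenPayTest
{
  public:
    // The seven-pay premium is supplied by the caller; here it is
    // always a dummy value chosen to make hand calculation easy.
    explicit ToySevenPayTest(long long seven_pay_premium_cents)
        :seven_pay_premium_(seven_pay_premium_cents)
    {}

    // Record a payment made in the given contract year (1-based);
    // return true if the contract is a MEC afterward.
    bool pay(int contract_year, long long payment_cents)
    {
        assert(1 <= contract_year);
        cumulative_payments_ += payment_cents;
        long long const limit = seven_pay_premium_ * contract_year;
        if(contract_year <= 7 && limit < cumulative_payments_)
            {
            is_mec_ = true; // Once a MEC, always a MEC.
            }
        return is_mec_;
    }

    bool is_mec() const {return is_mec_;}

  private:
    long long seven_pay_premium_;
    long long cumulative_payments_ = 0;
    bool      is_mec_              = false;
};
```

With a dummy seven-pay premium of 5000.00 (500000 cents), a first-year payment of 1000.00 leaves the contract a non-MEC, and a further first-year payment of 99999.00 makes it a MEC, which is the whole content of the translation above. The same toy also shows the bright line: paying exactly 5000.00 in the first year is fine, and one more cent is not.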

There are only seven such tests; they take about two-thousandths of a
second each. The system tests mentioned above take a third of a
second each--more than two orders of magnitude slower. A complete
7702A unit-test suite might have a thousand tests and take half a
second to run. It would take a month to write and validate that many
tests at a rate of ten minutes apiece, but that's the total lifetime
cost unless the tax law changes, because half a second of computer
time costs nothing even if you spend it every day. Can't spend a
month? Then spend what you can afford, knowing that you're spending
it in the most effective way possible.

We can test lmi this way, because lmi is under our control. Other
systems (e.g., for administration) can and should use the same tests.
For this to be feasible, we need an interface to the target system's
calculations. We need to be able to say "jam in these values at this
point in time" and "spit back these values after a 7702A cycle", and
those have to be machine rather than human instructions--no rekeying.
And we need a program on this end to send those instructions, receive
the results, and compare the results against values known to be
correct. These are not exotic demands that contemporary technology
can't meet. There's a term for it--client-server--and that's a core
technology in any company's systems strategy today. Insurers should
make this a nonnegotiable requirement for any vendor system.
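
One possible shape for that interface, sketched with hypothetical names throughout ('TargetSystem', 'TestCase', 'failures', 'EchoSystem'); none of this is an existing lmi or vendor API. A real implementation of 'TargetSystem' would forward each call to the remote system. The stand-in here runs in-process and merely echoes its input, so that the runner itself can be exercised.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Named values in integer cents (or plain integers for counts).
using Values = std::map<std::string, long long>;

// Hypothetical machine interface that each target system would expose.
class TargetSystem
{
  public:
    virtual ~TargetSystem() = default;
    virtual void   set_values(Values const&) = 0; // "jam in these values"
    virtual void   run_7702a_cycle()         = 0;
    virtual Values get_values() const        = 0; // "spit back these values"
};

struct TestCase
{
    std::string name;
    Values      input;
    Values      expected; // validated by hand beforehand
};

// Run every case; return the names of those whose results differ.
std::vector<std::string> failures(TargetSystem& system, std::vector<TestCase> const& cases)
{
    std::vector<std::string> failed;
    for(auto const& c : cases)
        {
        system.set_values(c.input);
        system.run_7702a_cycle();
        Values const actual = system.get_values();
        for(auto const& kv : c.expected)
            {
            auto const i = actual.find(kv.first);
            if(i == actual.end() || i->second != kv.second)
                {
                failed.push_back(c.name);
                break;
                }
            }
        }
    return failed;
}

// Stand-in for a real system: returns whatever was set, unchanged.
class EchoSystem : public TargetSystem
{
  public:
    void   set_values(Values const& v) override {values_ = v;}
    void   run_7702a_cycle()           override {}
    Values get_values() const          override {return values_;}

  private:
    Values values_;
};
```

The same vector of test cases could then be run against every system that implements the interface, and an empty result from 'failures' would mean that system agrees with the hand-validated values.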

Of course, once the test suite is established, it's not hard to
apply it to every system. And once there's an automated test suite
for every system, it's not hard to make them all match closely.

Here, I've restricted the scope to abstract 7702A transactions.
Tabular values are readily checked by other means that are simple to
demonstrate. If things like NSPs are to be calculated from first
principles, lmi has some unit tests for that already. Deemed cash
value is a chimera that deserves separate consideration, but is
beyond the scope of this posting.