Re: [lmi] Contradictory performance measurements


From: Vadim Zeitlin
Subject: Re: [lmi] Contradictory performance measurements
Date: Thu, 8 Apr 2021 15:10:47 +0200

On Wed, 7 Apr 2021 23:28:11 +0000 Greg Chicares <gchicares@sbcglobal.net> wrote:

GC> Is it faster to divide vectors by a common divisor,
GC> or to multiply them by that divisor's reciprocal?
GC> Only by measuring can we tell. But my measurements
GC> seem to contradict each other (though I know which
GC> has to be the correct one).

 I would definitely expect multiplication to be faster. The exact timings
depend on whether x87 or SSE instructions are used, but division is
supposed to be roughly 3 times slower, I believe. Of course it also depends
on the data, so measuring is still a good idea.
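 To see the difference in isolation, here is a minimal standalone
micro-benchmark sketch (not lmi code; everything in it is made up for
illustration, and it needs optimization, e.g. -O2, to mean anything):

---------------------------------- >8 --------------------------------------
// Compare dividing a vector by a constant with multiplying by its
// precomputed reciprocal. The 'sink' accumulator keeps the work
// observable so it isn't optimized away; timings are indicative only.
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<double> v(1'000'000);
    std::iota(v.begin(), v.end(), 1.0);

    double const scale = 100.0;
    double const inv   = 1.0 / scale;

    auto time = [&](auto f)
        {
        double sink = 0.0;
        auto t0 = std::chrono::steady_clock::now();
        for(int i = 0; i < 100; ++i)
            for(double x : v) sink += f(x);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << sink << ' ';
        return std::chrono::duration<double>(t1 - t0).count();
        };

    std::cout << "div: " << time([&](double x){return x / scale;}) << " s\n";
    std::cout << "mul: " << time([&](double x){return x * inv;})   << " s\n";
}
---------------------------------- >8 --------------------------------------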

GC> I used this experimental patch:
GC> 
GC> --8<----8<----8<----8<----8<----8<----8<----8<--
GC> diff --git a/round_to.hpp b/round_to.hpp
GC> index 308a91f8e..fce6bf44e 100644
GC> --- a/round_to.hpp
GC> +++ b/round_to.hpp
GC> @@ -359,7 +359,8 @@ inline RealType round_to<RealType>::operator()(RealType r) const
GC>  {
GC>      return static_cast<RealType>
GC>          ( rounding_function_(static_cast<RealType>(r * scale_fwd_))
GC> -        * scale_back_
GC> +//      * scale_back_
GC> +        / scale_fwd_
GC>          );
GC>  }
GC>  
GC> @@ -402,7 +403,8 @@ inline currency round_to<RealType>::c(RealType r) const
GC>  {
GC>      RealType const z = static_cast<RealType>
GC>          ( rounding_function_(static_cast<RealType>(r * scale_fwd_))
GC> -        * scale_back_cents_
GC> +//      * scale_back_cents_
GC> +        / scale_fwd_cents_
GC>          );
GC>      // CURRENCY !! static_cast: possible range error
GC>      return currency(static_cast<currency::data_type>(z), raw_cents {});
GC> --8<----8<----8<----8<----8<----8<----8<----8<--
[...]
GC> Now I run
GC>   gwc/speed_test.sh
GC> which rebuilds every architecture and runs
GC>   make cli_timing
GC> for each one. With the patch, speed decreases by several
GC> percent in general, affecting all architectures.

 I've only done this for Linux x86-64. (BTW, I wonder whether you'd like me
to check lmi performance on ARM64, just for an idea of how it compares -- I
admit that I'd be curious to do it, but it might take quite some time to
make it build there, so I won't do it unless you tell me that it would be
useful.) I can confirm that I also see the slowdown from the patch, and
here it affects the mean and the least times equally, so I only give one
column (this is an average of 6 runs of each version):

                     slowdown

naic, no solve      : -5%
naic, specamt solve : -5%
naic, ee prem solve : -5%
finra, no solve     : -1%
finra, specamt solve: -3%
finra, ee prem solve: -3%

GC>                         mean least
GC>   naic, no solve      :  -2%  -1%
GC>   naic, specamt solve :  -3%  -2%
GC>   naic, ee prem solve :  -2%  -2%
GC>   finra, no solve     :  -1%  -1%
GC>   finra, specamt solve:  -1%  -1%
GC>   finra, ee prem solve:  -1%  -1%
GC> 
GC>   naic, no solve      :  -1%  -4%
GC>   naic, specamt solve :  -5%  -5%
GC>   naic, ee prem solve :  -5%  -5%
GC>   finra, no solve     :  -3%   0%
GC>   finra, specamt solve:  -4%  -4%
GC>   finra, ee prem solve:  -4%  -4%
GC> 
GC>   naic, no solve      :  -3%  -3%
GC>   naic, specamt solve :  -3%  -3%
GC>   naic, ee prem solve :  -3%  -3%
GC>   finra, no solve     :  -1%  -1%
GC>   finra, specamt solve:  -4%  -2%
GC>   finra, ee prem solve:  -3%  -2%
[keeping your values for reference]

GC> The effect is greatest for scenarios that spend more of their
GC> time in the rounding-intensive monthiversary loop.

 My results confirm this, at least. FWIW I also see much higher standard
deviation for the first 3 lines (~3 times higher than for the last 3 lines
for both mean and least time).

GC> Great--I thought--now I can use 'perf' to find out exactly
GC> what's going on;

 I'm not sure I see the appeal of using perf here: don't we already know
what's going on? Considering that the patch makes one small change (well, 2
small changes), it seems like we already have all the information we need.

GC> however, with the patch, measuring the same operation that 'make
GC> cli_timing' performs, with these commands...
GC> 
GC> $LD_LIBRARY_PATH=.:/opt/lmi/bin:/opt/lmi/local/gcc_x86_64-pc-linux-gnu/lib/:/srv/cache_for_lmi/perf_ln /srv/cache_for_lmi/perf_ln/perf_4.19 record --freq=max --call-graph=lbr /opt/lmi/bin/lmi_cli_shared --accept --data_path=/opt/lmi/data --selftest

 I wanted to reproduce exactly this, but my CPU is too old to support the
lbr option, so I had to use dwarf (I can retry on a newer machine later,
but I'm not sure it will change anything). I've also restricted the self
test to the first scenario only, as otherwise "perf report" took too long
to show the results. (It's still pretty slow, and I definitely need to
upgrade the CPU on this machine -- it's 10 years old by now. OTOH it still
works pretty well for most other things.)

GC> $LD_LIBRARY_PATH=.:/srv/cache_for_lmi/perf_ln /srv/cache_for_lmi/perf_ln/perf_4.19 report
GC> 
GC> ...and filtering for 'round_to', I see:
GC> 
GC>      0.01%     0.01%  lmi_cli_shared  liblmi.so  [.] round_to<double>::c
GC>      0.01%     0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::c@plt
GC>      0.00%     0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::round_to
GC>      0.00%     0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::round_to@plt
GC> 
GC> ...which seems to suggest that, even with the patch above,
GC> 'round_to' should have virtually no effect on lmi's speed.

 I think it might be because it's getting fully inlined and perf doesn't
attribute the time spent in its instructions to the function itself. In
fact, I don't see round_to at all in the symbols list anywhere.
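
 If you did want round_to to show up as its own symbol, one trick (just a
sketch, assuming gcc or clang; the function below is hypothetical, not
lmi's actual code) is to suppress inlining temporarily so that perf
attributes the samples to the symbol itself rather than to every caller:

---------------------------------- >8 --------------------------------------
#include <cmath>

// Profiling aid only: 'noinline' keeps this as a distinct symbol, so
// perf samples land on it instead of being folded into the callers.
// Remove it after profiling -- inlining is wanted in production.
__attribute__((noinline))
double scale_and_round(double r, double scale_fwd)
{
    // same shape as round_to::operator(): scale, round, unscale
    return std::nearbyint(r * scale_fwd) / scale_fwd;
}
---------------------------------- >8 --------------------------------------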

GC> How can I resolve this apparent contradiction?

 By concluding that it's only apparent :-)

GC> with patch:
GC>   naic, no solve      : 2.064e-02 s mean;      20516 us least of  49 runs
GC>   naic, specamt solve : 3.791e-02 s mean;      37598 us least of  27 runs
GC>   naic, ee prem solve : 3.452e-02 s mean;      34250 us least of  29 runs
GC>   finra, no solve     : 5.830e-03 s mean;       5611 us least of 100 runs
GC>   finra, specamt solve: 2.154e-02 s mean;      21226 us least of  47 runs
GC>   finra, ee prem solve: 1.981e-02 s mean;      19484 us least of  51 runs
GC> 
GC> without patch:
GC>   naic, no solve      : 1.983e-02 s mean;      19725 us least of  51 runs
GC>   naic, specamt solve : 3.639e-02 s mean;      36141 us least of  28 runs
GC>   naic, ee prem solve : 3.314e-02 s mean;      33016 us least of  31 runs
GC>   finra, no solve     : 5.772e-03 s mean;       5537 us least of 100 runs
GC>   finra, specamt solve: 2.084e-02 s mean;      20510 us least of  48 runs
GC>   finra, ee prem solve: 1.924e-02 s mean;      19084 us least of  52 runs

 Just for reference, my absolute numbers are

Test speed:
  naic, no solve      : 2.357e-02 s mean;      23270 us least of  43 runs
  naic, specamt solve : 4.327e-02 s mean;      42742 us least of  24 runs
  naic, ee prem solve : 3.998e-02 s mean;      39464 us least of  26 runs
  finra, no solve     : 7.376e-03 s mean;       7116 us least of 100 runs
  finra, specamt solve: 2.528e-02 s mean;      22656 us least of  40 runs
  finra, ee prem solve: 2.371e-02 s mean;      23088 us least of  43 runs

for the very old i7-2600 CPU and

Test speed:
  naic, no solve      : 1.761e-02 s mean;      17431 us least of  57 runs
  naic, specamt solve : 3.224e-02 s mean;      31884 us least of  32 runs
  naic, ee prem solve : 2.933e-02 s mean;      29136 us least of  35 runs
  finra, no solve     : 5.254e-03 s mean;       5135 us least of 100 runs
  finra, specamt solve: 1.846e-02 s mean;      18247 us least of  55 runs
  finra, ee prem solve: 1.701e-02 s mean;      16823 us least of  59 runs

for a newer (but still pretty old) i7-4712HQ. I think we've discussed
this in the past, but it seems clear that a Xeon is not ideal for running
lmi if a 7-year-old notebook CPU can beat it so significantly.

GC> Looking harder, I used this command:
GC> 
GC> $LD_LIBRARY_PATH=.:/srv/cache_for_lmi/perf_ln /srv/cache_for_lmi/perf_ln/perf_4.19 diff

 FWIW the numbers seem too volatile for the differences you see in
perf-diff to be really significant. I.e. running perf-record twice with the
same binary, perf-diff shows ~0.4% variation for AccountValue::DoMonthDR()
and ~0.3% for its TxSetDeathBft(), GetRefundableSalesLoad(), and
TxSetBOMAV() methods, while detail::round_up<double> shows ~0.1% -- again,
for the same binary.

 On a faster CPU I see even greater differences (of up to 2%) when running
the same binary, so IMO anything under 1% should be completely ignored.
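
 A simple way to make that judgement systematically (a sketch, not lmi
code) is to compare a perf-diff delta for a symbol against the spread of
repeated runs of the same binary:

---------------------------------- >8 --------------------------------------
#include <cmath>
#include <numeric>
#include <vector>

// Sample standard deviation of repeated measurements (e.g. a symbol's
// percentage across several perf runs of one binary; needs at least
// two samples). A perf-diff delta within a couple of these is noise.
double stddev(std::vector<double> const& samples)
{
    double const mean =
        std::accumulate(samples.begin(), samples.end(), 0.0)
        / samples.size();
    double ss = 0.0;
    for(double s : samples) ss += (s - mean) * (s - mean);
    return std::sqrt(ss / (samples.size() - 1));
}
---------------------------------- >8 --------------------------------------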

GC> and looked for "round" as opposed to "round_to"; that's
GC> interesting (here it is after passing through 'grep round'):
GC> 
GC>      0.41%     +0.09%  liblmi.so               [.] detail::round_down<double>
GC>      0.52%     -0.01%  liblmi.so               [.] detail::round_up<double>
GC>      0.01%     -0.00%  liblmi.so               [.] round_to<double>::c
GC>                +0.00%  liblmi.so               [.] member_cast<rounding_parameters, rounding_rules>

 So this is just measurement noise.

GC> Indeed round_to is implemented in terms of the
GC> round_{down,up} functions on the first two lines above.

 Together these functions seem to take ~0.75% of the total running time in
the "naic, no solve" scenario for me, with the lion's share going to
round_up(). But this doesn't include the multiplication/division by
scale_back_/scale_fwd_ anyway, as that happens not inside these functions
but in round_to::operator() itself, which is getting fully inlined -- and
luckily so, for such a simple function.

GC> I have only one theory to explain this: rounding involves
GC> a great number of function calls,

 No, it definitely doesn't. Both round_{down,up}() contain just a dozen
assembler instructions and are leaf functions that don't call anything
else.

GC> And is there any useful thing I can do with 'perf' here,
GC> or is it just not the right tool for this job?

 Sorry, I have trouble answering this because I'm not really sure what the
job is. You've successfully demonstrated that multiplication is faster than
division, seemingly answering the original question, so what exactly are
you trying to do? If you want additional confirmation of this answer from
perf (but why?), you need to find the places where round_to is used and
confirm, in the annotated listing of the corresponding function, that the
DIV instruction takes more cycles than MUL did.

 But by now I'm pretty sure that this is indeed the case, so I still don't
know why you would spend time doing this. OTOH, you can get something
really useful from examining the annotated listing because, from just a
very superficial look at it, you can see that a lot of time is taken by
dividing by cents_per_dollar in currency::d(). There is, of course, already
a comment there about this, and I can't answer the question about
correctness there, but applying
---------------------------------- >8 --------------------------------------
diff --git a/currency.hpp b/currency.hpp
index ce04d75a2..a73a2b627 100644
--- a/currency.hpp
+++ b/currency.hpp
@@ -40,6 +40,7 @@ class currency

     static constexpr int    cents_digits     = 2;
     static constexpr double cents_per_dollar = 100.0;
+    static constexpr double cents_per_dollar_inv = 0.01;

   public:
     using data_type = double;
@@ -58,8 +59,7 @@ class currency

     data_type cents() const {return m_;}
     // CURRENCY !! add a unit test for possible underflow
-    // CURRENCY !! is multiplication by reciprocal faster or more accurate?
-    double d() const {return m_ / cents_per_dollar;}
+    double d() const {return m_ * cents_per_dollar_inv;}

   private:
     explicit currency(data_type z, raw_cents) : m_ {z} {}
---------------------------------- >8 --------------------------------------
does result in a dramatic speed gain in the selftest, which now gives, on
the faster machine:

Test speed:
  naic, no solve      : 1.342e-02 s mean;      13161 us least of  75 runs
  naic, specamt solve : 2.506e-02 s mean;      24802 us least of  40 runs
  naic, ee prem solve : 2.294e-02 s mean;      22759 us least of  44 runs
  finra, no solve     : 4.681e-03 s mean;       4532 us least of 100 runs
  finra, specamt solve: 1.540e-02 s mean;      15084 us least of  65 runs
  finra, ee prem solve: 1.439e-02 s mean;      14115 us least of  70 runs

i.e. a ~20% speedup.
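
 As for the correctness question in the removed comment: 0.01 is only the
nearest double to 1/100, so x * 0.01 and x / 100.0 can differ in the last
ulp for some inputs. A minimal standalone check (not lmi code) makes this
easy to quantify for the integral cents values currency actually stores:

---------------------------------- >8 --------------------------------------
#include <cstdio>

// Count integral cents values for which multiplying by the reciprocal
// gives a different double than dividing by 100.0.
int main()
{
    int differ = 0;
    for(int cents = 0; cents <= 1000000; ++cents)
        {
        double const x = cents;
        if(x / 100.0 != x * 0.01) {++differ;}
        }
    std::printf("%d of 1000001 values differ\n", differ);
    return 0;
}
---------------------------------- >8 --------------------------------------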

 To conclude, I believe that division should indeed be avoided as much as
possible in favour of multiplication, and at the very least currency::d()
should be changed not to use it.

 Please let me know if you still have any questions about this and/or if
you'd like me to do anything else here.

 Thanks in advance,
VZ
