Re: [lmi] Can linux-perf illuminate this problem?

From:	Vadim Zeitlin
Subject:	Re: [lmi] Can linux-perf illuminate this problem?
Date:	Fri, 5 Mar 2021 23:56:30 +0100

On Fri, 5 Mar 2021 22:38:38 +0000 Greg Chicares <gchicares@sbcglobal.net> wrote:

GC> Thanks, very helpful. In order to be able to "Annotate" this function,
GC> I seem to need to add '--call-graph=none' (discovered by randomly
GC> permuting flags). Then, ultimately I get here:
GC> 
GC>        │     round_to<double>::c(double) const:
GC>   0.32 │       fmull  -0x1e0(%rbp)
GC>  31.01 │       fstpl  -0x1e0(%rbp)
GC>   2.00 │       movsd  -0x1e0(%rbp),%xmm0
GC>   0.43 │     → callq  *0xd50(%r13)
GC>        │             * scale_back_cents_
GC>   0.30 │       fldt   0xd40(%r13)
GC> 
GC> so that FSTPL is apparently the problem.

 This is definitely not something I would have found from looking at the
assembly... But the FST instruction is actually pretty expensive (much more
expensive than FMUL, if I'm reading this correctly!) because it doesn't
just copy the bits around: it also rounds the value when converting from
long double to double (or float), and it handles infinities and NaNs
specially to preserve their bit patterns. So the important lesson, I guess,
is that the conversion from long double to double is not free at all, and
that it would be best to avoid it completely if possible.
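
 To make the conversion concrete, here is a minimal sketch of the two ways
the scaling step could be written. The function names are hypothetical and
this is not the actual code in 'round_to.hpp'; it only illustrates where the
narrowing store comes from. With gcc on x86_64 I'd expect the first variant
to go through the x87 unit and end with a store like the FSTPL above, and
the second to stay in SSE registers, but that's an assumption that should be
checked against the generated assembly.

```cpp
// Hypothetical stand-ins for the scaling step; not the real lmi code.

// Variant 1: the intermediate is 'long double', so the result has to be
// narrowed back to 'double' before it can be returned.
double scale_via_long_double(double x)
{
    long double const scaled = static_cast<long double>(x) * 100.0L;
    return static_cast<double>(scaled); // narrowing conversion happens here
}

// Variant 2: everything stays in 'double', so no narrowing is needed.
double scale_in_double(double x)
{
    return x * 100.0;
}
```

 For everyday values both variants should return the same thing; the
difference is in how much work the processor does to get there, and in
the last bits of results that are not exactly representable.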

GC> FSTPL? with x86_64? Yes, since 'round_to.hpp' uses type
GC> 'long double'. Changing that to 'double' is likely to be a
GC> Really Big Change, which I'm not going to attempt soon.

 Yes, this will change the results of all the computations. OTOH it should
also be noticeably faster.

GC> But the real question is why that FSTPL costs so much. While
GC> I can't yet prove it, the reason seems clear: the argument to
GC> round_to<>::c() is very often an extreme value. In HEAD, it's
GC>   std::numeric_limits<double>::max()
GC> and the code seems equally slow if I change that to
GC>   std::numeric_limits<double>::infinity()
GC> (so that's no silver bullet).
GC> 
GC> Originally we had
GC>   double limit = SOME_BIGNUM;
GC>   double payment = some_everyday_value;
GC>   double limited_payment = std::min(limit, payment);
GC> and the x87 handled that well. It continued to work well
GC> when I semi-currency-ized it:
GC> 
GC>   double limit = SOME_BIGNUM;
GC>   currency payment = some_everyday_value;
GC>   currency limited_payment = round_gross_pmt.c(
GC>     std::min(limit, dblize(payment))
GC>     );
GC> 
GC> But when I fully currency-ized it:
GC> 
GC>   currency limit = round_gross_pmt.c(SOME_BIGNUM);
GC> 
GC> that's the single statement that slowed the whole program
GC> down painfully, because it multiplies SOME_BIGNUM by 100.0
GC> (probably causing overflow) and then FSTPL's the result.

 It should be possible to write a simple standalone test checking this
hypothesis, but I'm not sure if it's really necessary if you're going to
fix it soon anyhow.
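
 For what it's worth, such a test could look roughly like the sketch
below. All names here are made up for illustration, 'everyday_value' is an
arbitrary number, and the timing loop is deliberately crude; it just
repeats the multiply-by-100-and-store step on one input so that an
ordinary amount can be compared against the extreme value used in HEAD.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <limits>

// The two inputs to compare (both are assumptions chosen for illustration).
constexpr double everyday_value = 1234.56;
constexpr double extreme_value  = std::numeric_limits<double>::max();

// Multiply in 'long double', then narrow to 'double': the step suspected
// of being slow when the product does not fit in a 'double'.
double scale_and_store(double x)
{
    long double const scaled = static_cast<long double>(x) * 100.0L;
    return static_cast<double>(scaled);
}

// True if the narrowed result overflowed to infinity for a finite input.
bool overflows_on_store(double x)
{
    return std::isfinite(x) && std::isinf(scale_and_store(x));
}

// Time n narrowing stores of the same input; returns elapsed seconds.
// 'volatile' keeps the compiler from hoisting or eliding the loop body.
double seconds_for_stores(double input, std::size_t n)
{
    assert(0 < n);
    volatile double source = input;
    volatile double sink = 0.0;
    auto const start = std::chrono::steady_clock::now();
    for(std::size_t i = 0; i < n; ++i)
        {
        sink = scale_and_store(source);
        }
    auto const stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

 Calling seconds_for_stores() with 'everyday_value' and then with
'extreme_value', using the same large iteration count, and comparing the
two timings would confirm or refute the hypothesis: if the overflowing
store is the culprit, the second call should be markedly slower.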

GC> I don't yet know the best way to deal with this, but
GC> I do now know exactly what to investigate.

 Good luck!
VZ
