Re: [lmi] Micro-optimization in ledger_format


From: Vadim Zeitlin
Subject: Re: [lmi] Micro-optimization in ledger_format
Date: Sun, 20 Jan 2019 03:59:29 +0100

On Fri, 18 Jan 2019 23:28:33 +0000 Greg Chicares <address@hidden> wrote:

GC> On 2019-01-18 14:15, Greg Chicares wrote:
GC> [...]
GC> > Corresponding changes to stream_cast<>() result in these
GC> > before-and-after timings in the unit test, where only the
GC> > first line is affected by commit 12cb7b91d:
GC> > 
GC> >   stream_cast     : 3.279e-003 s mean;       3203 us least of 100 runs
GC> >   minimalistic    : 2.585e-003 s mean;       2571 us least of 100 runs
GC> >   static stream   : 1.299e-003 s mean;       1159 us least of 100 runs
GC> >   static facet too: 8.441e-004 s mean;        841 us least of 100 runs
GC> >   same, but IIFE  : 8.469e-004 s mean;        840 us least of 100 runs
GC> > 
GC> >   stream_cast     : 1.915e-003 s mean;       1861 us least of 100 runs
GC> >   minimalistic    : 2.583e-003 s mean;       2568 us least of 100 runs
GC> >   static stream   : 1.528e-003 s mean;       1164 us least of 100 runs
GC> >   static facet too: 8.555e-004 s mean;        844 us least of 100 runs
GC> >   same, but IIFE  : 8.531e-004 s mean;        843 us least of 100 runs
GC> > 
GC> > Why is that latest version only half as fast as the streamlined
GC> > unit test (1861 vs 843 us)? Both have the same improvement; the
GC> > only difference is the fancy run-time error reporting in
GC> > stream_cast<>().
GC> > 
GC> > Next I plan to remove all the experimental unit-test code
GC> > (everything except the first timing line) because in retrospect
GC> > the optimized version seems obviously good.
GC> 
GC> In commit be88bed, I've removed (in effect) everything except the
GC> first and last timing lines. Somewhat puzzlingly, the best timing
GC> result is now less good:
GC> 
GC>   stream_cast: 1.907e-003 s mean;       1830 us least of 100 runs
GC>   streamlined: 1.127e-003 s mean;       1124 us least of 100 runs

 I can indeed reproduce this, although it's not quite as pronounced here.
If I run test_stream_cast built from 1fc35235c (parent of be88bed), I get

  Speed tests...
  stream_cast     : 1.300e-03 s mean;       1020 us least of 100 runs
  minimalistic    : 1.070e-03 s mean;        921 us least of 100 runs
  static stream   : 5.741e-04 s mean;        564 us least of 100 runs
  static facet too: 4.355e-04 s mean;        433 us least of 100 runs
  same, but IIFE  : 4.254e-04 s mean;        416 us least of 100 runs
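
 For context, the fastest variant above ("static facet too", with the
stream built by an immediately invoked lambda) corresponds roughly to the
sketch below; the name streamlined_cast is mine, and this illustrates the
technique only, not the exact unit-test code:

    #include <locale>
    #include <sstream>
    #include <string>

    // Sketch: construct the stream, and imbue the locale carrying the
    // facet, exactly once, then reuse them for every conversion.
    // Assumes single-threaded use.
    template<typename T>
    std::string streamlined_cast(T const& value)
    {
        static std::stringstream interpreter = []
            {
            std::stringstream ss;
            ss.imbue(std::locale::classic());
            return ss;
            }();
        interpreter.str(std::string{}); // discard previous contents
        interpreter.clear();            // reset any error state
        interpreter << value;
        return interpreter.str();
    }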

(Notice that the numbers above are different from those given in my previous
email at http://lists.nongnu.org/archive/html/lmi/2019-01/msg00010.html,
as I embarrassingly forgot that I was benchmarking a non-optimized build
there.) The latest (be88bed) version gives:

  Speed tests...
  stream_cast: 1.318e-03 s mean;       1047 us least of 100 runs
  streamlined: 5.427e-04 s mean;        496 us least of 100 runs

At first glance, it's still almost 20% slower, but the minimum time is
almost the same: if I run each binary 50 times, i.e.

        $ repeat 50 ./test_stream_cast -a|fgrep same|cut -c45-48

for the old version and

        $ repeat 50 ./test_stream_cast -a|fgrep streamlined|cut -c40-43

for the new one, I get minimum times of 411 us and 417 us respectively,
which is well within the precision of the experiment. On average, however,
the new code is indeed slower; I attach an image showing the timings side
by side (red is the new one, of course). I'd be curious to know what the
distribution of the results looks like on your side, but for now my
hypothesis is that the time taken by the benchmarked code itself (as perf
would have shown, had I bothered to run it) is the same, but because the
fastest function now accounts for ~33% of the process execution time
instead of ~10% as before, external interruptions (and the machine I run
this on is hardly idle) affect it more, and that is why the average time
is higher. Does this look plausible to you?
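
 To visualize that, a toy harness along the following lines (my sketch,
not lmi's actual timing code) reports both statistics; a single preempted
run visibly inflates the mean while usually leaving the minimum alone:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Toy harness: run f() many times and report the mean and the
    // minimum, in the same format as the unit test's output.
    template<typename F>
    void report(char const* name, F f, int runs = 100)
    {
        std::vector<double> us;
        us.reserve(runs);
        for(int i = 0; i < runs; ++i)
            {
            auto const t0 = std::chrono::steady_clock::now();
            f();
            auto const t1 = std::chrono::steady_clock::now();
            us.push_back
                (std::chrono::duration<double, std::micro>(t1 - t0).count());
            }
        double const mean = std::accumulate(us.begin(), us.end(), 0.0) / runs;
        std::printf
            ("%s: %.3e s mean; %10.0f us least of %d runs\n"
            ,name
            ,mean * 1.0e-6
            ,*std::min_element(us.begin(), us.end())
            ,runs
            );
    }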


GC> Vadim, have you any idea why this could be? It seems to suggest
GC> that the updated stream_cast<>() (which is rarely if ever used,
GC> and therefore doesn't matter much) and therefore the updated
GC> ledger_format() (which matters a great deal) may not incorporate
GC> all the improvements we've identified...but I can't see it.
GC> 
GC> Maybe it's the effect of the impending Super Blood Wolf Moon?

 I could look at the generated assembly in both cases to see what the
difference is, but at least with my measurements I wouldn't worry too much
about it: the streamlined version is still twice as fast, after all.

 And if we really wanted to do this conversion in the fastest possible way,
we ought to use C++17 std::to_chars(), which can be many, many times faster
than stream-based code. For me, the change proposed in this thread is just
the lowest of low-hanging fruit, and I don't think it's worth spending too
much time on it when we could spend that time on other optimizations.
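
 Just for illustration, a to_chars()-based conversion could look like the
sketch below (the name format_fixed is mine, and it assumes a standard
library that implements C++17 floating-point to_chars()):

    #include <charconv>
    #include <string>
    #include <system_error>

    // Sketch: format a double with a fixed number of decimals,
    // locale-independently and without any stream machinery.
    std::string format_fixed(double value, int precision)
    {
        char buf[64];
        auto const result = std::to_chars
            (buf, buf + sizeof buf, value, std::chars_format::fixed, precision);
        if(result.ec != std::errc{})
            {
            return std::string{}; // buffer too small for an extreme value
            }
        return std::string(buf, result.ptr);
    }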

 But please let me know if you disagree,
VZ

