lmi
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [lmi] Micro-optimization in ledger_format


From: Vadim Zeitlin
Subject: Re: [lmi] Micro-optimization in ledger_format
Date: Thu, 17 Jan 2019 03:33:42 +0100

On Thu, 17 Jan 2019 01:35:30 +0000 Greg Chicares <address@hidden> wrote:

GC> Thanks, I had leapt to an incorrect conclusion. Originally, I wrote
GC> this without the '&', and now when I again try omitting it...
GC> 
GC> --- a/stream_cast_test.cpp
GC> +++ b/stream_cast_test.cpp
GC> @@ -71 +71 @@ To cast_2(From from, To = To())
GC> -std::stringstream& imbued()
GC> +std::stringstream imbued()
GC> 
GC> ...I again see these diagnostics:
GC> 
GC> /opt/lmi/src/lmi/stream_cast_test.cpp: In function 'std::stringstream 
imbued()':
GC> /opt/lmi/src/lmi/stream_cast_test.cpp:75:12: error: use of deleted function 
'std::__cxx11::basic_stringstream<_CharT, _Traits, 
_Alloc>::basic_stringstream(const std::__cxx11::basic_stringstream<_CharT, 
_Traits, _Alloc>&) [with _CharT = char; _Traits = std::char_traits<char>; 
_Alloc = std::allocator<char>]'
GC>      return interpreter;
GC>             ^~~~~~~~~~~
GC> In file included from /opt/lmi/src/lmi/stream_cast.hpp:53:0,
GC>                  from /opt/lmi/src/lmi/stream_cast_test.cpp:24:
GC> /usr/lib/gcc/i686-w64-mingw32/7.3-win32/include/c++/sstream:734:7: note: 
declared here
GC>        basic_stringstream(const basic_stringstream&) = delete;
GC>        ^~~~~~~~~~~~~~~~~~
GC> 
GC> ...so apparently at least the copy ctor is undefined

 Yes, streams are movable but not copyable.

GC> It crossed my mind to try move semantics instead, but at the time I
GC> avoided doing so because of your review comment on PR 104. Nevertheless,
GC> I just tried it now, with this diff vs. HEAD (which uses 'git diff -U0'
GC> for extreme terseness):
GC> 
GC> diff --git a/stream_cast_test.cpp b/stream_cast_test.cpp
GC> index d391ed6f..f7d471cd 100644
GC> --- a/stream_cast_test.cpp
GC> +++ b/stream_cast_test.cpp
GC> @@ -71 +71 @@ To cast_2(From from, To = To())
GC> -std::stringstream& imbued()
GC> +std::stringstream imbued()
GC> @@ -75 +75 @@ std::stringstream& imbued()
GC> -    return interpreter;
GC> +    return std::move(interpreter);
GC> @@ -83 +83 @@ To cast_3(From from, To = To())
GC> -    std::stringstream& interpreter {imbued()};
GC> +    std::stringstream&& interpreter {imbued()};
GC> @@ -103 +103 @@ To cast_4(From from, To = To())
GC> -    std::stringstream& interpreter {imbued()};
GC> +    std::stringstream&& interpreter {imbued()};

 I'd actually write this without "&&" in the last 2 chunks, i.e. I'd just
use an object:

        std::stringstream interpreter {imbued};

GC> and the result seems quite interesting:

... but it doesn't change anything and I also still see noticeable slowdown
when using move ctor (exact numbers below).

GC> In light of the extra improvement on the last two lines (e.g., compared
GC> to the original timings quoted immediately below), do you still want to
GC> follow the canonical advice cited in your PR 104 review comment, i.e.,
GC> to avoid that using rvalue references on the theory that they inhibit
GC> RVO, which "should" be more efficient...except that it seemingly isn't?

 It's not that RVO is not efficient, it's that constructing a new object,
even with this optimization, is still not as efficient as not doing it at
all. I.e. RVO is useful to optimize returning a new object, but it doesn't
avoid the overhead of creating a new object in the first place and
apparently even move-constructing a new std::stringstream is still rather
expensive.

GC> How can such a large difference as
GC>   static stream   : 2.392e-006 s mean;          2 us least of 4181 runs
GC>   static facet too: 2.929e-006 s mean;          3 us least of 3415 runs
GC> be obtained by replacing a static local variable with a reference
GC> initialized by a function call?

 I could profile this to find out where exactly is the time spent but, in
principle, it's not very surprising that creating a new object is more
expensive than not doing it.

 IOW, what you should really compare it with is this:
---------------------------------- >8 --------------------------------------
diff --git a/stream_cast_test.cpp b/stream_cast_test.cpp
index f6458217a..e7f8fd7b2 100644
--- a/stream_cast_test.cpp
+++ b/stream_cast_test.cpp
@@ -68,9 +68,9 @@ To cast_2(From from, To = To())
     return result;
 }

-std::stringstream& imbued()
+std::stringstream imbued()
 {
-    static std::stringstream interpreter;
+    std::stringstream interpreter;
     interpreter.imbue(blank_is_not_whitespace_locale());
     return interpreter;
 }
@@ -80,7 +80,7 @@ std::stringstream& imbued()
 template<typename To, typename From>
 To cast_3(From from, To = To())
 {
-    std::stringstream& interpreter {imbued()};
+    static std::stringstream interpreter {imbued()};
     interpreter.str(std::string{});
     interpreter.clear();
     To result = To();
@@ -100,7 +100,7 @@ To cast_3(From from, To = To())
 template<typename To, typename From>
 To cast_4(From from, To = To())
 {
-    std::stringstream& interpreter {imbued()};
+    static std::stringstream interpreter {imbued()};
     interpreter.clear();
     To result = To();
     if
---------------------------------- >8 --------------------------------------
and this version is actually faster than the original one.


 As an aside, I had initially forgotten to remove std::move() from the
return statement in imbued() and clang helpfully told me:

stream_cast_test.cpp:75:12: error: moving a local object in a return statement 
prevents copy elision [-Werror,-Wpessimizing-move]
    return std::move(interpreter);
           ^
stream_cast_test.cpp:75:12: note: remove std::move call here
    return std::move(interpreter);
           ^~~~~~~~~~           ~
1 error generated.

which is really quite nice of it.


GC> Or is this yet another case where I've misinterpreted the data? Or,
GC> alternatively, could it be that these measurements for MinGW gcc-7.3
GC> reflect some peculiarity version of that compiler?

 No, I see the same thing with clang 7 under Linux (which I'm using
specifically to be as different from your environment as possible).

GC> > GC> Because you reported an overall speedup
GC> > GC> in PDF generation on the order of ten percent, I figured that this
GC> > GC> subroutine would have to become about twice as fast--which is 
approximately
GC> > GC> what I observe, so...no surprise there.
GC> > 
GC> >  FWIW I see something similar under Linux with clang:
GC> > 
GC> >   Speed tests...
GC> >   stream_cast     : 3.584e-06 s mean;          3 us least of 2790 runs
GC> >   minimalistic    : 2.725e-06 s mean;          2 us least of 3670 runs
GC> >   static stream   : 1.947e-06 s mean;          1 us least of 5137 runs
GC> >   static facet too: 1.846e-06 s mean;          1 us least of 5419 runs
GC> >   without str()   : 1.574e-06 s mean;          1 us least of 6354 runs
GC> > 
GC> > but IMO in both cases the tests running time is too small and I think they
GC> > should be running for longer than they do to have more accurate results
GC> > (but I haven't bothered checking if this is really the case...).
GC> 
GC> There's an argument that lets us force the tests to take longer.
GC> Changing each of the five relevant lines in this fashion:
GC> 
GC> -        << "\n  stream_cast     : " << TimeAnAliquot(f0)
GC> +        << "\n  stream_cast     : " << TimeAnAliquot(f0, 10.0)
GC> 
GC> produces [with the move-semantics patch above]:
GC> 
GC>   stream_cast     : 4.550e-006 s mean;          3 us least of 21977 runs
GC>   minimalistic    : 3.027e-006 s mean;          3 us least of 33033 runs
GC>   static stream   : 2.153e-006 s mean;          2 us least of 46448 runs
GC>   static facet too: 2.505e-006 s mean;          2 us least of 39916 runs
GC>   without str()   : 2.481e-006 s mean;          2 us least of 40303 runs

 This is still not ideal IMHO. I've applied this change instead:
---------------------------------- >8 --------------------------------------
diff --git a/stream_cast_test.cpp b/stream_cast_test.cpp
index d391ed6f6..f6458217a 100644
--- a/stream_cast_test.cpp
+++ b/stream_cast_test.cpp
@@ -117,11 +117,11 @@ To cast_4(From from, To = To())
 void assay_speed()
 {
     static double const e {2.718281828459045};
-    auto f0 = [] {stream_cast<std::string>(e);};
-    auto f1 = [] {cast_1     <std::string>(e);};
-    auto f2 = [] {cast_2     <std::string>(e);};
-    auto f3 = [] {cast_3     <std::string>(e);};
-    auto f4 = [] {cast_4     <std::string>(e);};
+    auto f0 = [] {for (int n = 0; n < 1000; ++n) stream_cast<std::string>(e);};
+    auto f1 = [] {for (int n = 0; n < 1000; ++n) cast_1     <std::string>(e);};
+    auto f2 = [] {for (int n = 0; n < 1000; ++n) cast_2     <std::string>(e);};
+    auto f3 = [] {for (int n = 0; n < 1000; ++n) cast_3     <std::string>(e);};
+    auto f4 = [] {for (int n = 0; n < 1000; ++n) cast_4     <std::string>(e);};
     std::cout
         << "\n  Speed tests..."
         << "\n  stream_cast     : " << TimeAnAliquot(f0)
---------------------------------- >8 --------------------------------------

 Which gives something like this:

  Speed tests...
  stream_cast     : 1.522e-03 s mean;       1517 us least of 100 runs
  minimalistic    : 1.166e-03 s mean;       1157 us least of 100 runs
  static stream   : 8.668e-04 s mean;        858 us least of 100 runs
  static facet too: 8.570e-04 s mean;        850 us least of 100 runs
  without str()   : 7.687e-04 s mean;        757 us least of 100 runs

and looks more informative. Of course, not all digits of the result are
significant, but they're mostly stable.

 Now, for comparison, your version move-constructing a new interpreter
every time yield this:

  Speed tests...
  stream_cast     : 1.643e-03 s mean;       1516 us least of 100 runs
  minimalistic    : 1.242e-03 s mean;       1222 us least of 100 runs
  static stream   : 9.102e-04 s mean;        892 us least of 100 runs
  static facet too: 1.194e-03 s mean;       1164 us least of 100 runs
  without str()   : 1.089e-03 s mean;       1060 us least of 100 runs

i.e. the last 2 lines are indeed considerably slower.

 But my version above gives

  Speed tests...
  stream_cast     : 2.178e-03 s mean;       1589 us least of 100 runs
  minimalistic    : 1.203e-03 s mean;       1180 us least of 100 runs
  static stream   : 8.986e-04 s mean;        890 us least of 100 runs
  static facet too: 7.020e-04 s mean;        697 us least of 100 runs
  without str()   : 5.933e-04 s mean;        585 us least of 100 runs

i.e. is a bit faster than the original one.


GC> > this is the first time you see it, but I like it because it often allows 
me
GC> > to make a local variable const even if it has to be initialized in several
GC> > steps: if it's not modified later, I can just stash these steps into a
GC> > lambda and still initialize the variable in a single statement.
GC> 
GC> What's wrong with writing a simple auxiliary function (like imbued()
GC> in this unit test, though in a namespace and with a less cryptic name)
GC> right above its single point of use?

 It doesn't seem helpful to multiply the number of tiny functions which are
not reusable (because they're only ever meant to be used in this one
particular place to initialize this variable) by design. It also makes the
code much more difficult to understand because you don't see the expression
with which the variable is initialized when you look at it declaration. I
can live with not using IIFE but please, please let's not even consider
using separate functions for systematically initializing local variables,
this would be just horrible -- and the worst thing is that I'd be
responsible for it. If you can't tell from my words above, I feel very
strongly that it would be a spectacularly bad idea.

GC> Having slept on it, however briefly, I awake viewing this as the
GC> opposite of syntactic sugar--this is more like syntactic olestra:
GC>   https://en.wikipedia.org/wiki/Olestra#Side_effects
GC> (Be glad you live in a country that bans it.)

 Yes, maybe this is why I can't fully appreciate why is it so bad. Anyhow,
yes, IIFEs are not nice syntax-wise, but you sacrifice a tiny bit of syntax
for clearly delineating the part of code initializing the variable from the
rest.

GC> Making that function local: slight incremental benefit, but is it
GC> really worth that outlandish syntax? Can we reasonably aspire to
GC> fluency in such a language?

 Yes. Lambda syntax is unusual when you see it first, but you do get used
to it and it's so awfully convenient that you lose one of major advantages
of C++11 if you don't use it. And IIFEs are nothing more than lambdas.

 Today, I write most of my event handlers in wx applications using lambdas
and I also use lambdas heavily inside CallAfter() in multithreaded code
where they're just a godsend for the same reason: you can keep the code
which logically belongs together in the same place, even if different parts
of it are executed in different threads. More generally speaking, lambdas
are too damn useful for any kind of callbacks. And just wait until we get
coroutines in C++20.

GC> Then, once we've figured out the std::move() vs. RVO thing (well, once
GC> you've figured it out and explained it to me), we'll be ready to change
GC> both ledger_format and stream_cast<>().

 I think I did figure it out, but please let me know if you have any
questions. To summarize my findings, not creating an object at all is
faster than creating it using move ctor -- but this is not really
surprising. OTOH we may, perfectly well, use move ctor to initialize an
object created only once, and this is even faster than using a reference,
presumably because it gives the compiler more opportunities for
optimizations because the object is now local instead of being global.

 BTW, in principle, PGO should help with making your original (i.e.
current) code as fast as my version as it would discover that the "global"
variable inside imbued() is not really shared. It might be interesting to
try building lmi with PGO, it's known to result in quite significant
optimizations in many cases.

 Regards,
VZ


reply via email to

[Prev in Thread] Current Thread [Next in Thread]