From: Richard Henderson
Subject: Re: [PATCH v4 00/10] Optimize buffer_is_zero
Date: Thu, 15 Feb 2024 11:16:53 -1000
User-agent: Mozilla Thunderbird
On 2/14/24 22:57, Alexander Monakov wrote:
> On Wed, 14 Feb 2024, Richard Henderson wrote:
>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>>
>> Changes for v4:
>>   - Keep separate >= 256 entry point, but only keep constant length
>>     check inline. This allows the indirect function call to be hidden
>>     and optimized away when the pointer is constant.
>
> Sorry, I don't understand this. Most of the improvement (at least in our
> testing) comes from inlining the byte checks, which often fail and eliminate
> call overhead entirely. Moving them out-of-line seems to lose most of the
> speedup the patchset was bringing, doesn't it? Is there some concern I am
> not seeing?
What is your benchmarking method?
It was my guess that most of the improvement came from performing those early byte checks *at all*, and that the overhead of a function call to a small out of line wrapper would be negligible.
By not exposing the function pointer outside the bufferiszero translation unit, the compiler can see when the pointer is never modified for a given host, and then transform the indirect branch to a direct branch.
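To make that concrete, here is a minimal C sketch of the idea, not the code from the patch set. The names `sketch_buffer_is_zero`, `is_zero_generic`, and `is_zero_accel` are hypothetical, and the choice of first, middle, and last byte for the early checks is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slow path: a plain byte loop standing in for a
 * host-specific (e.g. SIMD) implementation. */
static bool is_zero_generic(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}

/* File-local pointer (assumed name). Because it is static and nothing in
 * this translation unit assigns to it, the compiler may treat it as a
 * constant and turn the indirect call below into a direct one. */
static bool (*is_zero_accel)(const void *, size_t) = is_zero_generic;

/* Small out-of-line wrapper: a few cheap byte checks first, then the
 * call through the private pointer only if they all pass. */
bool sketch_buffer_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    if (len == 0) {
        return true;
    }
    /* Early checks (assumed positions): first, middle and last byte. */
    if (p[0] | p[len / 2] | p[len - 1]) {
        return false;
    }
    return is_zero_accel(buf, len);
}
```

On a host where some init code does assign to `is_zero_accel`, the call stays indirect. Where nothing does, the compiler is free to emit a direct call or inline the callee, though whether it actually does so depends on the compiler and optimization level.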
r~