From: Richard Henderson
Subject: Re: [PATCH] tcg/i386: Check for shorter instruction sequence for ARITH_AND
Date: Mon, 7 Aug 2023 11:57:55 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 8/7/23 07:28, Helge Deller wrote:
The tcg uses tgen_arithi(ARITH_AND) during fast CPU TLB lookups, which e.g. translates to:

  0x7ff5b011556a:  48 81 e6 00 f0 ff ff    andq $0xfffffffffffff000, %rsi

In case the upper 48 bits are all set, the shorter sequence to operate on the lower 16 bits of the target reg (si) can be used, which will then be a 2 bytes shorter instruction sequence:

  0x7f4488097b31:  66 81 e6 00 f0          andw $0xf000, %si

Signed-off-by: Helge Deller <deller@gmx.de>

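For illustration only, here is a minimal C sketch of the condition and the two encodings described in the patch text above. It is not the code from the patch or from QEMU's tcg backend: the helper names (and_mask_fits_16bit_form, emit_andq_rsi, emit_andw_si) are hypothetical, and the encoders are hard-wired to %rsi/%si so that they reproduce the exact byte sequences quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when the upper 48 bits of the AND mask are
 * all set, so only the low 16 bits of the register can change. */
static bool and_mask_fits_16bit_form(uint64_t mask)
{
    return (mask >> 16) == 0xffffffffffffULL;
}

/* Hypothetical encoder for "andq $imm32, %rsi": REX.W 81 /4 id.
 * The immediate is sign-extended from 32 bits, so bits 63..31 of the
 * mask must all be equal. Returns the number of bytes written (7). */
static size_t emit_andq_rsi(uint8_t *out, uint64_t mask)
{
    uint64_t hi = mask >> 31;
    assert(hi == 0 || hi == 0x1ffffffffULL);

    out[0] = 0x48;                    /* REX.W */
    out[1] = 0x81;                    /* group 1, imm32 */
    out[2] = 0xe6;                    /* ModRM: mod=3, /4 (AND), rm=rsi */
    out[3] = (uint8_t)(mask);
    out[4] = (uint8_t)(mask >> 8);
    out[5] = (uint8_t)(mask >> 16);
    out[6] = (uint8_t)(mask >> 24);
    return 7;
}

/* Hypothetical encoder for "andw $imm16, %si": 66 81 /4 iw.
 * Only valid when the upper 48 bits of the mask are all set.
 * Returns the number of bytes written (5). */
static size_t emit_andw_si(uint8_t *out, uint64_t mask)
{
    assert(and_mask_fits_16bit_form(mask));

    out[0] = 0x66;                    /* operand-size override prefix */
    out[1] = 0x81;                    /* group 1, imm16 under the prefix */
    out[2] = 0xe6;                    /* ModRM: mod=3, /4 (AND), rm=si */
    out[3] = (uint8_t)(mask);
    out[4] = (uint8_t)(mask >> 8);
    return 5;
}
```

For the mask 0xfffffffffffff000 the first encoder writes 7 bytes and the second writes 5, which is the 2-byte saving the patch description refers to. The shorter form depends on the 0x66 operand-size prefix combined with a 16-bit immediate.
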
Current Intel optimization guidelines

https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual.html

Section 3.4.2.3, Length Changing Prefixes, suggests that using 16-bit operands slows decode from 1 cycle to 6 cycles.
Section 3.5.2.3, Partial Register Stalls, says that Skylake has fixed the major issues that older microarchitectures had with such stalls, but that these operations have two additional cycles of delay.
So on balance I don't think this is a good tradeoff.

r~