[Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
From: Pranith Kumar
Subject: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
Date: Sat, 25 Mar 2017 12:52:35 -0400
Hello,
With the MTTCG code now merged in mainline, I tried to see if we are able to
run x86 SMP guests on ARM64 hosts. For this I tried running a Windows XP guest
on a DragonBoard 410c, which has 1GB of RAM. Since x86 has a strong memory
model whereas ARM64 has a weak memory model, I added a patch to generate fence
instructions for every guest memory access. After some minor fixes, I was
able to boot a 4-core guest all the way to the desktop (albeit with a 1GB
backing swap). However, the performance is severely limited and the guest is
barely usable. Based on my observations, I think there are some easily
implementable additions we can make to improve the performance of TCG in
general and on ARM64 in particular. I propose to do the following as part of
Google Summer of Code 2017.
* Implement a jump-to-register instruction on ARM64 to overcome the 128MB
translation cache size limit.
The translation cache size for an ARM64 host is currently limited to 128MB.
This limit comes from using the direct branch instruction, which encodes the
jump offset in the instruction itself (a signed 26-bit word offset, i.e.
+/-128MB). The performance impact of this limitation is severe and can be
observed when you try to run large programs like a browser in the guest. The
cache is flushed several times before the browser starts and the performance
is not satisfactory. This limitation can be overcome by generating a
branch-to-register instruction and using it when the destination address is
outside the range that can be encoded in the direct branch instruction.
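As a rough illustration of the range check described above, here is a minimal
sketch in plain C. It is not QEMU's TCG backend code: the function names and
the choice of scratch register are hypothetical. It only uses the A64
encodings of B (opcode plus a signed 26-bit word offset) and BR Xn, and it
assumes the caller has already loaded the target address into the scratch
register before the BR is emitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* B <label>: bits [31:26] = 0b000101, bits [25:0] = signed word offset. */
#define A64_B_OPCODE   0x14000000u
/* BR Xn: branch to the address held in register Xn (bits [9:5]). */
#define A64_BR_OPCODE  0xd61f0000u

/* True if byte_offset fits the imm26 field of B, i.e. lies in
 * [-128MB, +128MB) and is a multiple of the 4-byte instruction size. */
static bool a64_b_in_range(int64_t byte_offset)
{
    int64_t words = byte_offset / 4;

    return (byte_offset % 4) == 0
        && words >= -(INT64_C(1) << 25)
        && words < (INT64_C(1) << 25);
}

/* Encode a direct, PC-relative branch. */
static uint32_t a64_encode_b(int64_t byte_offset)
{
    assert(a64_b_in_range(byte_offset));
    return A64_B_OPCODE | ((uint32_t)(byte_offset / 4) & 0x03ffffffu);
}

/* Encode an indirect branch through a general-purpose register. */
static uint32_t a64_encode_br(unsigned reg)
{
    assert(reg < 31); /* register number 31 is not a usable scratch here */
    return A64_BR_OPCODE | ((uint32_t)reg << 5);
}

/* Hypothetical helper: use B when the target is reachable, otherwise
 * fall back to BR via scratch_reg (which must already hold `to`). */
static uint32_t a64_encode_jump(uint64_t from, uint64_t to,
                                unsigned scratch_reg)
{
    int64_t offset = (int64_t)(to - from);

    return a64_b_in_range(offset) ? a64_encode_b(offset)
                                  : a64_encode_br(scratch_reg);
}
```

A real implementation would also have to emit the instructions that load the
64-bit target into the scratch register, and deal with patching such jumps
when blocks are chained; neither is shown here.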
* Implement an LRU translation block code cache.
In the current TCG design, when the translation cache fills up, we flush all
the translated blocks (TBs) to free up space. We can improve this situation
by not flushing the TBs that were recently used, i.e., by implementing an LRU
policy for freeing the blocks. This should avoid the re-translation overhead
for frequently used blocks and improve performance.
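To make the eviction policy concrete, here is a small self-contained sketch
of LRU bookkeeping in C. It is only an illustration of the policy, not a
design for QEMU: the TBSlot/TBCache types, the fixed slot count and the
function names are all made up for this example, and it ignores the fact
that real TBs vary in size and are chained to each other.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TB_CACHE_SLOTS 4        /* tiny on purpose, for illustration */

typedef struct {
    uint64_t guest_pc;          /* key: guest address of the block */
    uint64_t last_used;         /* logical timestamp of the last use */
    bool valid;
} TBSlot;

typedef struct {
    TBSlot slots[TB_CACHE_SLOTS];
    uint64_t clock;             /* incremented on every insert or hit */
} TBCache;

static void tb_cache_init(TBCache *c)
{
    *c = (TBCache){ 0 };
}

/* Return the slot for guest_pc and mark it as recently used,
 * or NULL if the block is not cached. */
static TBSlot *tb_cache_lookup(TBCache *c, uint64_t guest_pc)
{
    for (size_t i = 0; i < TB_CACHE_SLOTS; i++) {
        TBSlot *s = &c->slots[i];
        if (s->valid && s->guest_pc == guest_pc) {
            s->last_used = ++c->clock;
            return s;
        }
    }
    return NULL;
}

/* Insert guest_pc, reusing a free slot if there is one and otherwise
 * evicting the least recently used entry. Returns true and stores the
 * evicted key in *evicted_pc when an eviction happened. */
static bool tb_cache_insert(TBCache *c, uint64_t guest_pc,
                            uint64_t *evicted_pc)
{
    TBSlot *victim = &c->slots[0];
    bool evicted;

    for (size_t i = 0; i < TB_CACHE_SLOTS; i++) {
        TBSlot *s = &c->slots[i];
        if (!s->valid) {
            victim = s;         /* a free slot always wins */
            break;
        }
        if (s->last_used < victim->last_used) {
            victim = s;         /* oldest entry seen so far */
        }
    }

    evicted = victim->valid;
    if (evicted) {
        *evicted_pc = victim->guest_pc;
    }
    victim->guest_pc = guest_pc;
    victim->last_used = ++c->clock;
    victim->valid = true;
    return evicted;
}
```

With four slots, inserting blocks 1 to 4, touching block 1 and then inserting
block 5 evicts block 2 rather than flushing everything. The linear scan is
only for brevity; a real cache would want constant-time victim selection.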
* Avoid consistency overhead for strong memory model guests by generating
load-acquire and store-release instructions.
To run a strongly ordered guest on a weakly ordered host using MTTCG (for
example, x86 on ARM64), we have to generate fence instructions for all the
guest memory accesses to ensure consistency. The overhead imposed by these
fence instructions is significant (almost 3x when compared to a run without
fence instructions). ARM64 provides load-acquire and store-release
instructions, which are sequentially consistent and can be used instead of
generating fence instructions. I plan to add support for generating these
instructions in the TCG run-time to reduce the consistency overhead in
MTTCG.
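The difference between the two approaches can be sketched with C11 atomics.
This is not TCG code and the helper names are hypothetical; it only shows the
two styles of access. On AArch64, compilers typically lower an acquire load
and a release store to the load-acquire and store-release instructions (LDAR
and STLR), and a sequentially consistent fence to a DMB barrier, but the
exact code generated depends on the compiler and target options.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Fence-based variant: a plain (relaxed) load followed by a full
 * barrier, similar in spirit to fencing every guest access. */
static inline uint32_t guest_load32_fenced(_Atomic uint32_t *addr)
{
    uint32_t val = atomic_load_explicit(addr, memory_order_relaxed);

    atomic_thread_fence(memory_order_seq_cst);
    return val;
}

/* Fence-based variant: a full barrier followed by a plain store. */
static inline void guest_store32_fenced(_Atomic uint32_t *addr,
                                        uint32_t val)
{
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(addr, val, memory_order_relaxed);
}

/* Acquire/release variant: the ordering is attached to the access
 * itself, so no separate barrier instruction is requested. */
static inline uint32_t guest_load32_acquire(_Atomic uint32_t *addr)
{
    return atomic_load_explicit(addr, memory_order_acquire);
}

static inline void guest_store32_release(_Atomic uint32_t *addr,
                                         uint32_t val)
{
    atomic_store_explicit(addr, val, memory_order_release);
}
```

Both variants return the same values; what changes is how the ordering
constraint is expressed to the hardware, which is where the overhead
difference would come from.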
Alex Bennée, who mentored me last year, has agreed to mentor me again this
time if the proposal is accepted.
Please let me know if you have any comments or suggestions. Also please let me
know if there are other enhancements that are easily implementable to increase
TCG performance as part of this project or otherwise.
Thanks,
--
Pranith