qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements


From: Pranith Kumar
Subject: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
Date: Sat, 25 Mar 2017 12:52:35 -0400

Hello,

With MTTCG code now merged in mainline, I tried to see if we are able to run
x86 SMP guests on ARM64 hosts. For this I tried running a windows XP guest on
a dragonboard 410c which has 1GB RAM. Since x86 has a strong memory model
whereas ARM64 is a weak memory model, I added a patch to generate fence
instructions for every guest memory access. After some minor fixes, I was
successfully able to boot a 4 core guest all the way to the desktop (albeit
with a 1GB backing swap). However the performance is severely
limited and the guest is barely usable. Based on my observations, I think
there are some easily implementable additions we can make to improve the
performance of TCG in general and on ARM64 in particular. I propose to do the
following as part of Google Summer of Code 2017.


* Implement jump-to-register instruction on ARM64 to overcome the 128MB
  translation cache size limit.

  The translation cache size for an ARM64 host is currently limited to 128
  MB. This limitation is imposed by utilizing a branch instruction which
  encodes the jump offset and is limited by the number of bits it can use for
  the range of the offset. The performance impact by this limitation is severe
  and can be observed when you try to run large programs like a browser in the
  guest. The cache is flushed several times before the browser starts and the
  performance is not satisfactory. This limitation can be overcome by
  generating a branch-to-register instruction and utilizing that when the
  destination address is outside the range of what can be encoded in current
  branch instruction.

* Implement an LRU translation block code cache.

  In the current TCG design, when the translation cache fills up, we flush all
  the translated blocks (TBs) to free up space. We can improve this situation
  by not flushing the TBs that were recently used i.e., by implementing an LRU
  policy for freeing the blocks. This should avoid the re-translation overhead
  for frequently used blocks and improve performance.

* Avoid consistency overhead for strong memory model guests by generating
  load-acquire and store-release instructions.

  To run a strongly ordered guest on a weakly ordered host using MTTCG, for
  example, x86 on ARM64, we have to generate fence instructions for all the
  guest memory accesses to ensure consistency. The overhead imposed by these
  fence instructions is significant (almost 3x when compared to a run without
  fence instructions). ARM64 provides load-acquire and store-release
  instructions which are sequentially consistent and can be used instead of
  generating fence instructions. I plan to add support to generate these
  instructions in the TCG run-time to reduce the consistency overhead in
  MTTCG.

Alex Bennée, who mentored me last year, has agreed to mentor me again this
time if the proposal is accepted.

Please let me know if you have any comments or suggestions. Also please let me
know if there are other enhancements that are easily implementable to increase
TCG performance as part of this project or otherwise.

Thanks,
-- 
Pranith



reply via email to

[Prev in Thread] Current Thread [Next in Thread]