From: Emilio G. Cota
Subject: [Qemu-devel] ideas for improving TLB performance (help with TCG backend wanted)
Date: Wed, 19 Sep 2018 13:54:23 -0400
User-agent: Mutt/1.9.4 (2018-02-28)

I've been thinking about ways to increase softmmu performance
by speeding up TLB accesses.

Last year, Pranith proposed to increase the size of the TLBs:
  https://patchwork.kernel.org/patch/9927793/
The problem with that approach is that it slows down flushes
significantly, since they have to memset(-1) large amounts
of memory. And flushes can be very frequent, e.g. during
bootup.
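
To make the flush cost concrete, here is a minimal sketch of a direct-mapped software TLB whose flush is a memset(-1) over the whole table. The names (soft_tlb, tlb_entry, tlb_flush, and so on) are hypothetical and the entry layout is deliberately simplified; this is not QEMU's actual code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BITS 12
#define PAGE_MASK (~(((uint64_t)1 << PAGE_BITS) - 1))

/* Hypothetical, simplified entry: one tag plus a host addend. */
typedef struct {
    uint64_t addr;   /* page-aligned guest address, all-ones when invalid */
    uint64_t addend; /* host address minus guest address */
} tlb_entry;

typedef struct {
    tlb_entry *table;
    size_t n_entries; /* must be a power of two */
} soft_tlb;

/* Flush cost grows linearly with the table size. */
static void tlb_flush(soft_tlb *tlb)
{
    memset(tlb->table, -1, tlb->n_entries * sizeof(tlb_entry));
}

static int tlb_init(soft_tlb *tlb, size_t n_entries)
{
    tlb->table = malloc(n_entries * sizeof(tlb_entry));
    if (!tlb->table) {
        return -1;
    }
    tlb->n_entries = n_entries;
    tlb_flush(tlb);
    return 0;
}

static tlb_entry *tlb_slot(soft_tlb *tlb, uint64_t vaddr)
{
    return &tlb->table[(vaddr >> PAGE_BITS) & (tlb->n_entries - 1)];
}

/* An all-ones tag can never equal a page-aligned address, so it never hits. */
static int tlb_hit(soft_tlb *tlb, uint64_t vaddr)
{
    return tlb_slot(tlb, vaddr)->addr == (vaddr & PAGE_MASK);
}

static void tlb_fill(soft_tlb *tlb, uint64_t vaddr, uint64_t addend)
{
    tlb_entry *e = tlb_slot(tlb, vaddr);
    e->addr = vaddr & PAGE_MASK;
    e->addend = addend;
}
```

With 16-byte entries as above, going from 256 to 64K entries takes each flush from writing 4 KiB to writing 1 MiB.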

This paper quantifies this issue (with SPEC06 but also a "kernel
boot" workload), and proposes a way to avoid it:

  "Optimizing Memory Translation Emulation in Full System Emulators"
  Xin Tong, Toshihiko Koju, and Motohiro Kawahito
  https://dl.acm.org/citation.cfm?id=2686034
  The ACM version is behind a paywall; this other one is not:
  http://domino.research.ibm.com/library/cyberdig.nsf/papers/9F3255F2937BC44885257C750004B9F7/$File/RT0956.pdf

The idea is to allocate a new TLB on a flush, thereby
removing the need for memset at flush time (the paper assumes
that the allocation+memset has previously been done, possibly in
another thread).

I like the idea of allocating a new TLB, since:

- This will work with MTTCG; we'd reclaim the old array with RCU,
  which is OK because CPUs always execute under an RCU critical section.

- The lookup "fast path" would take a hit due to executing an extra
  instruction, but, as the paper shows, the corresponding impact is
  very small compared to the benefits of having a larger TLB.
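
A rough, single-threaded sketch of the "swap in a new table" flush follows. All names (swap_tlb, swap_tlb_prepare_spare, etc.) are made up for illustration. The spare table is assumed to be prepared off the critical path, and the free() of the old table stands in for what would be deferred RCU reclamation under MTTCG; neither the background thread nor RCU is modelled here.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BITS 12
#define PAGE_MASK (~(((uint64_t)1 << PAGE_BITS) - 1))

typedef struct {
    uint64_t addr;   /* page-aligned guest address, all-ones when invalid */
    uint64_t addend;
} tlb_entry;

typedef struct {
    tlb_entry *table; /* lookups load this pointer: the extra instruction */
    tlb_entry *spare; /* pre-initialized table, or NULL if none is ready */
    size_t n_entries; /* must be a power of two */
} swap_tlb;

static tlb_entry *tlb_table_new(size_t n_entries)
{
    tlb_entry *t = malloc(n_entries * sizeof(*t));
    if (t) {
        memset(t, -1, n_entries * sizeof(*t));
    }
    return t;
}

static int swap_tlb_init(swap_tlb *tlb, size_t n_entries)
{
    tlb->table = tlb_table_new(n_entries);
    tlb->spare = NULL;
    tlb->n_entries = n_entries;
    return tlb->table ? 0 : -1;
}

/* Meant to run off the critical path (e.g. in another thread). */
static void swap_tlb_prepare_spare(swap_tlb *tlb)
{
    if (!tlb->spare) {
        tlb->spare = tlb_table_new(tlb->n_entries);
    }
}

static void swap_tlb_flush(swap_tlb *tlb)
{
    tlb_entry *old;

    if (!tlb->spare) {
        /* No spare ready: fall back to clearing in place. */
        memset(tlb->table, -1, tlb->n_entries * sizeof(tlb_entry));
        return;
    }
    old = tlb->table;
    tlb->table = tlb->spare; /* the flush itself is a pointer swap */
    tlb->spare = NULL;
    free(old); /* with MTTCG this would be deferred until readers are done */
}

static tlb_entry *swap_tlb_slot(swap_tlb *tlb, uint64_t vaddr)
{
    return &tlb->table[(vaddr >> PAGE_BITS) & (tlb->n_entries - 1)];
}

static int swap_tlb_hit(swap_tlb *tlb, uint64_t vaddr)
{
    return swap_tlb_slot(tlb, vaddr)->addr == (vaddr & PAGE_MASK);
}

static void swap_tlb_fill(swap_tlb *tlb, uint64_t vaddr, uint64_t addend)
{
    tlb_entry *e = swap_tlb_slot(tlb, vaddr);
    e->addr = vaddr & PAGE_MASK;
    e->addend = addend;
}

static void swap_tlb_destroy(swap_tlb *tlb)
{
    free(tlb->table);
    free(tlb->spare);
    tlb->table = tlb->spare = NULL;
}
```

The memset still happens in tlb_table_new(); the point of the sketch is only that it moves out of the flush itself.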

An additional improvement that I have thought of is to get rid
of memset(-1) altogether. Instead, we'd store addresses in the TLB
as $real_address+1, so that 0xff..ff is stored as 0x00..00. That way,
instead of malloc+memset we'd just calloc a new TLB, which
should be much faster since we'd most likely get zeroed pages
from mmap. The cost would be an additional instruction in the fast
path to subtract 1 from the address in the TLB, but this extra
instruction would be essentially free on modern CPUs.
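
The +1 encoding can be sketched in C as follows. The helper names (tlb_encode, tlb_decode, z_tlb_*) and the simplified entry layout are assumptions for illustration only; the real fast path is emitted by the TCG backend, not written in C.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS 12
#define PAGE_MASK (~(((uint64_t)1 << PAGE_BITS) - 1))

typedef struct {
    uint64_t addr_plus1; /* real tag + 1, so a zeroed entry means "invalid" */
    uint64_t addend;
} z_tlb_entry;

/* Unsigned arithmetic wraps, so all-ones encodes to 0 and 0 decodes to all-ones. */
static inline uint64_t tlb_encode(uint64_t addr)
{
    return addr + 1;
}

static inline uint64_t tlb_decode(uint64_t stored)
{
    return stored - 1;
}

/* A fresh table needs no memset: calloc hands back zeroed memory. */
static z_tlb_entry *z_tlb_new(size_t n_entries)
{
    return calloc(n_entries, sizeof(z_tlb_entry));
}

static z_tlb_entry *z_tlb_slot(z_tlb_entry *table, size_t n_entries,
                               uint64_t vaddr)
{
    return &table[(vaddr >> PAGE_BITS) & (n_entries - 1)];
}

/* The subtraction in tlb_decode() is the extra fast-path instruction. */
static int z_tlb_hit(z_tlb_entry *table, size_t n_entries, uint64_t vaddr)
{
    z_tlb_entry *e = z_tlb_slot(table, n_entries, vaddr);
    return tlb_decode(e->addr_plus1) == (vaddr & PAGE_MASK);
}

static void z_tlb_fill(z_tlb_entry *table, size_t n_entries,
                       uint64_t vaddr, uint64_t addend)
{
    z_tlb_entry *e = z_tlb_slot(table, n_entries, vaddr);
    e->addr_plus1 = tlb_encode(vaddr & PAGE_MASK);
    e->addend = addend;
}
```

Note that guest page 0 is the case that makes the encoding necessary: without the +1, a zeroed entry would look like a valid mapping for address 0.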

I have looked into implementing this approach but it would take me
a long time to get proficient enough to generate the code I want from
the i386 TCG backend.

If someone could help with that, I could take care of the rest, i.e.
changes to C code and measuring the perf impact. If we got good
results, we could then look into implementing this for all TCG
backends.

BTW the paper also has other interesting ideas, for example
"uninlining" TLB lookups, which they claim increases performance
by 6%. I also looked into this but I fail to see how this could
ever be maintainable, since we'd have to generate many
subroutines, one for each combination of generation-time
parameters that tcg_out_tlb_load takes.

Thanks,

                Emilio



