Re: [Qemu-devel] [PULL 14/15] tb hash: track translated blocks with qht
From: Igor Mammedov
Subject: Re: [Qemu-devel] [PULL 14/15] tb hash: track translated blocks with qht
Date: Wed, 10 Aug 2016 15:36:59 +0200
On Fri, 10 Jun 2016 07:26:52 -0700
Richard Henderson <address@hidden> wrote:
This patch makes QEMU SIGSEGV when debugging a KVM guest with gdb.
Steps to reproduce:
SeaBIOS:
clone and build current master with defconfig plus
CONFIG_RELOCATE_INIT=n on top of it
QEMU:
./configure --enable-debug --target-list=x86_64-softmmu
./x86_64-softmmu/qemu-system-x86_64 -enable-kvm -bios ../seabios/out/bios.bin -s -S
GDB:
gdb ../seabios/out/rom.o
(gdb) b smp_setup
(gdb) target remote localhost:1234
(gdb) c
Continuing.
Remote connection closed
QEMU's backtrace:
Program received signal SIGSEGV, Segmentation fault.
0x0000555555c19c6f in qht_reset_size (ht=0x555556240058 <tcg_ctx+216>,
n_elems=0x8000) at util/qht.c:422
422 if (n_buckets != map->n_buckets) {
(gdb) bt
#0 0x0000555555c19c6f in qht_reset_size (ht=0x555556240058 <tcg_ctx+216>,
n_elems=0x8000) at util/qht.c:422
#1 0x0000555555754bdd in tb_flush (cpu=0x0) at
/home/imammedo/builds/qemu/translate-all.c:855
#2 0x000055555579026e in gdb_vm_state_change (opaque=0x0, running=0x0,
state=RUN_STATE_DEBUG) at /home/imammedo/builds/qemu/gdbstub.c:1276
#3 0x00005555558be59f in vm_state_notify (running=0x0, state=RUN_STATE_DEBUG)
at vl.c:1585
#4 0x000055555578137e in do_vm_stop (state=RUN_STATE_DEBUG) at
/home/imammedo/builds/qemu/cpus.c:743
#5 0x0000555555782b5e in vm_stop (state=RUN_STATE_DEBUG) at
/home/imammedo/builds/qemu/cpus.c:1476
#6 0x00005555558bebdf in main_loop_should_exit () at vl.c:1856
#7 0x00005555558bed57 in main_loop () at vl.c:1912
#8 0x00005555558c678a in main (argc=0x6, argv=0x7fffffffe0c8,
envp=0x7fffffffe100) at vl.c:4603
> From: "Emilio G. Cota" <address@hidden>
>
> Having a fixed-size hash table for keeping track of all translation blocks
> is suboptimal: some workloads are just too big or too small to get maximum
> performance from the hash table. The MRU promotion policy helps improve
> performance when the hash table is a little undersized, but it cannot
> make up for severely undersized hash tables.
>
> Furthermore, frequent MRU promotions result in writes that are a scalability
> bottleneck. For scalability, lookups should only perform reads, not writes.
> This is not a big deal for now, but it will become one once MTTCG matures.
>
> The appended patch fixes these issues by using qht as the implementation of
> the TB hash table. This solution is superior to other alternatives considered,
> namely:
>
> - master: implementation in QEMU before this patchset
> - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
> - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
> MRU is implemented here by adding an intermediate struct
> that contains the u32 hash and a pointer to the TB; this
> allows us, on an MRU promotion, to copy said struct (that is not
> at the head), and put this new copy at the head. After a grace
> period, the original non-head struct can be eliminated, and
> after another grace period, freed.
> - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
> no MRU for lookups; MRU for inserts.
> The appended solution is the following:
> - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
> no MRU for lookups; MRU for inserts.
>
> The plots below compare the considered solutions. The Y axis shows the
> boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
> sweeps the number of buckets (or initial number of buckets for
> qht-autoresize).
> The plots in PNG format (and with errorbars) can be seen here:
> http://imgur.com/a/Awgnq
>
> Each test runs 5 times, and the entire QEMU process is pinned to a
> single core for repeatability of results.
>
> Host: Intel Xeon E5-2690
>
> 28 ++------------+-------------+-------------+-------------+------------++
> A***** + + + master **A*** +
> 27 ++ * xxhash ##B###++
> | A******A****** xxhash-rcu $$C$$$ |
> 26 C$$ A******A****** qht-fixed-nomru*%%D%%%++
> D%%$$ A******A******A*qht-dyn-mru A*E****A
> 25 ++ %%$$ qht-dyn-nomru &&F&&&++
> B#####% |
> 24 ++ #C$$$$$ ++
> | B### $ |
> | ## C$$$$$$ |
> 23 ++ # C$$$$$$ ++
> | B###### C$$$$$$ %%%D
> 22 ++ %B###### C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
> | D%%%%%%B###### @E@@@@@@ %%%D%%%@@@E@@@@@@E
> 21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
> + E@@@ F&&& + E@ + F&&& + +
> 20 ++------------+-------------+-------------+-------------+------------++
> 14 16 18 20 22 24
> log2 number of buckets
>
> Host: Intel i7-4790K
>
> 14.5 ++------------+------------+-------------+------------+------------++
> A** + + + master **A*** +
> 14 ++ ** xxhash ##B###++
> 13.5 ++ ** xxhash-rcu $$C$$$++
> | qht-fixed-nomru %%D%%% |
> 13 ++ A****** qht-dyn-mru @@E@@@++
> | A*****A******A****** qht-dyn-nomru &&F&&& |
> 12.5 C$$ A******A******A*****A****** ***A
> 12 ++ $$ A*** ++
> D%%% $$ |
> 11.5 ++ %% ++
> B### %C$$$$$$ |
> 11 ++ ## D%%%%% C$$$$$ ++
> | # % C$$$$$$ |
> 10.5 F&&&&&&B######D%%%%% C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$ $$$C
> 10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
> + F&& D%%%%%%B######B######B#####B###@@@D%%% +
> 9.5 ++------------+------------+-------------+------------+------------++
> 14 16 18 20 22 24
> log2 number of buckets
>
> Note that the original point before this patch series is X=15 for "master";
> the little sensitivity to the increased number of buckets is due to the
> poor hashing function in master.
>
> xxhash-rcu has significant overhead due to the constant churn of allocating
> and deallocating intermediate structs for implementing MRU. An alternative
> would be to consider failed lookups as "maybe not there", and then
> acquire the external lock (tb_lock in this case) to really confirm that
> there was indeed a failed lookup. This, however, would not be enough
> to implement dynamic resizing--this is more complex: see
> "Resizable, Scalable, Concurrent Hash Tables via Relativistic
> Programming" by Triplett, McKenney and Walpole. This solution was
> discarded due to the very coarse RCU read critical sections that we have
> in MTTCG; resizing requires waiting for readers after every pointer update,
> and resizes require many pointer updates, so this would quickly become
> prohibitive.
>
> qht-fixed-nomru shows that MRU promotion is advisable for undersized
> hash tables.
>
> However, qht-dyn-mru shows that MRU promotion is not important if the
> hash table is properly sized: there is virtually no difference in
> performance between qht-dyn-nomru and qht-dyn-mru.
>
> Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
> X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
> can achieve with optimum sizing of the hash table, while keeping the hash
> table scalable for readers.
>
> The improvement we get before and after this patch for booting debian jessie
> with arm-softmmu is:
>
> - Intel Xeon E5-2690: 10.5% less time
> - Intel i7-4790K: 5.2% less time
>
> We could get this same improvement _for this particular workload_ by
> statically increasing the size of the hash table. But this would hurt
> workloads that do not need a large hash table. The dynamic (upward)
> resizing allows us to start small and enlarge the hash table as needed.
>
> A quick note on downsizing: the table is resized back to 2**15 buckets
> on every tb_flush; this makes sense because it is not guaranteed that the
> table will reach the same number of TBs later on (e.g. most bootup code is
> thrown away after boot); it makes sense to grow the hash table as
> more code blocks are translated. This also avoids the complication of
> having to build downsizing hysteresis logic into qht.
>
> Reviewed-by: Sergey Fedorov <address@hidden>
> Reviewed-by: Alex Bennée <address@hidden>
> Reviewed-by: Richard Henderson <address@hidden>
> Signed-off-by: Emilio G. Cota <address@hidden>
> Message-Id: <address@hidden>
> Signed-off-by: Richard Henderson <address@hidden>
> ---
> cpu-exec.c | 86 +++++++++++++++++++++++------------------------
> include/exec/exec-all.h | 2 --
> include/exec/tb-context.h | 7 ++--
> include/exec/tb-hash.h | 3 +-
> translate-all.c | 85 +++++++++++++++++++++-------------------------
> 5 files changed, 86 insertions(+), 97 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index b9e294c..b840e1d 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -225,57 +225,57 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
> }
> #endif
>
> +struct tb_desc {
> + target_ulong pc;
> + target_ulong cs_base;
> + CPUArchState *env;
> + tb_page_addr_t phys_page1;
> + uint32_t flags;
> +};
> +
> +static bool tb_cmp(const void *p, const void *d)
> +{
> + const TranslationBlock *tb = p;
> + const struct tb_desc *desc = d;
> +
> + if (tb->pc == desc->pc &&
> + tb->page_addr[0] == desc->phys_page1 &&
> + tb->cs_base == desc->cs_base &&
> + tb->flags == desc->flags) {
> + /* check next page if needed */
> + if (tb->page_addr[1] == -1) {
> + return true;
> + } else {
> + tb_page_addr_t phys_page2;
> + target_ulong virt_page2;
> +
> + virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
> + phys_page2 = get_page_addr_code(desc->env, virt_page2);
> + if (tb->page_addr[1] == phys_page2) {
> + return true;
> + }
> + }
> + }
> + return false;
> +}
> +
> static TranslationBlock *tb_find_physical(CPUState *cpu,
> target_ulong pc,
> target_ulong cs_base,
> uint32_t flags)
> {
> - CPUArchState *env = (CPUArchState *)cpu->env_ptr;
> - TranslationBlock *tb, **tb_hash_head, **ptb1;
> + tb_page_addr_t phys_pc;
> + struct tb_desc desc;
> uint32_t h;
> - tb_page_addr_t phys_pc, phys_page1;
>
> - /* find translated block using physical mappings */
> - phys_pc = get_page_addr_code(env, pc);
> - phys_page1 = phys_pc & TARGET_PAGE_MASK;
> + desc.env = (CPUArchState *)cpu->env_ptr;
> + desc.cs_base = cs_base;
> + desc.flags = flags;
> + desc.pc = pc;
> + phys_pc = get_page_addr_code(desc.env, pc);
> + desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
> h = tb_hash_func(phys_pc, pc, flags);
> -
> - /* Start at head of the hash entry */
> - ptb1 = tb_hash_head = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> - tb = *ptb1;
> -
> - while (tb) {
> - if (tb->pc == pc &&
> - tb->page_addr[0] == phys_page1 &&
> - tb->cs_base == cs_base &&
> - tb->flags == flags) {
> -
> - if (tb->page_addr[1] == -1) {
> - /* done, we have a match */
> - break;
> - } else {
> - /* check next page if needed */
> - target_ulong virt_page2 = (pc & TARGET_PAGE_MASK) +
> - TARGET_PAGE_SIZE;
> - tb_page_addr_t phys_page2 = get_page_addr_code(env,
> virt_page2);
> -
> - if (tb->page_addr[1] == phys_page2) {
> - break;
> - }
> - }
> - }
> -
> - ptb1 = &tb->phys_hash_next;
> - tb = *ptb1;
> - }
> -
> - if (tb) {
> - /* Move the TB to the head of the list */
> - *ptb1 = tb->phys_hash_next;
> - tb->phys_hash_next = *tb_hash_head;
> - *tb_hash_head = tb;
> - }
> - return tb;
> + return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
> }
>
> static TranslationBlock *tb_find_slow(CPUState *cpu,
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index e076397..c1f59fa 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -215,8 +215,6 @@ struct TranslationBlock {
>
> void *tc_ptr; /* pointer to the translated code */
> uint8_t *tc_search; /* pointer to search data */
> - /* next matching tb for physical address. */
> - struct TranslationBlock *phys_hash_next;
> /* original tb when cflags has CF_NOCACHE */
> struct TranslationBlock *orig_tb;
> /* first and second physical page containing code. The lower bit
> diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
> index 5efe3d9..e209c1c 100644
> --- a/include/exec/tb-context.h
> +++ b/include/exec/tb-context.h
> @@ -21,9 +21,10 @@
> #define QEMU_TB_CONTEXT_H_
>
> #include "qemu/thread.h"
> +#include "qemu/qht.h"
>
> -#define CODE_GEN_PHYS_HASH_BITS 15
> -#define CODE_GEN_PHYS_HASH_SIZE (1 << CODE_GEN_PHYS_HASH_BITS)
> +#define CODE_GEN_HTABLE_BITS 15
> +#define CODE_GEN_HTABLE_SIZE (1 << CODE_GEN_HTABLE_BITS)
>
> typedef struct TranslationBlock TranslationBlock;
> typedef struct TBContext TBContext;
> @@ -31,7 +32,7 @@ typedef struct TBContext TBContext;
> struct TBContext {
>
> TranslationBlock *tbs;
> - TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
> + struct qht htable;
> int nb_tbs;
> /* any access to the tbs or the page table must use this lock */
> QemuMutex tb_lock;
> diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
> index 88ccfd1..1d0200b 100644
> --- a/include/exec/tb-hash.h
> +++ b/include/exec/tb-hash.h
> @@ -20,7 +20,6 @@
> #ifndef EXEC_TB_HASH
> #define EXEC_TB_HASH
>
> -#include "exec/exec-all.h"
> #include "exec/tb-hash-xx.h"
>
> /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
> @@ -49,7 +48,7 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
> static inline
> uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags)
> {
> - return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
> + return tb_hash_func5(phys_pc, pc, flags);
> }
>
> #endif
> diff --git a/translate-all.c b/translate-all.c
> index d75737c..b620fcc 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -735,6 +735,13 @@ static inline void code_gen_alloc(size_t tb_size)
> qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
> }
>
> +static void tb_htable_init(void)
> +{
> + unsigned int mode = QHT_MODE_AUTO_RESIZE;
> +
> + qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
> +}
> +
> /* Must be called before using the QEMU cpus. 'tb_size' is the size
> (in bytes) allocated to the translation buffer. Zero means default
> size. */
> @@ -742,6 +749,7 @@ void tcg_exec_init(unsigned long tb_size)
> {
> cpu_gen_init();
> page_init();
> + tb_htable_init();
> code_gen_alloc(tb_size);
> #if defined(CONFIG_SOFTMMU)
> /* There's no guest base to take into account, so go ahead and
> @@ -846,7 +854,7 @@ void tb_flush(CPUState *cpu)
> cpu->tb_flushed = true;
> }
>
> - memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
> + qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
> page_flush_tb();
>
> tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
> @@ -857,60 +865,46 @@ void tb_flush(CPUState *cpu)
>
> #ifdef DEBUG_TB_CHECK
>
> -static void tb_invalidate_check(target_ulong address)
> +static void
> +do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
> {
> - TranslationBlock *tb;
> - int i;
> + TranslationBlock *tb = p;
> + target_ulong addr = *(target_ulong *)userp;
>
> - address &= TARGET_PAGE_MASK;
> - for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> - for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> - tb = tb->phys_hash_next) {
> - if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
> - address >= tb->pc + tb->size)) {
> - printf("ERROR invalidate: address=" TARGET_FMT_lx
> - " PC=%08lx size=%04x\n",
> - address, (long)tb->pc, tb->size);
> - }
> - }
> + if (!(addr + TARGET_PAGE_SIZE <= tb->pc || addr >= tb->pc + tb->size)) {
> + printf("ERROR invalidate: address=" TARGET_FMT_lx
> + " PC=%08lx size=%04x\n", addr, (long)tb->pc, tb->size);
> }
> }
>
> -/* verify that all the pages have correct rights for code */
> -static void tb_page_check(void)
> +static void tb_invalidate_check(target_ulong address)
> {
> - TranslationBlock *tb;
> - int i, flags1, flags2;
> -
> - for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> - for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> - tb = tb->phys_hash_next) {
> - flags1 = page_get_flags(tb->pc);
> - flags2 = page_get_flags(tb->pc + tb->size - 1);
> - if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> - printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> - (long)tb->pc, tb->size, flags1, flags2);
> - }
> - }
> - }
> + address &= TARGET_PAGE_MASK;
> + qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
> }
>
> -#endif
> -
> -static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
> +static void
> +do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
> {
> - TranslationBlock *tb1;
> + TranslationBlock *tb = p;
> + int flags1, flags2;
>
> - for (;;) {
> - tb1 = *ptb;
> - if (tb1 == tb) {
> - *ptb = tb1->phys_hash_next;
> - break;
> - }
> - ptb = &tb1->phys_hash_next;
> + flags1 = page_get_flags(tb->pc);
> + flags2 = page_get_flags(tb->pc + tb->size - 1);
> + if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> + printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> + (long)tb->pc, tb->size, flags1, flags2);
> }
> }
>
> +/* verify that all the pages have correct rights for code */
> +static void tb_page_check(void)
> +{
> + qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
> +}
> +
> +#endif
> +
> static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
> {
> TranslationBlock *tb1;
> @@ -998,7 +992,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
> /* remove the TB from the hash list */
> phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
> h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> - tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
> + qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
>
> /* remove the TB from the page list */
> if (tb->page_addr[0] != page_addr) {
> @@ -1128,13 +1122,10 @@ static void tb_link_page(TranslationBlock *tb,
> tb_page_addr_t phys_pc,
> tb_page_addr_t phys_page2)
> {
> uint32_t h;
> - TranslationBlock **ptb;
>
> /* add in the hash table */
> h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> - ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> - tb->phys_hash_next = *ptb;
> - *ptb = tb;
> + qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
>
> /* add in the page list */
> tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);