Re: [Qemu-devel] [PULL 14/15] tb hash: track translated blocks with qht
From: Igor Mammedov
Subject: Re: [Qemu-devel] [PULL 14/15] tb hash: track translated blocks with qht
Date: Wed, 10 Aug 2016 15:36:59 +0200
On Fri, 10 Jun 2016 07:26:52 -0700
Richard Henderson <address@hidden> wrote:
This patch makes QEMU SIGSEGV when debugging a KVM guest with gdb.
Steps to reproduce:
SeaBIOS:
clone and build current master with defconfig plus
CONFIG_RELOCATE_INIT=n on top of it
QEMU:
./configure --enable-debug --target-list=x86_64-softmmu
./x86_64-softmmu/qemu-system-x86_64 -enable-kvm -bios ../seabios/out/bios.bin -s -S
GDB:
gdb ../seabios/out/rom.o
(gdb) b smp_setup
(gdb) target remote localhost:1234
(gdb) c
Continuing.
Remote connection closed
QEMU's backtrace:
Program received signal SIGSEGV, Segmentation fault.
0x0000555555c19c6f in qht_reset_size (ht=0x555556240058 <tcg_ctx+216>,
n_elems=0x8000) at util/qht.c:422
422 if (n_buckets != map->n_buckets) {
(gdb) bt
#0 0x0000555555c19c6f in qht_reset_size (ht=0x555556240058 <tcg_ctx+216>,
n_elems=0x8000) at util/qht.c:422
#1 0x0000555555754bdd in tb_flush (cpu=0x0) at
/home/imammedo/builds/qemu/translate-all.c:855
#2 0x000055555579026e in gdb_vm_state_change (opaque=0x0, running=0x0,
state=RUN_STATE_DEBUG) at /home/imammedo/builds/qemu/gdbstub.c:1276
#3 0x00005555558be59f in vm_state_notify (running=0x0, state=RUN_STATE_DEBUG)
at vl.c:1585
#4 0x000055555578137e in do_vm_stop (state=RUN_STATE_DEBUG) at
/home/imammedo/builds/qemu/cpus.c:743
#5 0x0000555555782b5e in vm_stop (state=RUN_STATE_DEBUG) at
/home/imammedo/builds/qemu/cpus.c:1476
#6 0x00005555558bebdf in main_loop_should_exit () at vl.c:1856
#7 0x00005555558bed57 in main_loop () at vl.c:1912
#8 0x00005555558c678a in main (argc=0x6, argv=0x7fffffffe0c8,
envp=0x7fffffffe100) at vl.c:4603
> From: "Emilio G. Cota" <address@hidden>
>
> Having a fixed-size hash table for keeping track of all translation blocks
> is suboptimal: some workloads are just too big or too small to get maximum
> performance from the hash table. The MRU promotion policy helps improve
> performance when the hash table is a little undersized, but it cannot
> make up for severely undersized hash tables.
>
> Furthermore, frequent MRU promotions result in writes that are a scalability
> bottleneck. For scalability, lookups should only perform reads, not writes.
> This is not a big deal for now, but it will become one once MTTCG matures.
>
> The appended patch fixes these issues by using qht as the implementation of
> the TB hash table. This solution is superior to other alternatives considered,
> namely:
>
> - master: implementation in QEMU before this patchset
> - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
> - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
> MRU is implemented here by adding an intermediate struct
> that contains the u32 hash and a pointer to the TB; this
> allows us, on an MRU promotion, to copy said struct (that is not
> at the head), and put this new copy at the head. After a grace
> period, the original non-head struct can be eliminated, and
> after another grace period, freed.
> - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
> no MRU for lookups; MRU for inserts.
> The appended solution is the following:
> - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
> no MRU for lookups; MRU for inserts.
>
> The plots below compare the considered solutions. The Y axis shows the
> boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
> sweeps the number of buckets (or initial number of buckets for
> qht-autoresize).
> The plots in PNG format (and with errorbars) can be seen here:
> http://imgur.com/a/Awgnq
>
> Each test runs 5 times, and the entire QEMU process is pinned to a
> single core for repeatability of results.
>
> Host: Intel Xeon E5-2690
>
> 28 ++------------+-------------+-------------+-------------+------------++
> A***** + + + master **A*** +
> 27 ++ * xxhash ##B###++
> | A******A****** xxhash-rcu $$C$$$ |
> 26 C$$ A******A****** qht-fixed-nomru*%%D%%%++
> D%%$$ A******A******A*qht-dyn-mru A*E****A
> 25 ++ %%$$ qht-dyn-nomru &&F&&&++
> B#####% |
> 24 ++ #C$$$$$ ++
> | B### $ |
> | ## C$$$$$$ |
> 23 ++ # C$$$$$$ ++
> | B###### C$$$$$$ %%%D
> 22 ++ %B###### C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
> | D%%%%%%B###### @E@@@@@@ %%%D%%%@@@E@@@@@@E
> 21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
> + E@@@ F&&& + E@ + F&&& + +
> 20 ++------------+-------------+-------------+-------------+------------++
> 14 16 18 20 22 24
> log2 number of buckets
>
> Host: Intel i7-4790K
>
> 14.5 ++------------+------------+-------------+------------+------------++
> A** + + + master **A*** +
> 14 ++ ** xxhash ##B###++
> 13.5 ++ ** xxhash-rcu $$C$$$++
> | qht-fixed-nomru %%D%%% |
> 13 ++ A****** qht-dyn-mru @@E@@@++
> | A*****A******A****** qht-dyn-nomru &&F&&& |
> 12.5 C$$ A******A******A*****A****** ***A
> 12 ++ $$ A*** ++
> D%%% $$ |
> 11.5 ++ %% ++
> B### %C$$$$$$ |
> 11 ++ ## D%%%%% C$$$$$ ++
> | # % C$$$$$$ |
> 10.5 F&&&&&&B######D%%%%% C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$ $$$C
> 10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
> + F&& D%%%%%%B######B######B#####B###@@@D%%% +
> 9.5 ++------------+------------+-------------+------------+------------++
> 14 16 18 20 22 24
> log2 number of buckets
>
> Note that the original point before this patch series is X=15 for "master";
> the little sensitivity to the increased number of buckets is due to the
> poor hashing function in master.
>
> xxhash-rcu has significant overhead due to the constant churn of allocating
> and deallocating intermediate structs for implementing MRU. An alternative
> would be to consider failed lookups as "maybe not there", and then
> acquire the external lock (tb_lock in this case) to really confirm that
> there was indeed a failed lookup. This, however, would not be enough
> to implement dynamic resizing--this is more complex: see
> "Resizable, Scalable, Concurrent Hash Tables via Relativistic
> Programming" by Triplett, McKenney and Walpole. This solution was
> discarded due to the very coarse RCU read critical sections that we have
> in MTTCG; resizing requires waiting for readers after every pointer update,
> and resizes require many pointer updates, so this would quickly become
> prohibitive.
>
> qht-fixed-nomru shows that MRU promotion is advisable for undersized
> hash tables.
>
> However, qht-dyn-mru shows that MRU promotion is not important if the
> hash table is properly sized: there is virtually no difference in
> performance between qht-dyn-nomru and qht-dyn-mru.
>
> Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
> X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
> can achieve with optimum sizing of the hash table, while keeping the hash
> table scalable for readers.
>
> The improvement we get before and after this patch for booting debian jessie
> with arm-softmmu is:
>
> - Intel Xeon E5-2690: 10.5% less time
> - Intel i7-4790K: 5.2% less time
>
> We could get this same improvement _for this particular workload_ by
> statically increasing the size of the hash table. But this would hurt
> workloads that do not need a large hash table. The dynamic (upward)
> resizing allows us to start small and enlarge the hash table as needed.
>
> A quick note on downsizing: the table is resized back to 2**15 buckets
> on every tb_flush; this makes sense because it is not guaranteed that the
> table will reach the same number of TBs later on (e.g. most bootup code is
> thrown away after boot); it makes sense to grow the hash table as
> more code blocks are translated. This also avoids the complication of
> having to build downsizing hysteresis logic into qht.
>
> Reviewed-by: Sergey Fedorov <address@hidden>
> Reviewed-by: Alex Bennée <address@hidden>
> Reviewed-by: Richard Henderson <address@hidden>
> Signed-off-by: Emilio G. Cota <address@hidden>
> Message-Id: <address@hidden>
> Signed-off-by: Richard Henderson <address@hidden>
> ---
> cpu-exec.c | 86 +++++++++++++++++++++++------------------------
> include/exec/exec-all.h | 2 --
> include/exec/tb-context.h | 7 ++--
> include/exec/tb-hash.h | 3 +-
> translate-all.c | 85 +++++++++++++++++++++-------------------------
> 5 files changed, 86 insertions(+), 97 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index b9e294c..b840e1d 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -225,57 +225,57 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
> }
> #endif
>
> +struct tb_desc {
> + target_ulong pc;
> + target_ulong cs_base;
> + CPUArchState *env;
> + tb_page_addr_t phys_page1;
> + uint32_t flags;
> +};
> +
> +static bool tb_cmp(const void *p, const void *d)
> +{
> + const TranslationBlock *tb = p;
> + const struct tb_desc *desc = d;
> +
> + if (tb->pc == desc->pc &&
> + tb->page_addr[0] == desc->phys_page1 &&
> + tb->cs_base == desc->cs_base &&
> + tb->flags == desc->flags) {
> + /* check next page if needed */
> + if (tb->page_addr[1] == -1) {
> + return true;
> + } else {
> + tb_page_addr_t phys_page2;
> + target_ulong virt_page2;
> +
> + virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
> + phys_page2 = get_page_addr_code(desc->env, virt_page2);
> + if (tb->page_addr[1] == phys_page2) {
> + return true;
> + }
> + }
> + }
> + return false;
> +}
> +
> static TranslationBlock *tb_find_physical(CPUState *cpu,
> target_ulong pc,
> target_ulong cs_base,
> uint32_t flags)
> {
> - CPUArchState *env = (CPUArchState *)cpu->env_ptr;
> - TranslationBlock *tb, **tb_hash_head, **ptb1;
> + tb_page_addr_t phys_pc;
> + struct tb_desc desc;
> uint32_t h;
> - tb_page_addr_t phys_pc, phys_page1;
>
> - /* find translated block using physical mappings */
> - phys_pc = get_page_addr_code(env, pc);
> - phys_page1 = phys_pc & TARGET_PAGE_MASK;
> + desc.env = (CPUArchState *)cpu->env_ptr;
> + desc.cs_base = cs_base;
> + desc.flags = flags;
> + desc.pc = pc;
> + phys_pc = get_page_addr_code(desc.env, pc);
> + desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
> h = tb_hash_func(phys_pc, pc, flags);
> -
> - /* Start at head of the hash entry */
> - ptb1 = tb_hash_head = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> - tb = *ptb1;
> -
> - while (tb) {
> - if (tb->pc == pc &&
> - tb->page_addr[0] == phys_page1 &&
> - tb->cs_base == cs_base &&
> - tb->flags == flags) {
> -
> - if (tb->page_addr[1] == -1) {
> - /* done, we have a match */
> - break;
> - } else {
> - /* check next page if needed */
> - target_ulong virt_page2 = (pc & TARGET_PAGE_MASK) +
> - TARGET_PAGE_SIZE;
> - tb_page_addr_t phys_page2 = get_page_addr_code(env,
> virt_page2);
> -
> - if (tb->page_addr[1] == phys_page2) {
> - break;
> - }
> - }
> - }
> -
> - ptb1 = &tb->phys_hash_next;
> - tb = *ptb1;
> - }
> -
> - if (tb) {
> - /* Move the TB to the head of the list */
> - *ptb1 = tb->phys_hash_next;
> - tb->phys_hash_next = *tb_hash_head;
> - *tb_hash_head = tb;
> - }
> - return tb;
> + return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
> }
>
> static TranslationBlock *tb_find_slow(CPUState *cpu,
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index e076397..c1f59fa 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -215,8 +215,6 @@ struct TranslationBlock {
>
> void *tc_ptr; /* pointer to the translated code */
> uint8_t *tc_search; /* pointer to search data */
> - /* next matching tb for physical address. */
> - struct TranslationBlock *phys_hash_next;
> /* original tb when cflags has CF_NOCACHE */
> struct TranslationBlock *orig_tb;
> /* first and second physical page containing code. The lower bit
> diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
> index 5efe3d9..e209c1c 100644
> --- a/include/exec/tb-context.h
> +++ b/include/exec/tb-context.h
> @@ -21,9 +21,10 @@
> #define QEMU_TB_CONTEXT_H_
>
> #include "qemu/thread.h"
> +#include "qemu/qht.h"
>
> -#define CODE_GEN_PHYS_HASH_BITS 15
> -#define CODE_GEN_PHYS_HASH_SIZE (1 << CODE_GEN_PHYS_HASH_BITS)
> +#define CODE_GEN_HTABLE_BITS 15
> +#define CODE_GEN_HTABLE_SIZE (1 << CODE_GEN_HTABLE_BITS)
>
> typedef struct TranslationBlock TranslationBlock;
> typedef struct TBContext TBContext;
> @@ -31,7 +32,7 @@ typedef struct TBContext TBContext;
> struct TBContext {
>
> TranslationBlock *tbs;
> - TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
> + struct qht htable;
> int nb_tbs;
> /* any access to the tbs or the page table must use this lock */
> QemuMutex tb_lock;
> diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
> index 88ccfd1..1d0200b 100644
> --- a/include/exec/tb-hash.h
> +++ b/include/exec/tb-hash.h
> @@ -20,7 +20,6 @@
> #ifndef EXEC_TB_HASH
> #define EXEC_TB_HASH
>
> -#include "exec/exec-all.h"
> #include "exec/tb-hash-xx.h"
>
> /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
> @@ -49,7 +48,7 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
> static inline
> uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags)
> {
> - return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
> + return tb_hash_func5(phys_pc, pc, flags);
> }
>
> #endif
> diff --git a/translate-all.c b/translate-all.c
> index d75737c..b620fcc 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -735,6 +735,13 @@ static inline void code_gen_alloc(size_t tb_size)
> qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
> }
>
> +static void tb_htable_init(void)
> +{
> + unsigned int mode = QHT_MODE_AUTO_RESIZE;
> +
> + qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
> +}
> +
> /* Must be called before using the QEMU cpus. 'tb_size' is the size
> (in bytes) allocated to the translation buffer. Zero means default
> size. */
> @@ -742,6 +749,7 @@ void tcg_exec_init(unsigned long tb_size)
> {
> cpu_gen_init();
> page_init();
> + tb_htable_init();
> code_gen_alloc(tb_size);
> #if defined(CONFIG_SOFTMMU)
> /* There's no guest base to take into account, so go ahead and
> @@ -846,7 +854,7 @@ void tb_flush(CPUState *cpu)
> cpu->tb_flushed = true;
> }
>
> - memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
> + qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
> page_flush_tb();
>
> tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
> @@ -857,60 +865,46 @@ void tb_flush(CPUState *cpu)
>
> #ifdef DEBUG_TB_CHECK
>
> -static void tb_invalidate_check(target_ulong address)
> +static void
> +do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
> {
> - TranslationBlock *tb;
> - int i;
> + TranslationBlock *tb = p;
> + target_ulong addr = *(target_ulong *)userp;
>
> - address &= TARGET_PAGE_MASK;
> - for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> - for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> - tb = tb->phys_hash_next) {
> - if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
> - address >= tb->pc + tb->size)) {
> - printf("ERROR invalidate: address=" TARGET_FMT_lx
> - " PC=%08lx size=%04x\n",
> - address, (long)tb->pc, tb->size);
> - }
> - }
> + if (!(addr + TARGET_PAGE_SIZE <= tb->pc || addr >= tb->pc + tb->size)) {
> + printf("ERROR invalidate: address=" TARGET_FMT_lx
> + " PC=%08lx size=%04x\n", addr, (long)tb->pc, tb->size);
> }
> }
>
> -/* verify that all the pages have correct rights for code */
> -static void tb_page_check(void)
> +static void tb_invalidate_check(target_ulong address)
> {
> - TranslationBlock *tb;
> - int i, flags1, flags2;
> -
> - for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> - for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> - tb = tb->phys_hash_next) {
> - flags1 = page_get_flags(tb->pc);
> - flags2 = page_get_flags(tb->pc + tb->size - 1);
> - if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> - printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> - (long)tb->pc, tb->size, flags1, flags2);
> - }
> - }
> - }
> + address &= TARGET_PAGE_MASK;
> + qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
> }
>
> -#endif
> -
> -static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
> +static void
> +do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
> {
> - TranslationBlock *tb1;
> + TranslationBlock *tb = p;
> + int flags1, flags2;
>
> - for (;;) {
> - tb1 = *ptb;
> - if (tb1 == tb) {
> - *ptb = tb1->phys_hash_next;
> - break;
> - }
> - ptb = &tb1->phys_hash_next;
> + flags1 = page_get_flags(tb->pc);
> + flags2 = page_get_flags(tb->pc + tb->size - 1);
> + if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> + printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> + (long)tb->pc, tb->size, flags1, flags2);
> }
> }
>
> +/* verify that all the pages have correct rights for code */
> +static void tb_page_check(void)
> +{
> + qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
> +}
> +
> +#endif
> +
> static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
> {
> TranslationBlock *tb1;
> @@ -998,7 +992,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
> /* remove the TB from the hash list */
> phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
> h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> - tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
> + qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
>
> /* remove the TB from the page list */
> if (tb->page_addr[0] != page_addr) {
> @@ -1128,13 +1122,10 @@ static void tb_link_page(TranslationBlock *tb,
> tb_page_addr_t phys_pc,
> tb_page_addr_t phys_page2)
> {
> uint32_t h;
> - TranslationBlock **ptb;
>
> /* add in the hash table */
> h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> - ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> - tb->phys_hash_next = *ptb;
> - *ptb = tb;
> + qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
>
> /* add in the page list */
> tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);