qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH for-4.0 00/17] tcg: Move softmmu out-of-line


From: Emilio G. Cota
Subject: Re: [Qemu-devel] [PATCH for-4.0 00/17] tcg: Move softmmu out-of-line
Date: Thu, 15 Nov 2018 20:13:38 -0500
User-agent: Mutt/1.9.4 (2018-02-28)

On Thu, Nov 15, 2018 at 23:04:50 +0100, Richard Henderson wrote:
> On 11/15/18 7:48 PM, Emilio G. Cota wrote:
> > - Segfault in code_gen_buffer. This one I don't have a fix for,
> >   but it's *much* easier to reproduce when -tb-size is very small,
> >   e.g. "-tb-size 5 -smp 2" (BTW it crashes with x86_64 guests too.)
> >   So at first I thought the code cache flushing was the problem,
> >   but I don't see how that could be, at least from a TCGContext
> >   viewpoint -- I agree that clearing the hash table in
> >   tcg_region_assign is a good place to do so.
> 
> Ho hum.
> 
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 639f0b2728..115ea186e5 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1831,10 +1831,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>      existing_tb = tb_link_page(tb, phys_pc, phys_page2);
>      /* if the TB already exists, discard what we just translated */
>      if (unlikely(existing_tb != tb)) {
> -        uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
> -
> -        orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
> -        atomic_set(&tcg_ctx->code_gen_ptr, (void *)orig_aligned);
>          return existing_tb;
>      }
>      tcg_tb_insert(tb);
> 
> We can't easily undo the hash table insert, and for a relatively rare
> occurrence it's not worth the effort.

Nice catch! Everything works now =D

In the bootup+shutdown aarch64 test with -smp 12, we end up
discarding ~2500 TB's--that's ~439K of space for code that we
do not waste; note that I'm assuming 180 host bytes per TB,
which is the average reported by info jit.

We can still discard most of these by increasing a counter every
time we insert a new element into the OOL table, and checking
this counter before/after tcg_gen_code. (Note that checking
g_hash_table_size before/after is not enough, because we might
have replaced an existing item from the table.)
Then, we discard a TB iff an OOL thunk was generated. (Diff below.)

This allows us to discard most TBs; in the example above,
we end up *not* discarding only ~70 TBs, that is we end up keeping
only 70/2500 = 2.8% of the TBs that we'd discard without OOL.

Performance-wise it doesn't make a difference for -smp 1:

Host: Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (5 runs):

- Before (3.1.0-rc1):

      14351.436177      task-clock (msec)         #    0.998 CPUs utilized      
      ( +-  0.24% )
    49,963,260,126      cycles                    #    3.481 GHz                
      ( +-  0.22% )  (83.32%)
    26,047,650,654      stalled-cycles-frontend   #   52.13% frontend cycles 
idle     ( +-  0.29% )  (83.34%)
    19,717,480,482      stalled-cycles-backend    #   39.46% backend  cycles 
idle     ( +-  0.27% )  (66.67%)
    59,278,011,067      instructions              #    1.19  insns per cycle    
    
                                                  #    0.44  stalled cycles per 
insn  ( +-  0.17% )  (83.34%)
    10,632,601,608      branches                  #  740.874 M/sec              
      ( +-  0.17% )  (83.34%)
       236,153,469      branch-misses             #    2.22% of all branches    
      ( +-  0.16% )  (83.35%)

      14.382847823 seconds time elapsed                                         
 ( +-  0.25% )

- After this series (with the fixes we've discussed):

      13256.198927      task-clock (msec)         #    0.998 CPUs utilized      
      ( +-  0.04% )
    46,146,457,353      cycles                    #    3.481 GHz                
      ( +-  0.08% )  (83.34%)
    22,632,342,565      stalled-cycles-frontend   #   49.04% frontend cycles 
idle     ( +-  0.12% )  (83.35%)
    16,534,690,741      stalled-cycles-backend    #   35.83% backend  cycles 
idle     ( +-  0.15% )  (66.67%)
    58,047,832,548      instructions              #    1.26  insns per cycle    
    
                                                  #    0.39  stalled cycles per 
insn  ( +-  0.18% )  (83.34%)
    11,031,634,880      branches                  #  832.187 M/sec              
      ( +-  0.12% )  (83.33%)
       210,593,929      branch-misses             #    1.91% of all branches    
      ( +-  0.30% )  (83.33%)

      13.285023783 seconds time elapsed                                         
 ( +-  0.05% )

- After the fixup below:

      13240.889734      task-clock (msec)         #    0.998 CPUs utilized      
      ( +-  0.19% )
    46,074,292,775      cycles                    #    3.480 GHz                
      ( +-  0.12% )  (83.35%)
    22,670,132,770      stalled-cycles-frontend   #   49.20% frontend cycles 
idle     ( +-  0.17% )  (83.35%)
    16,598,822,504      stalled-cycles-backend    #   36.03% backend  cycles 
idle     ( +-  0.26% )  (66.66%)
    57,796,083,344      instructions              #    1.25  insns per cycle    
    
                                                  #    0.39  stalled cycles per 
insn  ( +-  0.16% )  (83.34%)
    11,002,340,174      branches                  #  830.937 M/sec              
      ( +-  0.11% )  (83.35%)
       211,023,549      branch-misses             #    1.92% of all branches    
      ( +-  0.22% )  (83.32%)

      13.264499034 seconds time elapsed                                         
 ( +-  0.19% )

I'll generate now some more perf numbers that we could include in the
commit logs.

Thanks,

                Emilio

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 115ea18..15f7d4e 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1678,6 +1678,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     target_ulong virt_page2;
     tcg_insn_unit *gen_code_buf;
     int gen_code_size, search_size;
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+    size_t n_ool_thunks;
+#endif
 #ifdef CONFIG_PROFILER
     TCGProfile *prof = &tcg_ctx->prof;
     int64_t ti;
@@ -1744,6 +1747,10 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     ti = profile_getclock();
 #endif
 
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+    n_ool_thunks = tcg_ctx->n_ool_thunks;
+#endif
+
     /* ??? Overflow could be handled better here.  In particular, we
        don't need to re-do gen_intermediate_code, nor should we re-do
        the tcg optimization currently hidden inside tcg_gen_code.  All
@@ -1831,6 +1838,18 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     existing_tb = tb_link_page(tb, phys_pc, phys_page2);
     /* if the TB already exists, discard what we just translated */
     if (unlikely(existing_tb != tb)) {
+        bool discard = true;
+
+#ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
+        /* only discard the TB if we didn't generate an OOL thunk */
+        discard = tcg_ctx->n_ool_thunks == n_ool_thunks;
+#endif
+        if (discard) {
+            uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
+
+            orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
+            atomic_set(&tcg_ctx->code_gen_ptr, (void *)orig_aligned);
+        }
         return existing_tb;
     }
     tcg_tb_insert(tb);
diff --git a/tcg/tcg-ldst-ool.inc.c b/tcg/tcg-ldst-ool.inc.c
index 8fb6550..61da060 100644
--- a/tcg/tcg-ldst-ool.inc.c
+++ b/tcg/tcg-ldst-ool.inc.c
@@ -69,6 +69,7 @@ static bool tcg_out_ldst_ool_finalize(TCGContext *s)
 
         /* Remember the thunk for next time.  */
         g_hash_table_replace(s->ldst_ool_thunks, key, dest);
+        s->n_ool_thunks++;
 
         /* The new thunk must be in range.  */
         ok = patch_reloc(lb->label, lb->reloc, (intptr_t)dest, lb->addend);
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 1255d2a..d4f07a6 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -709,6 +709,7 @@ struct TCGContext {
 #ifdef TCG_TARGET_NEED_LDST_OOL_LABELS
     QSIMPLEQ_HEAD(ldst_labels, TCGLabelQemuLdstOol) ldst_ool_labels;
     GHashTable *ldst_ool_thunks;
+    size_t n_ool_thunks;
 #endif
 #ifdef TCG_TARGET_NEED_POOL_LABELS
     struct TCGLabelPoolData *pool_labels;




reply via email to

[Prev in Thread] Current Thread [Next in Thread]