qemu-commits

From: GitHub
Subject: [Qemu-commits] [qemu/qemu] 101924: tcg/i386: Use byte form of xgetbv instruction
Date: Fri, 22 Jun 2018 01:57:52 -0700

  Branch: refs/heads/master
  Home:   https://github.com/qemu/qemu
  Commit: 1019242af11400252f6735ca71a35f81ac23a66d
      
https://github.com/qemu/qemu/commit/1019242af11400252f6735ca71a35f81ac23a66d
  Author: John Arbuckle <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M tcg/i386/tcg-target.inc.c

  Log Message:
  -----------
  tcg/i386: Use byte form of xgetbv instruction

The assembler in most versions of Mac OS X is pretty old and does not
support the xgetbv instruction.  To work around this problem, the raw
encoding of the instruction is used instead.
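
For reference, a minimal sketch of the raw-encoding trick (the xgetbv
opcode bytes are 0f 01 d0; the wrapper function is illustrative, not
the patch's code):

    /* xgetbv reads an extended control register; XCR0 (index 0)
     * reports which register states the OS saves/restores.  Old
     * assemblers lack the mnemonic, so emit the raw bytes instead. */
    static uint64_t xgetbv_raw(uint32_t index)
    {
        uint32_t eax, edx;
        asm volatile(".byte 0x0f, 0x01, 0xd0" /* xgetbv */
                     : "=a"(eax), "=d"(edx) : "c"(index));
        return eax | ((uint64_t)edx << 32);
    }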

Signed-off-by: John Arbuckle <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 61b8cef1d42567d3029e0c7180cbd0f16cc4be2d
      
https://github.com/qemu/qemu/commit/61b8cef1d42567d3029e0c7180cbd0f16cc4be2d
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/cpu-exec.c
    M accel/tcg/translate-all.c
    M include/qemu/qht.h
    M tests/qht-bench.c
    M tests/test-qht.c
    M util/qht.c

  Log Message:
  -----------
  qht: require a default comparison function

qht_lookup now uses the default cmp function. qht_lookup_custom is defined
to retain the old behaviour, that is, a cmp function is explicitly provided.

qht_insert will gain use of the default cmp in the next patch.

Note that we move qht_lookup_custom's @func to be the last argument,
which makes the new qht_lookup as simple as possible.
Instead of this (i.e. keeping @func as the 2nd argument):
0000000000010750 <qht_lookup>:
   10750:       89 d1                   mov    %edx,%ecx
   10752:       48 89 f2                mov    %rsi,%rdx
   10755:       48 8b 77 08             mov    0x8(%rdi),%rsi
   10759:       e9 22 ff ff ff          jmpq   10680 <qht_lookup_custom>
   1075e:       66 90                   xchg   %ax,%ax

We get:
0000000000010740 <qht_lookup>:
   10740:       48 8b 4f 08             mov    0x8(%rdi),%rcx
   10744:       e9 37 ff ff ff          jmpq   10680 <qht_lookup_custom>
   10749:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
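
The resulting API shape, paraphrased as a sketch (that the wrapper
supplies ht->cmp as the stored default is inferred from the
disassembly above; exact declarations may differ):

    void *qht_lookup_custom(const struct qht *ht, const void *userp,
                            uint32_t hash, qht_lookup_func_t func);

    static inline void *qht_lookup(const struct qht *ht, const void *userp,
                                   uint32_t hash)
    {
        /* forward to the custom variant, supplying the default cmp */
        return qht_lookup_custom(ht, userp, hash, ht->cmp);
    }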

Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 32359d529f30bea8124ed671b2e6a22f22540488
      
https://github.com/qemu/qemu/commit/32359d529f30bea8124ed671b2e6a22f22540488
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/translate-all.c
    M include/qemu/qht.h
    M tests/qht-bench.c
    M tests/test-qht.c
    M util/qht.c

  Log Message:
  -----------
  qht: return existing entry when qht_insert fails

The meaning of "existing" is now changed to "matches in hash and
ht->cmp result". This is saner than just checking the pointer value.

Suggested-by: Richard Henderson <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: be2cdc5e352eb28b4ff631f053a261d91e6af78e
      
https://github.com/qemu/qemu/commit/be2cdc5e352eb28b4ff631f053a261d91e6af78e
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/cpu-exec.c
    M accel/tcg/translate-all.c
    M include/exec/exec-all.h
    M include/exec/tb-context.h
    M tcg/tcg.c
    M tcg/tcg.h

  Log Message:
  -----------
  tcg: track TBs with per-region BST's

This paves the way for enabling scalable parallel generation of TCG code.

Instead of tracking TBs with a single binary search tree (BST), use a
BST for each TCG region, protecting it with a lock. This is as scalable
as it gets, since each TCG thread operates on a separate region.

The core of this change is the introduction of struct tcg_region_tree,
which contains a pointer to a GTree and an associated lock to serialize
accesses to it. We then allocate an array of tcg_region_tree structs,
adding the appropriate padding to avoid false sharing based on
qemu_dcache_linesize.

Given a tc_ptr, we first find the corresponding region_tree. This is
done by special-casing the first and last regions, since they might be
of size != region.size; otherwise we just divide the offset by
region.stride. I was worried about this division (several dozen cycles
of latency), but profiling shows that this is not a fast path.
Note that region.stride is not required to be a power of two; it
is only required to be a multiple of the host's page size.
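
A sketch of the two pieces (paraphrased from the commit; region,
region_trees and tree_size are assumed to be the surrounding globals,
and the bounds checks are illustrative):

    struct tcg_region_tree {
        QemuMutex lock;
        GTree *tree;
        /* padded to qemu_dcache_linesize to avoid false sharing */
    };

    static struct tcg_region_tree *tc_ptr_to_region_tree(const void *p)
    {
        size_t region_idx;

        if (p < region.start_aligned) {
            region_idx = 0;                       /* first region */
        } else {
            size_t offset = (const char *)p -
                            (const char *)region.start_aligned;

            if (offset > region.stride * (region.n - 1)) {
                region_idx = region.n - 1;        /* last region */
            } else {
                region_idx = offset / region.stride;
            }
        }
        /* region_trees holds one padded entry of tree_size bytes
         * per region */
        return (struct tcg_region_tree *)((char *)region_trees +
                                          region_idx * tree_size);
    }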

Note that with this design we can also provide consistent snapshots
of all region trees at once; for instance, tcg_tb_foreach
acquires/releases all region_tree locks before/after iterating over them.
For this reason we now drop tb_lock in dump_exec_info().

As an alternative I considered implementing a concurrent BST, but this
can be tricky to get right, offers no consistent snapshots of the BST,
and performance and scalability-wise I don't think it could ever beat
having separate GTrees, given that our workload is insert-mostly (all
concurrent BST designs I've seen focus, understandably, on making
lookups fast, which comes at the expense of convoluted, non-wait-free
insertions/removals).

Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 128ed2278c4e6ad063f101c5dda7999b43f2d8a3
      
https://github.com/qemu/qemu/commit/128ed2278c4e6ad063f101c5dda7999b43f2d8a3
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/translate-all.c
    M include/exec/tb-context.h
    M tcg/tcg.c
    M tcg/tcg.h

  Log Message:
  -----------
  tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx

Thereby making it per-TCGContext. Once we remove tb_lock, this will
avoid an atomic increment every time a TB is invalidated.

Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 1e05197f24c49d52f339de9053bb1d17082f1be3
      
https://github.com/qemu/qemu/commit/1e05197f24c49d52f339de9053bb1d17082f1be3
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/translate-all.c
    M include/exec/exec-all.h

  Log Message:
  -----------
  translate-all: iterate over TBs in a page with PAGE_FOR_EACH_TB

This commit does several things, but to avoid churn I merged them all
into the same commit. To wit:

- Use uintptr_t instead of TranslationBlock * for the list of TBs in a page.
  Just like we did in c37e6d7e ("tcg: Use uintptr_t type for
  jmp_list_{next|first} fields of TB"), the rationale is the same: these
  are tagged pointers, not pointers. So use a more appropriate type.

- Only check the least significant bit of the tagged pointers. Masking
  with 3/~3 is unnecessary and confusing.

- Introduce the TB_FOR_EACH_TAGGED macro, and use it to define
  PAGE_FOR_EACH_TB, which improves readability (see the sketch after
  this list). Note that TB_FOR_EACH_TAGGED will gain another user in
  a subsequent patch.

- Update tb_page_remove to use PAGE_FOR_EACH_TB. In case there
  is a bug and we attempt to remove a TB that is not in the list, instead
  of segfaulting (since the list is NULL-terminated) we will reach
  g_assert_not_reached().
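
A paraphrased sketch of the iteration macros (the low bit of each
tagged link selects which of the TB's two page slots carries the next
pointer, since a TB can span two pages):

    #define TB_FOR_EACH_TAGGED(head, tb, n, field)                         \
        for (n = (head) & 1, tb = (TranslationBlock *)((head) & ~1);       \
             tb;                                                           \
             tb = (TranslationBlock *)tb->field[n], n = (uintptr_t)tb & 1, \
                 tb = (TranslationBlock *)((uintptr_t)tb & ~1))

    #define PAGE_FOR_EACH_TB(pagedesc, tb, n) \
        TB_FOR_EACH_TAGGED((pagedesc)->first_tb, tb, n, page_next)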

Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 78722ed0b826644ae240e3c0bbb6bdde02dfe7e1
      
https://github.com/qemu/qemu/commit/78722ed0b826644ae240e3c0bbb6bdde02dfe7e1
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/translate-all.c
    M docs/devel/multi-thread-tcg.txt

  Log Message:
  -----------
  translate-all: make l1_map lockless

Groundwork for supporting parallel TCG generation.

We never remove entries from the radix tree, so we can use cmpxchg
to implement lockless insertions.
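
A sketch of the pattern (paraphrased; atomic_rcu_read/atomic_cmpxchg
are QEMU's atomics helpers, lp points at a tree slot, and the
allocation size is illustrative):

    static void *l1_map_get_or_alloc(void **lp)
    {
        void *existing = atomic_rcu_read(lp);

        if (existing == NULL) {
            void *level = g_new0(void *, V_L2_SIZE);

            /* install only if the slot is still empty; atomic_cmpxchg
             * returns the previous value */
            existing = atomic_cmpxchg(lp, NULL, level);
            if (existing != NULL) {
                g_free(level);        /* lost the race; reuse theirs */
            } else {
                existing = level;     /* we installed it */
            }
        }
        return existing;
    }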

Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 94da9aec2a50f0c82e6c60939275c0337f03d5fe
      
https://github.com/qemu/qemu/commit/94da9aec2a50f0c82e6c60939275c0337f03d5fe
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/translate-all.c

  Log Message:
  -----------
  translate-all: remove hole in PageDesc

Groundwork for supporting parallel TCG generation.

Move the hole to the end of the struct, so that a u32
field can be added there without bloating the struct.
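
An illustrative example of the idea (not the actual PageDesc fields;
assumes a 64-bit host):

    struct before {
        void *a;          /* 8 bytes */
        uint32_t flags;   /* 4 bytes, then a 4-byte hole */
        void *b;          /* 8 bytes */
    };                    /* 24 bytes */

    struct after {
        void *a;
        void *b;
        uint32_t flags;   /* the 4-byte hole now sits at the end... */
        /* uint32_t later; ...where a new u32 can fill it for free */
    };                    /* still 24 bytes */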

Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: ae5486e273a4e368515a963a6d0076e20453eb72
      
https://github.com/qemu/qemu/commit/ae5486e273a4e368515a963a6d0076e20453eb72
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/translate-all.c

  Log Message:
  -----------
  translate-all: work page-by-page in tb_invalidate_phys_range_1

So that we pass a same-page range to tb_invalidate_phys_page_range,
instead of always passing an end address that could be on a different
page.

As discussed with Peter Maydell on the list [1], tb_invalidate_phys_page_range
doesn't actually do much with 'end', which explains why we have never
hit a bug despite going against what the comment on top of
tb_invalidate_phys_page_range requires:

> * Invalidate all TBs which intersect with the target physical address range
> * [start;end[. NOTE: start and end must refer to the *same* physical page.

The appended patch honours the comment, which avoids confusion.

While at it, rework the loop into a for loop, which is less error prone
(e.g. "continue" won't result in an infinite loop).

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg09165.html
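
A paraphrased sketch of the reworked loop (each call's end is clamped
to the current page; MIN and TARGET_PAGE_* are QEMU's existing macros):

    static void tb_invalidate_phys_range_1(tb_page_addr_t start,
                                           tb_page_addr_t end)
    {
        tb_page_addr_t next;

        for (next = (start & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
             start < end;
             start = next, next += TARGET_PAGE_SIZE) {
            /* each call now sees a range confined to one page */
            tb_invalidate_phys_page_range(start, MIN(end, next), 0);
        }
    }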

Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 45c73de594414904b0d6a7ade70fb4514d35f79c
      
https://github.com/qemu/qemu/commit/45c73de594414904b0d6a7ade70fb4514d35f79c
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/translate-all.c

  Log Message:
  -----------
  translate-all: move tb_invalidate_phys_page_range up in the file

This greatly simplifies next commit's diff.

Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 0b5c91f74f3c83a36f37740969df8c775c997e69
      
https://github.com/qemu/qemu/commit/0b5c91f74f3c83a36f37740969df8c775c997e69
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/translate-all.c
    M accel/tcg/translate-all.h
    M include/exec/exec-all.h

  Log Message:
  -----------
  translate-all: use per-page locking in !user-mode

Groundwork for supporting parallel TCG generation.

Instead of using a global lock (tb_lock) to protect changes
to pages, use fine-grained, per-page locks in !user-mode.
User-mode stays with mmap_lock.

Sometimes changes need to happen atomically on more than one
page (e.g. when a TB that spans across two pages is
added/invalidated, or when a range of pages is invalidated).
We therefore introduce struct page_collection, which helps
us keep track of a set of pages that have been locked in
the appropriate locking order (i.e. by ascending page index).

This commit first introduces the structs and the function helpers,
to then convert the calling code to use per-page locking. Note
that tb_lock is not removed yet.

While at it, rename tb_alloc_page to tb_page_add, which pairs with
tb_page_remove.
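
A sketch of the tracking structure (paraphrased; field names are
assumptions where the commit text does not give them):

    struct page_collection {
        GTree *tree;            /* page index -> locked PageDesc */
        struct page_entry *max; /* highest page index locked so far,
                                   enforcing ascending lock order */
    };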

Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 6d9abf85d538731ccff25fc29d7fa938115b1a80
      
https://github.com/qemu/qemu/commit/6d9abf85d538731ccff25fc29d7fa938115b1a80
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/translate-all.c

  Log Message:
  -----------
  translate-all: add page_locked assertions

This is only compiled under CONFIG_DEBUG_TCG to avoid
bloating the binary.

In user-mode, assert_page_locked is equivalent to assert_mmap_lock.

Note: There are some tb_lock assertions left that will be
removed by later patches.
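
The shape this takes, as a sketch (assert_mmap_lock is from the text
above; do_assert_page_locked is an assumed name for the debug-only
checker):

    #ifdef CONFIG_DEBUG_TCG
    #ifdef CONFIG_USER_ONLY
    /* user-mode: page locking is the mmap_lock */
    #define assert_page_locked(pd) assert_mmap_lock()
    #else
    #define assert_page_locked(pd) do_assert_page_locked(pd)
    #endif
    #else
    #define assert_page_locked(pd) /* compiled out to avoid bloat */
    #endif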

Reviewed-by: Richard Henderson <address@hidden>
Suggested-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: faa9372c07d062eb01f9da72e3f6c0f32efffca7
      
https://github.com/qemu/qemu/commit/faa9372c07d062eb01f9da72e3f6c0f32efffca7
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/cpu-exec.c
    M accel/tcg/translate-all.c
    M include/exec/exec-all.h

  Log Message:
  -----------
  translate-all: introduce assert_no_pages_locked

The appended patch adds assertions to make sure we do not longjmp with
page locks held. Note that user-mode has nothing to check, since page
locks are !user-mode only.
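
A sketch of the intended call sites (paraphrased; cpu->jmp_env is the
existing exec-loop jump buffer):

    /* no page locks may be held across the longjmp back to the
     * execution loop */
    assert_no_pages_locked();
    siglongjmp(cpu->jmp_env, 1);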

Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 95590e24af11236ef334f6bc3e2b71404a790ddb
      
https://github.com/qemu/qemu/commit/95590e24af11236ef334f6bc3e2b71404a790ddb
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/cpu-exec.c
    M accel/tcg/translate-all.c
    M docs/devel/multi-thread-tcg.txt

  Log Message:
  -----------
  translate-all: discard TB when tb_link_page returns an existing matching TB

Use the recently-gained QHT feature of returning the matching TB if it
already exists. This allows us to get rid of the lookup we perform
right after acquiring tb_lock.
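
A sketch of how tb_link_page can use it (paraphrased; tb_ctx.htable
and the hash h follow existing QEMU names):

    TranslationBlock *existing = NULL;

    if (!qht_insert(&tb_ctx.htable, tb, h, (void **)&existing)) {
        /* another thread linked a matching TB first: discard ours
         * and return the winner */
        return existing;
    }
    return tb;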

Suggested-by: Richard Henderson <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 194125e3ebd553acb02aaf3797a4f0387493fe94
      
https://github.com/qemu/qemu/commit/194125e3ebd553acb02aaf3797a4f0387493fe94
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/cpu-exec.c
    M accel/tcg/translate-all.c
    M docs/devel/multi-thread-tcg.txt
    M include/exec/exec-all.h

  Log Message:
  -----------
  translate-all: protect TB jumps with a per-destination-TB lock

This applies to both user-mode and !user-mode emulation.

Instead of relying on a global lock, protect the list of incoming
jumps with tb->jmp_lock. This lock also protects tb->cflags,
so update all tb->cflags readers outside tb->jmp_lock to use
atomic reads via tb_cflags().

In order to find the destination TB (and therefore its jmp_lock)
from the origin TB, we introduce tb->jmp_dest[].
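
The accessor this implies, as a sketch (the text names tb_cflags()
explicitly; atomic_read is QEMU's relaxed atomic load):

    /* read tb->cflags without holding the TB's jmp_lock */
    static inline uint32_t tb_cflags(const TranslationBlock *tb)
    {
        return atomic_read(&tb->cflags);
    }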

I considered not using a linked list of jumps, which simplifies
code and makes the struct smaller. However, it unnecessarily increases
memory usage, which results in a performance decrease. See for
instance these numbers booting+shutting down debian-arm:
                Time (s)  Rel. err (%)  Abs. err (s)  Rel. slowdown (%)
------------------------------------------------------------------------------
 before                  20.88          0.74      0.154512                 0.
 after                   20.81          0.38      0.079078        -0.33524904
 GTree                   21.02          0.28      0.058856         0.67049808
 GHashTable + xxhash     21.63          1.08      0.233604          3.5919540

Using a hash table or a binary tree to keep track of the jumps
doesn't really pay off, not only due to the increased memory usage,
but also because most TBs have only 0 or 1 jumps to them. The maximum
number of jumps when booting debian-arm that I measured is 35, but
as we can see in the histogram below a TB with that many incoming jumps
is extremely rare; the average TB has 0.80 incoming jumps.

n_jumps: 379208; avg jumps/tb: 0.801099
dist: [0.0,1.0)|▄█▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁  ▁▁▁     ▁|[34.0,35.0]

Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: b7542f7fe8f879b7b1e74f5fbd36b5746dbb6712
      
https://github.com/qemu/qemu/commit/b7542f7fe8f879b7b1e74f5fbd36b5746dbb6712
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/cputlb.c

  Log Message:
  -----------
  cputlb: remove tb_lock from tlb_flush functions

The acquisition of tb_lock was added when the async tlb_flush
was introduced in e3b9ca810 ("cputlb: introduce tlb_flush_* async work.")

tb_lock was there to allow us to do memset() on the tb_jmp_cache
arrays. However, since f3ced3c5928 ("tcg: consistently access
cpu->tb_jmp_cache atomically") all accesses to tb_jmp_cache are atomic,
so tb_lock is not needed here. Get rid of it.
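
A sketch of the lock-free clear (paraphrased; the atomic_set stores
are what f3ced3c5928 introduced):

    static void tb_jmp_cache_clear(CPUState *cpu)
    {
        unsigned i;

        /* atomic stores make the clear safe against concurrent readers */
        for (i = 0; i < TB_JMP_CACHE_SIZE; i++) {
            atomic_set(&cpu->tb_jmp_cache[i], NULL);
        }
    }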

Reviewed-by: Alex Bennée <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 705ad1ff0ce264475cb4c9a3aa31ba94a04869fe
      
https://github.com/qemu/qemu/commit/705ad1ff0ce264475cb4c9a3aa31ba94a04869fe
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/translate-all.c

  Log Message:
  -----------
  translate-all: remove tb_lock mention from cpu_restore_state_from_tb

tb_lock was needed when the function did retranslation. However,
since fca8a500d519 ("tcg: Save insn data and use it in
cpu_restore_state_from_tb") we don't do retranslation.

Get rid of the comment.

Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 0ac20318ce16f4de288969b2007ef5a654176058
      
https://github.com/qemu/qemu/commit/0ac20318ce16f4de288969b2007ef5a654176058
  Author: Emilio G. Cota <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M accel/tcg/cpu-exec.c
    M accel/tcg/translate-all.c
    M accel/tcg/translate-all.h
    M docs/devel/multi-thread-tcg.txt
    M exec.c
    M include/exec/cpu-common.h
    M include/exec/exec-all.h
    M include/exec/memory-internal.h
    M include/exec/tb-context.h
    M linux-user/main.c
    M tcg/tcg.h

  Log Message:
  -----------
  tcg: remove tb_lock

Use mmap_lock in user-mode to protect TCG state and the page descriptors.
In !user-mode, each vCPU has its own TCG state, so no locks needed.
Per-page locks are used to protect the page descriptors.

Per-TB locks are used in both modes to protect TB jumps.

Some notes:

- tb_lock is removed from notdirty_mem_write by passing a
  locked page_collection to tb_invalidate_phys_page_fast.

- tcg_tb_lookup/remove/insert/etc have their own internal lock(s),
  so there is no need to further serialize access to them.

- do_tb_flush is run in a safe async context, meaning no other
  vCPU threads are running. Therefore acquiring mmap_lock there
  is just to please tools such as thread sanitizer.

- Not visible in the diff, but tb_invalidate_phys_page already
  has an assert_memory_lock.

- cpu_io_recompile is !user-only, so no mmap_lock there.

- Added mmap_unlock()'s before all siglongjmp's that could
  be called in user-mode while mmap_lock is held.
  + Added an assert for !have_mmap_lock() after returning from
    the longjmp in cpu_exec, just like we do in cpu_exec_step_atomic.

Performance numbers before/after:

Host: AMD Opteron(tm) Processor 6376
            ubuntu 17.04 ppc64 bootup+shutdown time

  700 +-+--+----+------+------------+-----------+------------*--+-+
      |    +    +      +            +           +           *B    |
      |         before ***B***                            ** *    |
      |tb lock removal ###D###                         ***        |
  600 +-+                                           ***         +-+
      |                                           **         #    |
      |                                        *B*          #D    |
      |                                     *** *         ##      |
  500 +-+                                ***           ###      +-+
      |                             * ***           ###           |
      |                            *B*          # ##              |
      |                          ** *          #D#                |
  400 +-+                      **            ##                 +-+
      |                      **           ###                     |
      |                    **           ##                        |
      |                  **         # ##                          |
  300 +-+  *           B*          #D#                          +-+
      |    B         ***        ###                               |
      |    *       **       ####                                  |
      |     *   ***      ###                                      |
  200 +-+   B  *B     #D#                                       +-+
      |     #B* *   ## #                                          |
      |     #*    ##                                              |
      |    + D##D#     +            +           +            +    |
  100 +-+--+----+------+------------+-----------+------------+--+-+
     1    8      16      Guest CPUs       48           64
  png: https://imgur.com/HwmBHXe
         debian jessie aarch64 bootup+shutdown time

  90 +-+--+-----+-----+------------+------------+------------+--+-+
     |    +     +     +            +            +            +    |
     |         before ***B***                                B    |
  80 +tb lock removal ###D###                              **D  +-+
     |                                                   **###    |
     |                                                 **##       |
  70 +-+                                             ** #       +-+
     |                                             ** ##          |
     |                                           **  #            |
  60 +-+                                       *B  ##           +-+
     |                                       **  ##               |
     |                                    ***  #D                 |
  50 +-+                               ***   ##                 +-+
     |                             * **   ###                     |
     |                           **B*  ###                        |
  40 +-+                     ****  # ##                         +-+
     |                   ****     #D#                             |
     |             ***B**      ###                                |
  30 +-+    B***B**        ####                                 +-+
     |    B *   *     # ###                                       |
     |     B       ###D#                                          |
  20 +-+   D  ##D##                                             +-+
     |      D#                                                    |
     |    +     +     +            +            +            +    |
  10 +-+--+-----+-----+------------+------------+------------+--+-+
    1     8     16      Guest CPUs        48           64
  png: https://imgur.com/iGpGFtv

The gains are high for 4-8 CPUs. Beyond that point, however, unrelated
lock contention significantly hurts scalability.

Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 9f754620651d3432114f4bb89c7f12cbea814b3e
      
https://github.com/qemu/qemu/commit/9f754620651d3432114f4bb89c7f12cbea814b3e
  Author: Richard Henderson <address@hidden>
  Date:   2018-06-15 (Fri, 15 Jun 2018)

  Changed paths:
    M tcg/aarch64/tcg-target.inc.c
    M tcg/arm/tcg-target.inc.c
    M tcg/i386/tcg-target.inc.c
    M tcg/mips/tcg-target.inc.c
    M tcg/ppc/tcg-target.inc.c
    M tcg/s390/tcg-target.inc.c
    M tcg/sparc/tcg-target.inc.c
    M tcg/tcg.c
    M tcg/tcg.h
    M tcg/tci/tcg-target.inc.c

  Log Message:
  -----------
  tcg: Reduce max TB opcode count

Also, assert that we don't overflow either of two different offsets into
the TB. Both unwind and goto_tb record a uint16_t for later use.

This fixes an arm-softmmu test case utilizing NEON in which there is
a TB generated that runs to 7800 opcodes, and compiles to 96k on an
x86_64 host.  This overflows the 16-bit offset in which we record the
goto_tb reset offset.  Because of that overflow, we install a jump
destination that goes to neverland.  Boom.

With this reduced op count, the same TB compiles to about 48k for
aarch64, ppc64le, and x86_64 hosts, and neither assertion fires.

Cc: address@hidden
Reported-by: "Jason A. Donenfeld" <address@hidden>
Reviewed-by: Philippe Mathieu-Daudé <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 33836a731562e3d07b3a83f26e81c6b1482d216c
      
https://github.com/qemu/qemu/commit/33836a731562e3d07b3a83f26e81c6b1482d216c
  Author: Peter Maydell <address@hidden>
  Date:   2018-06-21 (Thu, 21 Jun 2018)

  Changed paths:
    M accel/tcg/cpu-exec.c
    M accel/tcg/cputlb.c
    M accel/tcg/translate-all.c
    M accel/tcg/translate-all.h
    M docs/devel/multi-thread-tcg.txt
    M exec.c
    M include/exec/cpu-common.h
    M include/exec/exec-all.h
    M include/exec/memory-internal.h
    M include/exec/tb-context.h
    M include/qemu/qht.h
    M linux-user/main.c
    M tcg/aarch64/tcg-target.inc.c
    M tcg/arm/tcg-target.inc.c
    M tcg/i386/tcg-target.inc.c
    M tcg/mips/tcg-target.inc.c
    M tcg/ppc/tcg-target.inc.c
    M tcg/s390/tcg-target.inc.c
    M tcg/sparc/tcg-target.inc.c
    M tcg/tcg.c
    M tcg/tcg.h
    M tcg/tci/tcg-target.inc.c
    M tests/qht-bench.c
    M tests/test-qht.c
    M util/qht.c

  Log Message:
  -----------
  Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20180615' into staging

TCG patch queue:

Work around macOS assembler lossage.
Eliminate tb_lock.
Fix TB code generation overflow.

# gpg: Signature made Fri 15 Jun 2018 20:40:56 BST
# gpg:                using RSA key 64DF38E8AF7E215F
# gpg: Good signature from "Richard Henderson <address@hidden>"
# Primary key fingerprint: 7A48 1E78 868B 4DB6 A85A  05C0 64DF 38E8 AF7E 215F

* remotes/rth/tags/pull-tcg-20180615:
  tcg: Reduce max TB opcode count
  tcg: remove tb_lock
  translate-all: remove tb_lock mention from cpu_restore_state_from_tb
  cputlb: remove tb_lock from tlb_flush functions
  translate-all: protect TB jumps with a per-destination-TB lock
  translate-all: discard TB when tb_link_page returns an existing matching TB
  translate-all: introduce assert_no_pages_locked
  translate-all: add page_locked assertions
  translate-all: use per-page locking in !user-mode
  translate-all: move tb_invalidate_phys_page_range up in the file
  translate-all: work page-by-page in tb_invalidate_phys_range_1
  translate-all: remove hole in PageDesc
  translate-all: make l1_map lockless
  translate-all: iterate over TBs in a page with PAGE_FOR_EACH_TB
  tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx
  tcg: track TBs with per-region BST's
  qht: return existing entry when qht_insert fails
  qht: require a default comparison function
  tcg/i386: Use byte form of xgetbv instruction

Signed-off-by: Peter Maydell <address@hidden>


Compare: https://github.com/qemu/qemu/compare/46012db66699...33836a731562
      **NOTE:** This service has been marked for deprecation:
https://developer.github.com/changes/2018-04-25-github-services-deprecation/

      Functionality will be removed from GitHub.com on January 31st, 2019.
