From: Emilio G. Cota
Subject: [Qemu-devel] [PATCH 22/22] translate-all: do not hold tb_lock during code generation in softmmu
Date: Sun, 9 Jul 2017 03:50:14 -0400
Each vCPU can now generate code with TCG in parallel. Thus,
drop tb_lock around code generation in softmmu.
Note that we still have to take tb_lock after code translation,
since there is global state that we have to update.
Nonetheless, holding tb_lock for less time yields significant performance
improvements for translation-heavy workloads. A good example
is booting Linux: in my measurements, the bootup+shutdown time of
debian-arm drops by 20% with this entire patchset applied, when
using -smp 8 and MTTCG on a machine with >= 8 real cores:
Host: Intel(R) Xeon(R) CPU E5-2690 @ 2.90GHz
Performance counter stats for 'qemu/build/arm-softmmu/qemu-system-arm \
-machine type=virt -nographic -smp 1 -m 4096 \
-netdev user,id=unet,hostfwd=tcp::2222-:22 \
-device virtio-net-device,netdev=unet \
-drive file=foobar.qcow2,id=myblock,index=0,if=none \
-device virtio-blk-device,drive=myblock \
-kernel /foobar.img -append console=ttyAMA0 root=/dev/vda1 \
-name arm,debug-threads=on -smp 8' (3 runs):
Before:
28764.018852 task-clock # 1.663 CPUs utilized
( +- 0.30% )
727,490 context-switches # 0.025 M/sec
( +- 0.68% )
2,429 CPU-migrations # 0.000 M/sec
( +- 11.36% )
14,042 page-faults # 0.000 M/sec
( +- 1.00% )
70,644,349,920 cycles # 2.456 GHz
( +- 0.96% ) [83.42%]
37,129,806,098 stalled-cycles-frontend # 52.56% frontend cycles idle
( +- 1.27% ) [83.20%]
26,620,190,524 stalled-cycles-backend # 37.68% backend cycles idle
( +- 1.29% ) [66.50%]
85,528,287,892 instructions # 1.21 insns per cycle
# 0.43 stalled cycles per insn
( +- 0.62% ) [83.40%]
14,417,482,689 branches # 501.233 M/sec
( +- 0.49% ) [83.36%]
321,182,192 branch-misses # 2.23% of all branches
( +- 1.17% ) [83.53%]
17.297750583 seconds time elapsed
( +- 1.08% )
After:
28690.888633 task-clock # 2.069 CPUs utilized
( +- 1.54% )
473,947 context-switches # 0.017 M/sec
( +- 1.32% )
2,793 CPU-migrations # 0.000 M/sec
( +- 18.74% )
22,634 page-faults # 0.001 M/sec
( +- 1.20% )
69,314,663,510 cycles # 2.416 GHz
( +- 1.08% ) [83.50%]
36,114,710,208 stalled-cycles-frontend # 52.10% frontend cycles idle
( +- 1.64% ) [83.26%]
25,519,842,658 stalled-cycles-backend # 36.82% backend cycles idle
( +- 1.70% ) [66.77%]
84,588,443,638 instructions # 1.22 insns per cycle
# 0.43 stalled cycles per insn
( +- 0.78% ) [83.44%]
14,258,100,183 branches # 496.956 M/sec
( +- 0.87% ) [83.32%]
324,984,804 branch-misses # 2.28% of all branches
( +- 0.51% ) [83.17%]
13.870347754 seconds time elapsed
( +- 1.65% )
That is, a speedup of 17.29/13.87=1.24X.
Similar numbers on a slower machine:
Host: AMD Opteron(tm) Processor 6376:
Before:
74765.850569 task-clock (msec) # 1.956 CPUs utilized
( +- 1.42% )
841,430 context-switches # 0.011 M/sec
( +- 2.50% )
18,228 cpu-migrations # 0.244 K/sec
( +- 2.87% )
26,565 page-faults # 0.355 K/sec
( +- 9.19% )
98,775,815,944 cycles # 1.321 GHz
( +- 1.40% ) (83.44%)
26,325,365,757 stalled-cycles-frontend # 26.65% frontend cycles idle
( +- 1.96% ) (83.26%)
17,270,620,447 stalled-cycles-backend # 17.48% backend cycles idle
( +- 3.45% ) (33.32%)
82,998,905,540 instructions # 0.84 insns per cycle
# 0.32 stalled cycles per insn
( +- 0.71% ) (50.06%)
14,209,593,402 branches # 190.055 M/sec
( +- 1.01% ) (66.74%)
571,258,648 branch-misses # 4.02% of all branches
( +- 0.20% ) (83.40%)
38.220740889 seconds time elapsed
( +- 0.72% )
After:
73281.226761 task-clock (msec) # 2.415 CPUs utilized
( +- 0.29% )
571,984 context-switches # 0.008 M/sec
( +- 1.11% )
14,301 cpu-migrations # 0.195 K/sec
( +- 2.90% )
42,635 page-faults # 0.582 K/sec
( +- 7.76% )
98,478,185,775 cycles # 1.344 GHz
( +- 0.32% ) (83.39%)
25,555,945,935 stalled-cycles-frontend # 25.95% frontend cycles idle
( +- 0.47% ) (83.37%)
15,174,223,390 stalled-cycles-backend # 15.41% backend cycles idle
( +- 0.83% ) (33.26%)
81,939,511,983 instructions # 0.83 insns per cycle
# 0.31 stalled cycles per insn
( +- 0.12% ) (49.95%)
13,992,075,918 branches # 190.937 M/sec
( +- 0.16% ) (66.65%)
580,790,655 branch-misses # 4.15% of all branches
( +- 0.20% ) (83.26%)
30.340574988 seconds time elapsed
( +- 0.39% )
That is, a speedup of 38.22/30.34=1.26X.
Signed-off-by: Emilio G. Cota <address@hidden>
---
accel/tcg/cpu-exec.c | 7 ++++++-
accel/tcg/translate-all.c | 22 ++++++++++++++++++++++
2 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 54ecae2..2b34d58 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -351,6 +351,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
* single threaded the locks are NOPs.
*/
mmap_lock();
+#ifdef CONFIG_USER_ONLY
tb_lock();
have_tb_lock = true;
@@ -362,7 +363,11 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
/* if no translated code available, then translate it now */
tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
}
-
+#else
+ tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+ /* tb_gen_code returns with tb_lock acquired */
+ have_tb_lock = true;
+#endif
mmap_unlock();
}
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 17b18a9..6cab609 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -887,7 +887,9 @@ static TranslationBlock *tb_alloc(target_ulong pc)
{
TranslationBlock *tb;
+#ifdef CONFIG_USER_ONLY
assert_tb_locked();
+#endif
tb = tcg_tb_alloc(&tcg_ctx);
if (unlikely(tb == NULL)) {
@@ -1314,7 +1316,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
TCGProfile *prof = &tcg_ctx.prof;
int64_t ti;
#endif
+#ifdef CONFIG_USER_ONLY
assert_memory_lock();
+#endif
phys_pc = get_page_addr_code(env, pc);
if (use_icount && !(cflags & CF_IGNORE_ICOUNT)) {
@@ -1430,6 +1434,24 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
if ((pc & TARGET_PAGE_MASK) != virt_page2) {
phys_page2 = get_page_addr_code(env, virt_page2);
}
+ if (!have_tb_lock) {
+ TranslationBlock *t;
+
+ tb_lock();
+ /*
+ * There's a chance that our desired tb has been translated while
+ * we were translating it.
+ */
+ t = tb_htable_lookup(cpu, pc, cs_base, flags);
+ if (unlikely(t)) {
+ /* discard what we just translated */
+ uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
+
+ orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
+ atomic_set(&tcg_ctx.code_gen_ptr, orig_aligned);
+ return t;
+ }
+ }
/* As long as consistency of the TB stuff is provided by tb_lock in user
* mode and is implicit in single-threaded softmmu emulation, no explicit
* memory barrier is required before tb_link_page() makes the TB visible
--
2.7.4