Re: [Qemu-devel] [PATCH 5/5] tcg/arm: improve direct jump


From: Aurelien Jarno
Subject: Re: [Qemu-devel] [PATCH 5/5] tcg/arm: improve direct jump
Date: Wed, 10 Oct 2012 16:28:23 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Oct 10, 2012 at 03:21:48PM +0200, Laurent Desnogues wrote:
> On Tue, Oct 9, 2012 at 10:30 PM, Aurelien Jarno <address@hidden> wrote:
> > Use ldr pc, [pc, #-4] kind of branch for direct jump. This removes the
> > need to flush the icache on TB linking, and allows removing the limit
> > on the code generation buffer.
> 
> I'm not sure I like it.  In general having data in the middle
> of code will increase I/D cache and I/D TLB pressure.

Agreed. On the other hand, this patch removes the need to synchronize
the instruction cache on TB linking/unlinking.
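
To make the trade-off concrete, here is a minimal sketch of the jump slot
this kind of branch implies. It is not the code from the patch: the helper
names and the plain uint32_t buffer standing in for the code buffer are
assumptions for illustration. The only hard fact used is the ARM-state
encoding of "ldr pc, [pc, #-4]", which reads the word immediately after
the instruction because PC reads as the instruction address + 8.

```c
#include <assert.h>
#include <stdint.h>

/* ARM-state encoding of "ldr pc, [pc, #-4]". PC reads as the address of
 * the instruction + 8, so [pc, #-4] is the word right after it. */
#define ARM_LDR_PC_PC_M4 0xe51ff004u

/* Hypothetical helper: emit a two-word jump slot at word index idx.
 * buf stands in for the code generation buffer. */
static void emit_jump_slot(uint32_t *buf, unsigned idx, uint32_t target)
{
    buf[idx]     = ARM_LDR_PC_PC_M4; /* instruction word, never rewritten */
    buf[idx + 1] = target;           /* literal: absolute 32-bit destination */
}

/* Hypothetical helper: retarget an existing slot (TB link or unlink). */
static void patch_jump_slot(uint32_t *buf, unsigned idx, uint32_t new_target)
{
    assert(buf[idx] == ARM_LDR_PC_PC_M4);
    buf[idx + 1] = new_target;       /* plain data store, instruction untouched */
}
```

Retargeting only changes a data word, which the load fetches through the
data side, so no instruction word is modified. The cost is the one Laurent
points out: each slot places a data word in the middle of the code.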

> > This improves the boot-up speed of a MIPS guest by 11%.
> 
> Boot speed is very specific.  Did you test some other code?
> Also what was your host?

I tested it on a Cortex-A8 machine. I have only tested MIPS, but I can
do more tests, like running the openssl testsuite in the emulated guest.

> Testing on a quad core Cortex-A9, using all of your patches
> (including TCG optimizations), I get this running nbench i386
> in user mode:
> 
> TEST                : Iter/sec.  : Old Index  : New Index
>                     :            : Pentium 90 : AMD K6/233
> --------------------:------------:------------:-----------
> NUMERIC SORT        :     119.48 :       3.06 :       1.01
> STRING SORT         :     7.7907 :       3.48 :       0.54
> BITFIELD            : 2.2049e+07 :       3.78 :       0.79
> FP EMULATION        :      5.094 :       2.44 :       0.56
> FOURIER             :     483.73 :       0.55 :       0.31
> ASSIGNMENT          :      1.778 :       6.77 :       1.75
> IDEA                :     341.43 :       5.22 :       1.55
> HUFFMAN             :     45.942 :       1.27 :       0.41
> NEURAL NET          :    0.16667 :       0.27 :       0.11
> LU DECOMPOSITION    :      5.969 :       0.31 :       0.22
> ===================ORIGINAL BYTEMARK RESULTS==============
> INTEGER INDEX       : 3.319
> FLOATING-POINT INDEX: 0.357
> =======================LINUX DATA BELOW===================
> MEMORY INDEX        : 0.907
> INTEGER INDEX       : 0.774
> FLOATING-POINT INDEX: 0.198
> 
> Not using this patch, I get:
> 
> TEST                : Iter/sec.  : Old Index  : New Index
>                     :            : Pentium 90 : AMD K6/233
> --------------------:------------:------------:-----------
> NUMERIC SORT        :     121.88 :       3.13 :       1.03
> STRING SORT         :     7.8438 :       3.50 :       0.54
> BITFIELD            : 2.2597e+07 :       3.88 :       0.81
> FP EMULATION        :     5.1424 :       2.47 :       0.57
> FOURIER             :     466.04 :       0.53 :       0.30
> ASSIGNMENT          :      1.809 :       6.88 :       1.79
> IDEA                :     359.28 :       5.50 :       1.63
> HUFFMAN             :     46.225 :       1.28 :       0.41
> NEURAL NET          :    0.16644 :       0.27 :       0.11
> LU DECOMPOSITION    :       5.77 :       0.30 :       0.22
> ===================ORIGINAL BYTEMARK RESULTS==============
> INTEGER INDEX       : 3.384
> FLOATING-POINT INDEX: 0.349
> =======================LINUX DATA BELOW===================
> MEMORY INDEX        : 0.922
> INTEGER INDEX       : 0.790
> FLOATING-POINT INDEX: 0.193
> 
> This patch doesn't bring any speedup in that case.
> 
> I guess we need more testing as a synthetic benchmark is as
> specific as kernel booting :-)
> 

This doesn't really surprise me. The goal of the patch is to remove the
limit of 16MB for the generated code. I really doubt you reach such a
limit in user mode unless you use some complex code.

On the other hand, in system mode this limit can already be reached once
the whole guest kernel is translated, so cached code is dropped and has to
be re-translated regularly. Re-translating guest code is clearly more
expensive than the increase in I/D cache and I/D TLB pressure.
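
For reference, the reason a PC-relative branch puts a ceiling on the
buffer at all can be sketched as follows. An ARM-state B instruction holds
a signed 24-bit word offset relative to the instruction address + 8, so it
reaches roughly +/-32MB. The function names below are hypothetical and this
is not QEMU code; it only illustrates the range check and the encoding.
The 16MB figure is the cap discussed in this thread, and the sketch does
not try to derive that exact value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if an ARM-state B at address `from` can reach `to`. The offset is
 * relative to from + 8 and must fit a signed 24-bit count of words. */
static bool arm_b_in_range(uint32_t from, uint32_t to)
{
    int64_t disp = (int64_t)to - ((int64_t)from + 8);
    return (disp & 3) == 0 && disp >= -0x2000000 && disp <= 0x1fffffc;
}

/* Encode an unconditional B (cond = AL). Out-of-range targets are a
 * caller error in this sketch. */
static uint32_t arm_encode_b(uint32_t from, uint32_t to)
{
    assert(arm_b_in_range(from, to));
    uint32_t disp = to - (from + 8);   /* wraps modulo 2^32 for backward jumps */
    return 0xea000000u | ((disp >> 2) & 0x00ffffffu);
}
```

Any two addresses inside a 16MB buffer pass arm_b_in_range(), which is
what makes patching a B in place possible. An absolute 32-bit target
loaded from memory has no such range restriction.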

The other way to allow more than 16MB of generated code would be to
disable direct jumps on ARM. That adds one 32-bit constant load plus one
memory load per jump, but then you don't have the I/D cache and TLB issue.

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
address@hidden                 http://www.aurel32.net


