From: Kirill Batuzov
Subject: Re: [Qemu-devel] [PATCH RFC 0/7] Translate guest vector operations to host vector operations
Date: Thu, 16 Oct 2014 15:07:25 +0400 (MSK)
User-agent: Alpine 2.02 (DEB 1266 2009-07-14)

On Thu, 16 Oct 2014, Alex Bennée wrote:

> >
> > From Valgrind experience there is enough genericism. Valgrind can translate
> > SSE, AltiVec and NEON instructions to vector opcodes. Most of the opcodes
> > are reused between instruction sets.
> 
> Doesn't Valgrind have the advantage of same-arch->same-arch (I've not
> looked at its generated code in detail though)?
>

Yes, they have this advantage, but Valgrind tools look at the intermediate
code in an architecture-independent way. For the tools to work, each
opcode's semantics must be preserved across architectures. For example,
Iop_QAdd16Sx4 (signed 16-bit addition with saturation) must have the same
meaning on ARM (the vqadd.s16 instruction) and on x86 (the paddsw
instruction). So in most cases where Valgrind uses the same opcode for
instructions from different architectures, QEMU can do the same.
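A scalar sketch of that shared semantics (this is an illustrative model, not
Valgrind's code; the name qadd16s is made up):

```c
#include <assert.h>
#include <stdint.h>

/* One lane of Iop_QAdd16Sx4 as described above: signed 16-bit addition
 * that saturates instead of wrapping. The IR opcode must behave like
 * this both on ARM (vqadd.s16) and on x86 (paddsw). */
static int16_t qadd16s(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* widen so nothing wraps */
    if (sum > INT16_MAX) return INT16_MAX;   /* clamp on overflow */
    if (sum < INT16_MIN) return INT16_MIN;   /* clamp on underflow */
    return (int16_t)sum;
}
```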

> > But keep in mind: there are a lot of vector opcodes, many more than scalar
> > ones. You can see the full list in the Valgrind sources
> > (VEX/pub/libvex_ir.h).
> 
> I think we could only approach this in a piecemeal way guided by
> performance bottlenecks when we find them.
> 

I'm not sure this will work. In my example the larger part of the speedup
comes from the fact that I could keep values in registers and did not need
to save and reload them for each vadd.i32 instruction. To do that in a
real-life application we need to support as large a fraction of its vector
instructions as possible. In short: the speedup does not come from faster
emulation of a single instruction but from the interaction between
sequential guest instructions.
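The effect can be illustrated with a toy model of two chained guest
instructions (vadd.i32 q0,q1,q2 followed by vadd.i32 q0,q0,q3); the Env
layout and function names here are made up for illustration, not QEMU's
actual CPUState:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy CPU state: guest vector registers q0..q3, four 32-bit lanes each. */
typedef struct { uint32_t q[4][4]; } Env;

/* Per-instruction emulation: every guest instruction loads its operands
 * from Env and stores its result back, so q0 round-trips through memory
 * between the two adds. */
static void add_via_memory(Env *env)
{
    for (int i = 0; i < 4; i++)
        env->q[0][i] = env->q[1][i] + env->q[2][i];   /* store to Env */
    for (int i = 0; i < 4; i++)
        env->q[0][i] = env->q[0][i] + env->q[3][i];   /* reload from Env */
}

/* With vector opcodes the intermediate value stays in host registers
 * (modelled by the local array) and Env is written once at the end. */
static void add_in_registers(Env *env)
{
    uint32_t tmp[4];
    for (int i = 0; i < 4; i++)
        tmp[i] = env->q[1][i] + env->q[2][i];
    for (int i = 0; i < 4; i++)
        tmp[i] += env->q[3][i];
    memcpy(env->q[0], tmp, sizeof tmp);               /* single store */
}
```

Both paths compute the same result; the difference is only in how often the
intermediate value touches memory.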

> > We can reduce the number of opcodes by making the vector element size a
> > constant argument instead of part of the opcode. But we would lose some
> > flexibility offered by the TARGET_HAS_opcode macro when a target supports
> > some sizes but not others. For example SSE has a vector minimum for sizes
> > i8x16, i16x8, i32x4 but does not have one for size i64x2.
> >
> > Some implementation details and concerns.
> >
> > The most problematic issue was the fact that with vector registers we have
> > one entity that can be accessed as both a global variable and a memory
> > location. I solved it by introducing the sync_temp opcode, which instructs
> > the register allocator to save a global variable to its memory location if
> > it is in a register. If the variable is not in a register, or memory is
> > already coherent, no store is issued, so the performance penalty is
> > minimal. Still, this approach has a serious drawback: we need to generate
> > sync_temp explicitly. But I do not know a better way to achieve
> > consistency.
> 
> I'm not sure I follow. I thought we only needed the memory access when
> the backend can't support the vector width operations so shouldn't have
> stuff in the vector registers?
> 

The target support for vector operations is not binary ("support all" or
"support none"). In most cases a target will support some large subset
while the remaining guest vector operations are emulated. In that case
we'll need to access guest vector registers as memory locations.

Scalar operations that are not supported as opcodes are very uncommon, so
a helper with a large performance overhead is a reasonable option there.
I'd like to avoid such heavy helpers for vector operations because
unsupported opcodes will be much more common.

Another reason is the transition from the existing code to vector opcodes.
During the transition we'll have a mix of old code (which accesses these
registers as memory) and new code (which accesses them as globals). Doing
the transition in one go is unrealistic.
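The coherence rule behind sync_temp (store back only when the in-register
copy is newer than memory) can be sketched as a toy model; VecGlobal and
its fields are made-up names, not QEMU's data structures:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of a guest vector register that exists both as a global
 * (possibly cached in a host register) and as a memory location. */
typedef struct {
    uint32_t mem[4];   /* canonical slot in the CPU state */
    uint32_t reg[4];   /* copy held in a host register */
    int in_reg;        /* currently allocated to a register? */
    int dirty;         /* register copy newer than memory? */
} VecGlobal;

/* sync_temp as described in the RFC: write back only when needed, so
 * the common case (memory already coherent) costs nothing. */
static void sync_temp(VecGlobal *g)
{
    if (g->in_reg && g->dirty) {
        memcpy(g->mem, g->reg, sizeof g->mem);
        g->dirty = 0;
    }
}
```

After sync_temp, old-style code that reads the register as a memory
location sees the up-to-date value.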

> > Note that as of this RFC I have not finished converting the ARM guest, so
> > mixing NEON with VFP code can cause a miscompile.
> >
> > The second problem is that a backend may or may not support vector
> > operations. We do not want each frontend to check this on every operation.
> > I created a wrapper that generates the vector opcode if it is supported,
> > or generates emulation code otherwise.
> >
> > For add_i32x4 the emulation code is generated inline. I tried to make it a
> > helper but got a very significant performance loss (5x slowdown). I'm not
> > sure about the cause, but I suspect that memory was a bottleneck and the
> > extra stores required by the calling convention mattered a lot.
> 
> So the generic helper was more API heavy than the existing NEON helpers?

The existing NEON implementation generates emulation code inline too. That
is how I found that my helper was slow.
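A minimal sketch of the wrapper mentioned in the quoted text, assuming a
hypothetical capability flag in the spirit of the TCG_TARGET_HAS_* macros
(the opcode names and the emitted-op buffer are made up for illustration):

```c
#include <assert.h>

enum { OP_ADD_I32X4, OP_ADD_I32 };

static int ops[8], nops;                /* toy "emitted opcode" stream */
static int tcg_target_has_add_i32x4;    /* hypothetical capability flag */

/* Frontend-facing wrapper: emit one vector opcode when the backend
 * supports it, otherwise expand inline to four scalar adds rather than
 * calling a helper (which the RFC measured as a 5x slowdown). */
static void gen_add_i32x4(void)
{
    if (tcg_target_has_add_i32x4) {
        ops[nops++] = OP_ADD_I32X4;     /* single vector opcode */
    } else {
        for (int i = 0; i < 4; i++)     /* inline scalar expansion */
            ops[nops++] = OP_ADD_I32;
    }
}
```

The frontend calls gen_add_i32x4() unconditionally; only the wrapper knows
whether the backend gets one vector opcode or four scalar ones.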


-- 
Kirill
