qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}


From: Mark Cave-Ayland
Subject: Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
Date: Tue, 29 Jan 2019 18:49:18 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0

On 27/01/2019 18:07, Richard Henderson wrote:

> On 1/27/19 9:45 AM, Mark Cave-Ayland wrote:
>>> I would expect the i < n/2 loop to be faster, because the assignments are
>>> unconditional.  FWIW.
>>
>> Do you have any idea as to how much faster? Is it something that would show
>> up as significant within the context of QEMU?
> 
> I don't have any numbers on that, no.
> 
>> As well as eliminating the HI_IDX/LO_IDX constants I do find the updated
>> version much easier to read, so I would prefer to keep it if possible.
>> What about unrolling the loop into 2 separate ones...
> 
> I doubt that would be helpful.
> 
> I would think that
> 
> #define VMRG_DO(name, access, ofs)
> ...
>     int i, half = ARRAY_SIZE(r->access(0)) / 2;
> ...
>     for (i = 0; i < half; i++) {
>         result.access(2 * i + 0) = a->access(i + ofs);
>         result.access(2 * i + 1) = b->access(i + ofs);
>     }
> 
> where OFS = 0 for HI and half for LO is best.  I find it quite readable, and 
> it
> avoids duplicating code between LO and HI as you're currently doing.

Attached is my test program which benchmarks the different approaches across
0x8000000 iterations and gives the following sample output here on an i7 when
compiled with gcc -O2:

$ ./mcatest
Benchmark 1 - existing merge high
Elapsed time: 1434735 us

Benchmark 2 - v3 merge high
Elapsed time: 2603553 us

Benchmark 3 - 2 loops merge high
Elapsed time: 2395434 us

Benchmark 4 - Richard's merge high
Elapsed time: 1318369 us


These indicate that the proposed v3 merge algorithm is nearly 50% slower than 
the
existing implementation - it wasn't something noticeable during emulation, but 
in a
benchmark situation the additional overhead is clearly visible.

TLDR: after playing around with the different approaches, Richard's proposed
algorithm is the fastest, and is actually slightly quicker than the current
implementation. Please go to the foyer after class where you can collect your 
prize :)

On this basis I'll redo v3 using Richard's algorithm and post a v4 later 
assuming
that it passes my local tests again.


ATB,

Mark.

Attachment: mcatest.c
Description: Text Data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]