Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC

On Wed, Feb 19, 2020 at 1:19 AM BALATON Zoltan <address@hidden> wrote:

While other targets take advantage of using host FPU to do floating
point computations, this was disabled for PPC target because always
clearing exception flags before every FP op made it slightly slower
than emulating everyting with softfloat. To emulate some FPSCR bits,
clearing of fp_status may be necessary (unless these could be handled
e.g. using FP exceptions on host but there's no API for that in QEMU
yet) but preserving at least the inexact flag makes hardfloat usable
and faster than softfloat. Since most clients don't actually care
about this flag, we can gain some speed trading some emulation
accuracy.

This patch implements a simple way to keep the inexact flag set for
hardfloat while still allowing to revert to softfloat for workloads
that need more accurate albeit slower emulation. (Set hardfloat
property of CPU, i.e. -cpu name,hardfloat=false for that.) There may
still be room for further improvement but this seems to increase
floating point performance. Unfortunately the softfloat case is slower
than before this patch so this patch only makes sense if the default
is also set to enable hardfloat.

Because of the above this patch at the moment is mainly for testing
different workloads to evaluate how viable would this be in practice.
Thus, RFC and not ready for merge yet.

Signed-off-by: BALATON Zoltan <address@hidden>
---
v2: use different approach to avoid needing if () in
helper_reset_fpstatus() but this does not seem to change overhead
much, also make it a single patch as adding the hardfloat option is
only a few lines; with this we can use same value at other places where
float_status is reset and maybe enable hardfloat for a few more places
for a little more performance but not too much. With this I got:

lame: 3:13, lame_vmx: 1:55 (this is probably within jitter though and
still far from the results on real hardware) also tried mplayer test
and got results between 144-146s (this test is more VMX bound).

I've also done some profiling for hardfloat=true and false cases with
this patch to see what are the hot functions. Results are:

Overhead Command Symbol
-cpu G4,hardfloat=false, lame:
9.82% qemu-system-ppc [.] round_canonical
8.35% qemu-system-ppc [.] soft_f64_muladd
7.16% qemu-system-ppc [.] soft_f64_addsub
5.27% qemu-system-ppc [.] float32_to_float64
5.20% qemu-system-ppc [.] helper_compute_fprf_float64
4.61% qemu-system-ppc [.] helper_frsp
4.59% qemu-system-ppc [.] soft_f64_mul
4.01% qemu-system-ppc [.] float_to_float.isra.26
3.84% qemu-system-ppc [.] float64_classify
2.97% qemu-system-ppc [.] do_float_check_status

-cpu G4,hardfloat=false, lame_vmx:
Overhead Command Symbol
10.04% qemu-system-ppc [.] float32_muladd
9.49% qemu-system-ppc [.] helper_vperm
6.10% qemu-system-ppc [.] round_canonical
4.13% qemu-system-ppc [.] soft_f64_addsub
3.23% qemu-system-ppc [.] helper_frsp
3.13% qemu-system-ppc [.] soft_f64_muladd
2.88% qemu-system-ppc [.] helper_vmaddfp
2.69% qemu-system-ppc [.] float32_add
2.60% qemu-system-ppc [.] float32_to_float64
2.52% qemu-system-ppc [.] helper_compute_fprf_float64

-cpu G4,hardfloat=true, lame:
11.59% qemu-system-ppc [.] round_canonical
6.18% qemu-system-ppc [.] helper_compute_fprf_float64
6.01% qemu-system-ppc [.] float32_to_float64
4.58% qemu-system-ppc [.] float64_classify
3.87% qemu-system-ppc [.] helper_frsp
3.75% qemu-system-ppc [.] float_to_float.isra.26
3.48% qemu-system-ppc [.] helper_todouble
3.31% qemu-system-ppc [.] float64_muladd
3.21% qemu-system-ppc [.] do_float_check_status
3.01% qemu-system-ppc [.] float64_mul

-cpu G4,hardfloat=true, lame_vmx:
9.34% qemu-system-ppc [.] float32_muladd
8.83% qemu-system-ppc [.] helper_vperm
5.41% qemu-system-ppc [.] round_canonical
4.51% qemu-system-ppc [.] page_collection_lock
3.58% qemu-system-ppc [.] page_trylock_add.isra.17
2.71% qemu-system-ppc [.] helper_vmaddfp
2.53% qemu-system-ppc [.] float32_add
2.30% qemu-system-ppc [.] helper_compute_fprf_float64
2.21% qemu-system-ppc [.] float32_to_float64
2.06% qemu-system-ppc [.] helper_frsp

round_canonical seems to come up frequently in this with large overhead.

Could those with better test cases or benchmarks give it a test please
on different CPUs to see what else this would break?

---
fpu/softfloat.c | 14 +++++++-------
target/ppc/cpu.h | 2 ++
target/ppc/fpu_helper.c | 32 ++++++++++++++++----------------
target/ppc/translate_init.inc.c | 3 +++
4 files changed, 28 insertions(+), 23 deletions(-)

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 301ce3b537..6d3f4af72a 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -216,15 +216,15 @@ GEN_INPUT_FLUSH3(float64_input_flush3, float64)
#endif

/*
- * Some targets clear the FP flags before most FP operations. This prevents
- * the use of hardfloat, since hardfloat relies on the inexact flag being
- * already set.
+ * Disable hardfloat for known problem cases.
+ * Additionally, some targets clear the FP flags before most FP operations.
+ * This prevents the use of hardfloat, since it relies on the inexact flag
+ * being already set and clearing it often may result in slower computations.
+ * Those targets could also be listed here.
*/
-#if defined(TARGET_PPC) || defined(__FAST_MATH__)
-# if defined(__FAST_MATH__)
-# warning disabling hardfloat due to -ffast-math: hardfloat requires an exact \
+#if defined(__FAST_MATH__)
+# warning disabling hardfloat due to -ffast-math: hardfloat requires an exact \
IEEE implementation
-# endif
# define QEMU_NO_HARDFLOAT 1
# define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN
#else
diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index b283042515..5f412f9fba 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -1033,6 +1033,7 @@ struct CPUPPCState {
float_status vec_status;
float_status fp_status; /* Floating point execution context */
target_ulong fpscr; /* Floating point status and control register */
+ int default_fp_excpt_flags;

/* Internal devices resources */
ppc_tb_t *tb_env; /* Time base and decrementer */
@@ -1163,6 +1164,7 @@ struct PowerPCCPU {
void *machine_data;
int32_t node_id; /* NUMA node this CPU belongs to */
PPCHash64Options *hash64_opts;
+ bool hardfloat; /* use hardfloat (this breaks FPSCR[FI] bit emulation) */

/* Those resources are used only during code translation */
/* opcode handlers */
diff --git a/target/ppc/fpu_helper.c b/target/ppc/fpu_helper.c
index ae43b08eb5..bbbd1cb987 100644
--- a/target/ppc/fpu_helper.c
+++ b/target/ppc/fpu_helper.c
@@ -659,7 +659,7 @@ void helper_float_check_status(CPUPPCState *env)

void helper_reset_fpstatus(CPUPPCState *env)
{
- set_float_exception_flags(0, &env->fp_status);
+ set_float_exception_flags(env->default_fp_excpt_flags, &env->fp_status);
}

static void float_invalid_op_addsub(CPUPPCState *env, bool set_fpcc,
@@ -1823,7 +1823,7 @@ void helper_##name(CPUPPCState *env, ppc_vsr_t *xt, \
\
for (i = 0; i < nels; i++) { \
float_status tstat = env->fp_status; \
- set_float_exception_flags(0, &tstat); \
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat); \
t.fld = tp##_##op(xa->fld, xb->fld, &tstat); \
env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
\
@@ -1867,7 +1867,7 @@ void helper_xsaddqp(CPUPPCState *env, uint32_t opcode,
tstat.float_rounding_mode = float_round_to_odd;
}

- set_float_exception_flags(0, &tstat);
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
t.f128 = float128_add(xa->f128, xb->f128, &tstat);
env->fp_status.float_exception_flags |= tstat.float_exception_flags;

@@ -1902,7 +1902,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, \
\
for (i = 0; i < nels; i++) { \
float_status tstat = env->fp_status; \
- set_float_exception_flags(0, &tstat); \
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat); \
t.fld = tp##_mul(xa->fld, xb->fld, &tstat); \
env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
\
@@ -1942,7 +1942,7 @@ void helper_xsmulqp(CPUPPCState *env, uint32_t opcode,
tstat.float_rounding_mode = float_round_to_odd;
}

- set_float_exception_flags(0, &tstat);
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
t.f128 = float128_mul(xa->f128, xb->f128, &tstat);
env->fp_status.float_exception_flags |= tstat.float_exception_flags;

@@ -1976,7 +1976,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, \
\
for (i = 0; i < nels; i++) { \
float_status tstat = env->fp_status; \
- set_float_exception_flags(0, &tstat); \
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat); \
t.fld = tp##_div(xa->fld, xb->fld, &tstat); \
env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
\
@@ -2019,7 +2019,7 @@ void helper_xsdivqp(CPUPPCState *env, uint32_t opcode,
tstat.float_rounding_mode = float_round_to_odd;
}

- set_float_exception_flags(0, &tstat);
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
t.f128 = float128_div(xa->f128, xb->f128, &tstat);
env->fp_status.float_exception_flags |= tstat.float_exception_flags;

@@ -2095,7 +2095,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, ppc_vsr_t *xb) \
\
for (i = 0; i < nels; i++) { \
float_status tstat = env->fp_status; \
- set_float_exception_flags(0, &tstat); \
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat); \
t.fld = tp##_sqrt(xb->fld, &tstat); \
env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
\
@@ -2143,7 +2143,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, ppc_vsr_t *xb) \
\
for (i = 0; i < nels; i++) { \
float_status tstat = env->fp_status; \
- set_float_exception_flags(0, &tstat); \
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat); \
t.fld = tp##_sqrt(xb->fld, &tstat); \
t.fld = tp##_div(tp##_one, t.fld, &tstat); \
env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
@@ -2305,7 +2305,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, \
\
for (i = 0; i < nels; i++) { \
float_status tstat = env->fp_status; \
- set_float_exception_flags(0, &tstat); \
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat); \
if (r2sp && (tstat.float_rounding_mode == float_round_nearest_even)) {\
/* \
* Avoid double rounding errors by rounding the intermediate \
@@ -2886,7 +2886,7 @@ uint64_t helper_xscvdpspn(CPUPPCState *env, uint64_t xb)
uint64_t result, sign, exp, frac;

float_status tstat = env->fp_status;
- set_float_exception_flags(0, &tstat);
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat);

sign = extract64(xb, 63, 1);
exp = extract64(xb, 52, 11);
@@ -2924,7 +2924,7 @@ uint64_t helper_xscvdpspn(CPUPPCState *env, uint64_t xb)
uint64_t helper_xscvspdpn(CPUPPCState *env, uint64_t xb)
{
float_status tstat = env->fp_status;
- set_float_exception_flags(0, &tstat);
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat);

return float32_to_float64(xb >> 32, &tstat);
}
@@ -3327,7 +3327,7 @@ void helper_xsrqpi(CPUPPCState *env, uint32_t opcode,
}

tstat = env->fp_status;
- set_float_exception_flags(0, &tstat);
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
set_float_rounding_mode(rmode, &tstat);
t.f128 = float128_round_to_int(xb->f128, &tstat);
env->fp_status.float_exception_flags |= tstat.float_exception_flags;
@@ -3384,7 +3384,7 @@ void helper_xsrqpxp(CPUPPCState *env, uint32_t opcode,
}

tstat = env->fp_status;
- set_float_exception_flags(0, &tstat);
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
set_float_rounding_mode(rmode, &tstat);
round_res = float128_to_floatx80(xb->f128, &tstat);
t.f128 = floatx80_to_float128(round_res, &tstat);
@@ -3415,7 +3415,7 @@ void helper_xssqrtqp(CPUPPCState *env, uint32_t opcode,
tstat.float_rounding_mode = float_round_to_odd;
}

- set_float_exception_flags(0, &tstat);
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
t.f128 = float128_sqrt(xb->f128, &tstat);
env->fp_status.float_exception_flags |= tstat.float_exception_flags;

@@ -3449,7 +3449,7 @@ void helper_xssubqp(CPUPPCState *env, uint32_t opcode,
tstat.float_rounding_mode = float_round_to_odd;
}

- set_float_exception_flags(0, &tstat);
+ set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
t.f128 = float128_sub(xa->f128, xb->f128, &tstat);
env->fp_status.float_exception_flags |= tstat.float_exception_flags;

diff --git a/target/ppc/translate_init.inc.c b/target/ppc/translate_init.inc.c
index 53995f62ea..ab1a6db4f1 100644
--- a/target/ppc/translate_init.inc.c
+++ b/target/ppc/translate_init.inc.c
@@ -10736,6 +10736,8 @@ static void ppc_cpu_reset(CPUState *s)
/* tininess for underflow is detected before rounding */
set_float_detect_tininess(float_tininess_before_rounding,
&env->fp_status);
+ /* hardfloat needs inexact flag already set */
+ env->default_fp_excpt_flags = (cpu->hardfloat ? float_flag_inexact : 0);

for (i = 0; i < ARRAY_SIZE(env->spr_cb); i++) {
ppc_spr_t *spr = &env->spr_cb[i];
@@ -10868,6 +10870,7 @@ static Property ppc_cpu_properties[] = {
false),
DEFINE_PROP_BOOL("pre-3.0-migration", PowerPCCPU, pre_3_0_migration,
false),
+ DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true),
DEFINE_PROP_END_OF_LIST(),
};

--
2.21.1

From:	Yonggang Luo
Subject:	Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
Date:	Fri, 10 Apr 2020 21:50:13 +0800