lightning
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] ppc: Fix 'calli' when floating-point arguments are passed


From: Paulo César Pereira de Andrade
Subject: Re: [PATCH] ppc: Fix 'calli' when floating-point arguments are passed
Date: Sat, 10 Sep 2022 07:12:51 -0300

Em sáb., 10 de set. de 2022 às 03:53, Paul Cercueil
<paul@crapouillou.net> escreveu:
>
> Hi Paulo,

  Hi Paul,

> >>  >>  > >>  > [snip]
> >>  >>  > >>  >
> >>  >>  > >>  > >>  The very last break happened as it was calling
> >>  >> jit_new_state()
> >>  >>  > >>  > >> again,
> >>  >>  > >>  > >>  so I suppose the memory address that was being
> >> watched
> >>  >> was
> >>  >>  > >> being
> >>  >>  > >>  > >> reused
> >>  >>  > >>  > >>  and I stopped the debugging session there.
> >>  >>  > >>  > >
> >>  >>  > >>  > >   Please just do call jit_print() after finishing
> >>  >>  > >> split_branches, for
> >>  >>  > >>  > > example:
> >>  >>  > >>  > >
> >>  >>  > >>  > > Breakpoint 1, _split_branches (_jit=0x10111270) at
> >>  >>  > >> lightning.c:2713
> >>  >>  > >>  > > 2713        for (node = _jitc->head; node; node =
> >> next) {
> >>  >>  > >>  > > (gdb) call _jit_print(_jit)
> >>  >>  > >>  > > L0: /* prolog */
> >>  >>  > >>  > >     #name
> >>  >>  > >>  > >     #note 395
> >>  >>  > >>  > > ...
> >>  >>  > >>  > >
> >>  >>  > >>  > > and let me know the output. The likely explanation
> >> for it
> >>  >> not
> >>  >>  > >> being
> >>  >>  > >>  > > set,
> >>  >>  > >>  > > assuming lightning code is not modified, is that
> >> there is
> >>  >> some
> >>  >>  > >> extra
> >>  >>  > >>  > > code changing the value of r26 after it is used as a
> >>  >> temporary.
> >>  >>  > >>  > > Something
> >>  >>  > >>  > > like in the range from "L2:" to the "stxi_i 0x3e7c r29
> >>  >> r26"
> >>  >>  > >> below,
> >>  >>  > >>  > > there is
> >>  >>  > >>  > > some instruction that changes r26, and causes it to
> >>  >> assume it
> >>  >>  > >> is dead
> >>  >>  > >>  > > in
> >>  >>  > >>  > > that range, and usable as a temporary.
> >>  >>  > >>  >
> >>  >>  > >>  > I'll do better, I'll give you an example that
> >> reproduces the
> >>  >>  > >> issue :)
> >>  >>  > >>  > See the attached example.
> >>  >>  > >>  >
> >>  >>  > >>  > Your idea that something must be changing r26 in that
> >> slice
> >>  >> of
> >>  >>  > >> code
> >>  >>  > >>  > pointed me in the right direction.
> >>  >>  > >>  > I only changed one line in the example: there is now a
> >>  >>  > >> jit_movi(r26, 0)
> >>  >>  > >>  > line 58.
> >>  >>  > >>  >
> >>  >>  > >>  > I assume Lightning detects that r26 is already 0 anyway
> >>  >> (because
> >>  >>  > >> of the
> >>  >>  > >>  > beqi just before) so it removes this instruction,
> >> causing
> >>  >> havoc
> >>  >>  > >> at the
> >>  >>  > >>  > same time.
> >>  >>  > >>
> >>  >>  > >>    Exactly, the instruction is being removed, and the
> >> optimizer
> >>  >>  > >> getting
> >>  >>  > >>  confused.
> >>  >>  > >>    This should not happen as the optimizations done are
> >>  >> supposed to
> >>  >>  > >>  be the fool proof ones, but somehow this pattern
> >> triggered a
> >>  >> bug.
> >>  >>  > >>    I will try to understand it and have a patch later
> >> today.
> >>  >>  > >
> >>  >>  > >   Just in case, this should work as a temporary workaround:
> >>  >>  > >
> >>  >>  > > """
> >>  >>  > > diff --git a/lib/lightning.c b/lib/lightning.c
> >>  >>  > > index d219e6d..c4f2e91 100644
> >>  >>  > > --- a/lib/lightning.c
> >>  >>  > > +++ b/lib/lightning.c
> >>  >>  > > @@ -1702,12 +1702,14 @@ _jit_optimize(jit_state_t *_jit)
> >>  >>  > >             case jit_code_epilog:
> >>  >>  > >                 _jitc->function = NULL;
> >>  >>  > >                 break;
> >>  >>  > > +#if 0
> >>  >>  > >             case jit_code_beqi:
> >>  >>  > >                 redundant_store(node, 1);
> >>  >>  > >                 break;
> >>  >>  > >             case jit_code_bnei:
> >>  >>  > >                 redundant_store(node, 0);
> >>  >>  > >                 break;
> >>  >>  > > +#endif
> >>  >>  > >             default:
> >>  >>  > >  #if JIT_HASH_CONSTS
> >>  >>  > >                 if (mask & jit_cc_a0_flt) {
> >>  >>  > > """
> >>  >>  >
> >>  >>  > I simply commented the 'redunant_store()' calls as I didn't
> >> know
> >>  >> if it
> >>  >>  > was OK for the jit_code_beqi/jit_code_bnei cases to fall in
> >> the
> >>  >>  > "default" case.
> >>  >>
> >>  >>    It should go to the default case, because it might need to
> >>  >> allocate
> >>  >>  a constant to be loaded, or if in a function body, need to know
> >> the
> >>  >>  register is modified, what only matters if it is callee save;
> >> this
> >>  >> was
> >>  >>  a second problem in this code path; if it did not remove the
> >> movi
> >>  >>  in redundant_store() it should still go to the default case, or,
> >>  >> could
> >>  >>  not detect a callee save register is modified (likely it was
> >> already
> >>  >>  noted as modified, but maybe in some very special scenario it
> >>  >>  could not notice that...).
> >>  >>
> >>  >>  > Looks like the workaround works, thanks for the suggestion.
> >>  >>
> >>  >>    The problem was that it basically did add an "implicit" note
> >> note
> >>  >> that
> >>  >>  the register was dead in the range from L2 to  the
> >>  >> "jit_movi(JIT_V1, 0)",
> >>  >>  so, it was safe to use it as temporary, but when removing the
> >>  >> jit_movi
> >>  >>  in another optimization pass, it broke the logic.
> >>  >>
> >>  >>  > >   I will try to work on a patch, to have the live register
> >>  >> computation
> >>  >>  > > work as if the movi was still done, or just remove and no
> >>  >> longer use
> >>  >>  > > redundant_store, as this movi removal is causing the
> >> problem.
> >>  >>  > >
> >>  >>  > > """
> >>  >>  > >         case jit_code_movi:
> >>  >>  > >         if (regno == jit_regno(iter->u.w)) {
> >>  >>  > >             if (iter->flag || iter->v.w != word)
> >>  >>  > >             return;
> >>  >>  > >             del_node(prev, iter);
> >>  >>  > >             iter = prev;
> >>  >>  > >         }
> >>  >>  > >         break;
> >>  >>  > > """
> >>  >>  > >
> >>  >>  > >   The problem is that the instruction is removed after the
> >>  >> computation
> >>  >>  > > of the initial state at the start of the block.
> >>  >
> >>  >   I just pushed a commit with the easiest and I believe safest
> >>  > approach to
> >>  > keep the optimization and prevent the misuse as a temporary. It
> >> just
> >>  > turns the jit_movi into a jit_live.
> >>  >   This works because the jit_movi() could be seen as the non
> >> existing
> >>  > jit_dead()* call. Because it was not used without modification up
> >> to
> >>  > that
> >>  > removed jit_movi, the start of the block did assume it was dead at
> >>  > block entry. It would be too complicated to fully understand the
> >>  > condition
> >>  > and (not too much) costly to rebuild the initial state. Still, the
> >>  > only case
> >>  > this approach is bad is if the movi would end up in a dead code
> >> path,
> >>  > preventing it from being used as a cheap temporary, or, if run
> >> out of
> >>  > temporaries, need a spill/reload.
> >>
> >>  Sadly it does cause more issues than it solves.
> >>  Attached is a piece of code that I generate, with the *workaround*
> >> in
> >>  place (not your fix).
> >
> >   Please confirm the workaround works as expected.
>
> Yes, everything works fine with the workaround in place.

  The "workaround" removes the somewhat dangerous optimization where
it follows a value in another basic block, and deletes a redundant reload.

> >>  With the fix in place, the CMPWI and BEQ lines 157-158 are missing.

  There is no static analysis code to handle a case like, if knowing the
value, convert a beq in either an unconditional jump, or remove it if
it would be always false. Likely it was a side effect of other removal
or transformation.

> >>  Besides, I can't give you a jit_print() trace of this code with the
> >> fix
> >>  in place, because it segfaults:
> >
> >   Ok. I was thinking a bit about it later, and it would only work for
> > the
> > reproducer. It was still buggy. Should either recompute the initial
> > state at start of the block or leave the redundant register
> > rematerialization,
> > to the known value, in place.
> >
> >>  #0  0x3f390ee0 in strlen () from
> >>
> >> /home/paul/dev/toolchains/powerpc/powerpc-buildroot-linux-gnu/sysroot/lib/libc.so.6
> >>  #1  0x3f368d64 in fputs () from
> >>
> >> /home/paul/dev/toolchains/powerpc/powerpc-buildroot-linux-gnu/sysroot/lib/libc.so.6
> >>  #2  0x3f5a6248 in _jit_print_node (_jit=_jit@entry=0x402dcbb0,
> >>  node=node@entry=0x40312130) at jit_print.c:178
> >>  #3  0x3f5a700c in _jit_print (_jit=_jit@entry=0x402dcbb0) at
> >>  jit_print.c:73

  This was one of the reasons it was still buggy, it kept in the loop removing
other redundant loads.

   These optimizations were done looking at assembly generation in
my  use. But my usage would just map virtual registers to hardware
registers, and would either have global pointers in registers, or have
very uncommon use of registers live for more than one basic block.

  These optimizations would work perfectly if there was at least one
register reserved specifically as assembly temporary, and do spill/reload
for some other usages if there was only one assembly temporary.
This would be ok in architectures with a rich set of registers.

  The problem is that the removal of the instruction doing the redundant
reload to the same value did confuse the code that follows code
generation, and caused it to think the register was dead, as it would
be loaded shortly after, but that was not being done anymore.

  The proper fix should be to pretend the removed instruction was
never added, know the code was modified and redo the jit_setup()
and jit_follow() steps. This way the optimization is not removed,
and also removes the danger of some yet unknown similar pattern
in the other removals in simplify().
  Basically, would be like our attempts to create a reproducer, we
just started with a state where the instruction would not be removed,
and this way, the bug would not happen.

> >>  > * could add a jit_dead() instruction, this could be useful when
> >> doing
> >>  > a
> >>  > jump to a non constant address or address not in jit code. This
> >> would
> >>  > be used for example, when landing in a code that reloads the
> >> called
> >>  > save register value from the stack, and is a hint that it can be
> >> used
> >>  > safely as temporary from the jit_dead() to the jump. For the
> >> moment,
> >>  > just something to think about for the near future.
> >>
> >>  That sounds like a good idea. I manage my own register allocation
> >> so I
> >>  can tell Lightning exactly when a register is dead.
> >
> >   To fully work it would need an extra regset mask, otherwise it could
> > cause bugs in unexpected conditions.  But it would not be too costly,
> > as it would not be an extra regset per block, but only an extra one in
> > the _jit->comp structure.
>
> Although I believe we agree that Lightning should generate working code
> even if jit_dead() is not used, correct? And jit_dead() would only
> permit better code generation.

  Sure, it would be just an optimization. It might not be a good idea because
it would allow bad code generation if used incorrectly. It would only make
sense for a inverse use of what you do with jit_live(_R10), for example, if
100% certain JIT_V2 would be reloaded, could add it before the jumpi
to the address in jit generated by another context. Other usages would
probably be redundant, as lightning already understands when a register
value is dead.

> Cheers,
> -Paul
>

Thanks,
Paulo



reply via email to

[Prev in Thread] Current Thread [Next in Thread]