Re: GNU Lightning 2.2.1 release


From: Paulo César Pereira de Andrade
Subject: Re: GNU Lightning 2.2.1 release
Date: Sat, 18 Feb 2023 13:24:23 -0300

On Sat, 18 Feb 2023 at 11:40, Paul Cercueil
<paul@crapouillou.net> wrote:
>
> On Saturday, 18 February 2023 at 11:07 -0300, Paulo César Pereira de Andrade
> wrote:
> > On Sat, 18 Feb 2023 at 09:29, Paul Cercueil
> > <paul@crapouillou.net> wrote:
> > >
> > > Hi Paulo,
> >
> >   Hi Paul,
> >
> > > On Friday, 17 February 2023 at 16:23 -0300, Paulo César Pereira de
> > > Andrade wrote:
> > > > GNU lightning 2.2.1 released!
> > > >
> > > > GNU lightning is a library to aid in making portable programs
> > > > that compile assembly code at run time.
> > > >
> > > > Development:
> > > > http://git.savannah.gnu.org/cgit/lightning.git
> > > >
> > > > Download release:
> > > > ftp://ftp.gnu.org/gnu/lightning/lightning-2.2.1.tar.gz
> > > >
> > > >   GNU Lightning 2.2.1 main new features:
> > > >
> > > > o Variable stack framesize implemented for aarch64, arm, i686, mips,
> > > >   riscv, loongarch and x86_64. This means function calls use only
> > > >   the minimum required stack space for prolog and epilog.
> > > > o Optimization of prolog and epilog to not create a frame pointer if
> > > >   not required, and not even save and restore the stack pointer if
> > > >   not required on a leaf function. These features are implemented
> > > >   for the ports with variable stack framesize.
> > > > o New clor, clzr, ctor and ctzr instructions, that count
> > > >   leading/trailing zeros/ones. These use a hardware implementation
> > > >   when available, otherwise fall back to a software implementation.
> > >
> > > That's great. I actually had an alpha version of a patch that added
> > > clzr but never finished it.
> > >
> > > I think you could add an extra one, clsr, "count leading sign bits".
> > > The fallback should be very easy:
> > >
> > > jit_rshi(rn(tmp), r1, __WORDSIZE - 1);
> > > jit_xorr(rn(tmp), r1, rn(tmp));
> > > jit_clzr(r0, rn(tmp));
> >
> >   Yes. The fallback is simple. If I recall correctly, only arm64 has
> > it in hardware:
> >
> > https://developer.arm.com/documentation/dui0801/h/A64-General-Instructions/CLS
> >
> >   I used it in the first version of clor for aarch64 when
> > experimenting with the instruction, but it required a branch, so I
> > changed it to just invert the bits and use clz:
> > https://git.savannah.gnu.org/cgit/lightning.git/commit/?id=561eed91500f2a31ed9d4305c91940e742613ba8
> >
> > > Maybe adapted to only return the number of sign bits after the MSB,
> > > to match GCC's __builtin_clrsb(), if it makes more sense.
> > >
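
  The clsr fallback quoted above, and the __builtin_clrsb() variant,
can be sketched in plain C as follows. This is only an illustration
under stated assumptions: a fixed 64 bit word instead of __WORDSIZE, a
simple loop standing in for clz (returning 64 for zero), and
hypothetical names (clz64_sketch, cls64_sketch, clrsb64_sketch) that
are not part of Lightning. The sign mask is built with unsigned
arithmetic so the sketch does not rely on how C shifts negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: count leading zeros of a 64 bit word.
 * Returns 64 for zero. A plain loop, chosen for clarity, not speed. */
static int clz64_sketch(uint64_t x)
{
    int n = 0;
    while (n < 64 && !(x & (UINT64_C(1) << 63))) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Leading bits equal to the sign bit, sign bit included (1..64).
 * Mirrors the three-step fallback: build the sign mask, xor, clz. */
static int cls64_sketch(int64_t v)
{
    uint64_t x = (uint64_t)v;
    uint64_t sign = (uint64_t)0 - (x >> 63); /* all ones if negative */
    return clz64_sketch(x ^ sign);
}

/* Redundant sign bits only, i.e. not counting the MSB itself. */
static int clrsb64_sketch(int64_t v)
{
    return cls64_sketch(v) - 1;
}

/* Small self-check of the two conventions. */
static void cls64_self_check(void)
{
    assert(cls64_sketch(0) == 64);
    assert(cls64_sketch(-1) == 64);
    assert(cls64_sketch(1) == 63);
    assert(cls64_sketch(-2) == 63);
    assert(cls64_sketch(INT64_MAX) == 1);
    assert(clrsb64_sketch(0) == 63);
}
```

  The only difference between the two conventions is the final
subtraction of one.
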
> > > Speaking about fallbacks, the ones in place look very inefficient
> > > (e.g. the bit-swap to count trailing bits). I'm sure there are
> > > better algorithms; I'll have a look.
> >
> >   It is not even in jit_fallback.c. It is a version with neither
> > lookup tables nor branches. I think the libgcc variants use lookup
> > tables. This is something to optimize.
>
> My point was that there are better ways to count trailing bits than
> bit-swapping.

  Sure. I just wanted to have it working, not fully optimized in
the first version :) Optimized versions should use a lookup table
or some "magic" with float/double.
  There is also the comment in check/bit.c that says that, if the
fallback is used, it would be better to implement it as a function;
it then just implements the fallbacks as jit functions.
  Using check/bit.tst is a good way to experiment with different
versions before converting them to C code. Just change the "#if 0"
to "#if 1", rewrite clo, clz, cto and ctz as appropriate, and
check the output to validate that it is correct.
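
  As an illustration of the lookup table idea, here is a minimal C
sketch of a count-trailing-zeros that avoids the bit swap: it isolates
the lowest set bit and maps it to an index with a multiply and a shift
(the widely used 32 bit de Bruijn constant 0x077CB531). It is not the
code in Lightning or libgcc. The names are hypothetical, the word is
fixed at 32 bits, the table is built at run time instead of being
typed by hand, and returning 32 for a zero input is an assumption of
the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define DEBRUIJN32 UINT32_C(0x077CB531)

static uint8_t ctz_table[32];   /* hypothetical table, filled lazily */
static int ctz_table_ready;

/* Fill the table: each single-bit value hashes to a distinct slot. */
static void ctz_table_init(void)
{
    for (unsigned i = 0; i < 32; i++) {
        uint32_t bit = UINT32_C(1) << i;
        ctz_table[(uint32_t)(bit * DEBRUIJN32) >> 27] = (uint8_t)i;
    }
    ctz_table_ready = 1;
}

/* Count trailing zeros; returns 32 for zero (assumed convention). */
static int ctz32_sketch(uint32_t x)
{
    if (!ctz_table_ready)
        ctz_table_init();
    if (x == 0)
        return 32;
    uint32_t lowest = x & (UINT32_C(0) - x);    /* isolate lowest set bit */
    return ctz_table[(uint32_t)(lowest * DEBRUIJN32) >> 27];
}

/* Self-check: every single-bit input must map back to its position. */
static void ctz32_self_check(void)
{
    for (unsigned i = 0; i < 32; i++)
        assert(ctz32_sketch(UINT32_C(1) << i) == (int)i);
    assert(ctz32_sketch(0) == 32);
    assert(ctz32_sketch(UINT32_C(0x50)) == 4);
}
```

  Unlike the current fallback, this version has a branch for the zero
case; cto could be obtained the same way by inverting the input first.
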

> >   It is also a good extension for extra Lightning instructions. At
> > least aarch64 and loongarch have a bit swap/invert instruction:
> > https://developer.arm.com/documentation/dui0801/h/A64-General-Instructions/RBIT
> > https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#_bitrev_wd
> >
> > > Also, you added SLL opcodes to "sign extend top 32 bits" on MIPS,
> > > but you do that if (__WORDSIZE == 32). What "top 32 bits" are we
> > > talking about there?
> >
> >   It is a SLL(r0, r1, 0) that is supposed to sign extend the value. I
> > do not have access to any mips release 6 system, so I did not test
> > the mips6_p() code variant.
>
> I tested MIPSr6 a few months ago and it didn't go very well: some
> instructions that Lightning emits changed (for instance, the LO/HI
> registers are gone, and all opcodes touching those changed).

  Did you test on real hardware or in qemu?

  I might set up a qemu environment, but it would be far better to
test on real hardware. Qemu mips emulation was way too slow the last
time I tested it...

> > The documentation I used (MD00087-2B-MIPS64BIS-AFP-6.06.pdf) says:
> >
> > """
> > Format: CLO rd, rs                                 MIPS32
> > Purpose: Count Leading Ones in Word
> > To count the number of leading ones in a word.
> > ...
> > Restrictions:
> > Pre-Release 6: To be compliant with the MIPS32 and MIPS64
> > Architecture, software must place the same GPR number in both the
> > rt and rd fields of the instruction. The operation of the
> > instruction is UNPREDICTABLE if the rt and rd fields of the
> > instruction contain different values. Release 6’s new instruction
> > encoding does not contain an rt field.
> >
> > If GPR rs does not contain a sign-extended 32-bit value (bits 63..31
> > equal), then the results of the operation are UNPREDICTABLE.
> > """
>
> Yes, but in the case where __WORDSIZE == 32, bits 63..32 do not exist.
> Therefore the sign-extension does nothing.

  The common case is a 32 bit OS on a 64 bit cpu. This is also how it
was tested. If a "true" 32 bit cpu can be detected, a jit_cpu_t flag
could be added to record it, and the sign extension omitted.
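
  To make the restriction concrete, here is a small C model of what
the sign extension is supposed to do to a 64 bit register, and of the
"bits 63..31 equal" condition from the manual excerpt. It only models
the arithmetic; the function names are hypothetical, and it says
nothing about how a "true" 32 bit cpu would be detected.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 64 bit register after sign extending its low 32 bits,
 * which is what SLL(r0, r1, 0) is supposed to do per the thread. */
static uint64_t sext32_sketch(uint64_t reg)
{
    uint64_t low = reg & UINT64_C(0xFFFFFFFF);
    /* xor/subtract trick: copies bit 31 into bits 63..32 */
    return (low ^ UINT64_C(0x80000000)) - UINT64_C(0x80000000);
}

/* True if bits 63..31 are all equal, i.e. the register already holds
 * a sign-extended 32 bit value and CLO/CLZ would be well defined. */
static int is_sext32_sketch(uint64_t reg)
{
    return reg == sext32_sketch(reg);
}

/* Self-check with a few representative register contents. */
static void sext32_self_check(void)
{
    assert(sext32_sketch(UINT64_C(0xFFFFFFFF)) == UINT64_MAX);
    assert(sext32_sketch(UINT64_C(0x7FFFFFFF)) == UINT64_C(0x7FFFFFFF));
    assert(is_sext32_sketch(UINT64_C(0xFFFFFFFF80000000)));
    assert(!is_sext32_sketch(UINT64_C(0x0000000080000000)));
}
```

  On a cpu whose registers are only 32 bits wide there are no upper
bits to fix, which is the case where the extra instruction could be
skipped.
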

> >   I made the Lightning 2.2.1 release to get several bug fixes
> > published, but I hope to add extra bit manipulation instructions.
> > At least:
> >
> > o bit invert
> > o popcount
> > o bit rotate
> >
> >   But there are several others that are useful, like ways to create
> > bit patterns for any kind of masks. These could at least be used
> > internally to create constants with repeated patterns.
> >
> >   If you have other suggestions for new instructions, please let me
> > know :)
>
> Honestly, apart from the "CLS" mentioned before and maybe popcount, I
> wouldn't have any use for these - in my particular use case anyway.
>
> I would maybe benefit from having "mask extract" and "mask insert"
> functions similar to EXT/INS on MIPS.
>
> But in general I like that Lightning is very RISC-like and I would
> avoid making it more complex by adding instructions that would almost
> never be used.
>
> >   One such instruction could be "multiply and add", available in
> > several cpus.
> >
> >   In the long term, int128 and complex float/double could be added.
> > I would like to have them, but implementing them in all ports is not
> > trivial, and would require the concept of register pairs, currently
> > only barely used for qdiv/qmul, and only to hold the result pair,
> > not as input.
> >
> >   Maybe a way to inject machine code could also be added, just a
> > memcpy of a buffer. This would allow optimizations where lightning
> > does not generate good code: experiment with an assembler, then,
> > when happy with the code, inject it in the jit code.
>
> One thing somewhat related that would be very useful to me, is
> patchable jumps after code generation.
>
> Basically, if you emit:
>
> lbl = jit_jmpi();
> jit_patch_abs(lbl, my_fn);
>
> ...
> jit_emit();
> addr = jit_address(lbl);
>
> You would then be able to change the function called using something
> like:
>
> jit_patch_again(addr, my_other_fn);

  It would be required to unmap and remap the code buffer.

  Part of it is done in the example in check/protect.c. After
that, you would currently need to patch it manually, basically copying
the _patch_at() specific to the architecture where it is implemented.
  If it is not in some inner loop that needs to be as fast as
possible, the code could load the pointer from a constant pool.
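
  The constant pool alternative can be sketched in plain C. The
target address lives in a writable data slot, the generated code
always jumps or calls through that slot, and retargeting is a single
store that never touches the code buffer. This is an analogy, not
Lightning code: my_fn and my_other_fn come from the quoted example,
while pool_slot, call_through_pool and retarget are hypothetical
names. In jit code the equivalent would presumably be a load of the
slot followed by an indirect jump or call (for instance jit_ldi and
then jit_jmpr or jit_callr).

```c
#include <assert.h>

typedef int (*target_fn)(int);

/* Two possible targets, standing in for my_fn / my_other_fn. */
static int my_fn(int x)       { return x + 1; }
static int my_other_fn(int x) { return x * 2; }

/* The "constant pool" entry: writable data, outside the code buffer. */
static target_fn pool_slot = my_fn;

/* Stands in for the generated code: load the pointer, call through it. */
static int call_through_pool(int x)
{
    return pool_slot(x);
}

/* Retargeting is a plain store; no remapping of executable memory. */
static void retarget(target_fn fn)
{
    pool_slot = fn;
}

/* Self-check: the same call site reaches a new target after a store. */
static void pool_self_check(void)
{
    assert(call_through_pool(3) == 4);
    retarget(my_other_fn);
    assert(call_through_pool(3) == 6);
}
```

  The cost is one extra memory load per jump, which is why it only
fits code that is not in a hot inner loop.
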

> Cheers,
> -Paul
>
>
> > > Cheers,
> > > -Paul
> > >
> > > > o Correct several bugs with jit_arg_register_p and
> > > >   jit_putarg{r,i}{_f,_d}. These bugs were not noticed earlier due
> > > >   to an incorrect check for correctness in check/carg.c.
> > > > o Add rip relative addressing support for x86_64 and shorter signed
> > > >   64 bit constant load if the constant fits in a signed 32 bit
> > > >   integer. This significantly reduces generated code size.
> > > > o Correct bugs in branch generation code for ppc and sparc.
> > > > o Correct bug in signed 32 bit integer load in ppc 64 bits.
> > > > o Add short relative unconditional branches and calls to mips,
> > > >   reducing generated code size.
> > > > o And several extra minor optimizations.
> > > >
> >
> > Thanks,
> > Paulo
>


