lightning

Re: [PATCH] jit_size: Don't round up the size to a multiple of 4 KiB


From: Paulo César Pereira de Andrade
Subject: Re: [PATCH] jit_size: Don't round up the size to a multiple of 4 KiB
Date: Tue, 7 Jun 2022 16:26:53 -0300

On Tue, Jun 7, 2022 at 16:07, Paul Cercueil
<paul@crapouillou.net> wrote:
>
>
>
> On Tue, Jun 7, 2022 at 15:39:35 -0300, Paulo César Pereira de Andrade
> <paulo.cesar.pereira.de.andrade@gmail.com> wrote:
> > On Tue, Jun 7, 2022 at 13:13, Paul Cercueil
> > <paul@crapouillou.net> wrote:
> >>
> >>  Hi Paulo,
> >
> >   Hi Paul,
> >
> >>  On Mon, Jun 6, 2022 at 16:03:04 -0300, Paulo César Pereira de Andrade
> >>  <paulo.cesar.pereira.de.andrade@gmail.com> wrote:
> >>  > On Sun, Jun 5, 2022 at 07:25, Paul Cercueil
> >>  > <paul@crapouillou.net> wrote:
> >>  >
> >>  >   Hi,
> >>  >
> >>  >   Applied, but some considerations below.
> >>  >
> >>  >>  When using an external code buffer, a program using Lightning will
> >>  >>  call jit_get_code() to get the expected size of the code, then
> >>  >>  allocate some space in the code buffer, then call jit_set_code() to
> >>  >>  specify where the code should be written to.
> >>  >
> >>  >   There should be some API to reset _jit->user_code, and the patch's
> >>  > behavior should apply only when _jit->user_code is set. Also, instead
> >>  > of rounding to 4096 bytes, it should probably round to the page size.
> >>  > The rounding up is just to "adapt" to what mmap does.
> >>  >
> >>  >>  If the reported size is rounded up to a multiple of 4 KiB, then the
> >>  >>  allocator will always try to allocate 4 KiB for blocks of code that
> >>  >>  might even be smaller than a hundred bytes. The program can then
> >>  >>  choose to realloc() the allocated block to the actual size of the
> >>  >>  generated code, to reduce the memory used.
> >>  >
> >>  >   Some extra logic is required to allow easy reuse of a _jit context.
> >>  > The current logic is good for significantly large jit buffers, but
> >>  > too costly for code generating very small buffers. Resetting the
> >>  > internal state of a _jit context should be far faster than deleting
> >>  > a context and creating a new one.
> >>
> >>  That's something I could actually use. Right now I have $(nproc)
> >>  worker threads compiling blocks of code in parallel. This can lead to
> >>  compiling 1000+ blocks per second, each one with its own _jit context.
> >>  I could use a _jit context per worker instead.
> >>
> >>  With that said - my setup has been working pretty well so far.
> >
> >   It would be interesting to have a debug build of lightning and run
> > your setup under 'perf record ...', to see where most time is being
> > spent in lightning code. It should be either in _jit_setup/_jit_follow
> > if large jit buffers, or in context creation/deletion if small jit
> > buffers.
> >
> >   Can you do it? Then just share information about most cpu time
> > in lightning, e.g. 'perf report | less' and get first entries.
>
> Samples: 819K of event 'cpu-clock:uhpppH', Event count (approx.):
> 204934500000, DSO: liblightning.so.1.0.0
>   Children      Self  Command   Symbol
>      0.87%     0.87%  pcsx4all  [.] _jit_optimize
>      0.58%     0.58%  pcsx4all  [.] _new_node
>      0.21%     0.21%  pcsx4all  [.] _jit_classify
>      0.17%     0.17%  pcsx4all  [.] _jit_reglive
>      0.15%     0.15%  pcsx4all  [.] _jit_update
>      0.11%     0.11%  pcsx4all  [.] _emit_code
>      0.10%     0.10%  pcsx4all  [.] jit_regset_scan1
>      0.06%     0.06%  pcsx4all  [.] _jit_data
>      0.05%     0.05%  pcsx4all  [.] _jit_regarg_set
>      0.05%     0.05%  pcsx4all  [.] _jit_regarg_clr
>      0.02%     0.02%  pcsx4all  [.] _jit_regarg_p
>      0.02%     0.02%  pcsx4all  [.] _jit_new_node_www
>      0.02%     0.02%  pcsx4all  [.] _jit_get_size
>      0.01%     0.01%  pcsx4all  [.] _jit_name
>      0.01%     0.01%  pcsx4all  [.] _jit_new_node
>      0.01%     0.01%  pcsx4all  [.] _jit_note
>      0.01%     0.01%  pcsx4all  [.] _jit_trampoline
>      0.01%     0.01%  pcsx4all  [.] _jit_emit
>      0.01%     0.01%  pcsx4all  [.] _jit_link
>      0.01%     0.01%  pcsx4all  [.] _simplify_movi
>      0.01%     0.01%  pcsx4all  [.] jit_alloc
>      0.01%     0.01%  pcsx4all  [.] _andi
>      0.01%     0.01%  pcsx4all  [.] _jit_patch_at
>      0.01%     0.01%  pcsx4all  [.] _jit_prolog
>      0.01%     0.01%  pcsx4all  [.] _jit_new_node_no_link
>      0.01%     0.01%  pcsx4all  [.] _stxi_i
>
> The rest is at 0.00%. That's on MIPS32r2.

  With these values there isn't much to gain from optimizing jit
context reuse.

> >>  >>  However, this will cause dramatic memory fragmentation; for
> >>  >>  instance, if working with a 2 MiB code buffer, in which 512 blocks
> >>  >>  of 4 KiB are allocated but later realloc'd to 128 bytes each, the
> >>  >>  total amount of allocated memory will be 128 * 512 == 64 KiB, with
> >>  >>  more than 1.9 MiB free, yet it will be impossible to allocate any
> >>  >>  new blocks as there would be no way to find a contiguous 4 KiB area.
> >>  >
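
The fragmentation arithmetic quoted above can be checked with a small stand-alone model. This is only a sketch and does not use Lightning: it assumes a hypothetical allocator that shrinks each block in place (so each 128-byte block stays at the start of its original 4 KiB slot), and all names (`fill_then_shrink`, `free_bytes`, `largest_free_run`) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CELL      128u                    /* bookkeeping granularity, in bytes */
#define BUF_BYTES (2u * 1024u * 1024u)    /* 2 MiB code buffer */
#define BLOCK     4096u                   /* size first requested per block */
#define NCELLS    (BUF_BYTES / CELL)

static unsigned char used[NCELLS];        /* 1 = cell occupied, 0 = free */

/* Fill the buffer with 4 KiB blocks, then shrink each one in place to
 * `keep` bytes (assumption: the allocator never moves a shrunk block). */
static void fill_then_shrink(size_t keep)
{
    assert(keep > 0 && keep <= BLOCK && keep % CELL == 0);
    memset(used, 0, sizeof used);
    for (size_t off = 0; off < BUF_BYTES; off += BLOCK)
        for (size_t b = 0; b < keep; b += CELL)
            used[(off + b) / CELL] = 1;
}

/* Total number of free bytes in the buffer. */
static size_t free_bytes(void)
{
    size_t n = 0;
    for (size_t i = 0; i < NCELLS; i++)
        n += used[i] ? 0 : CELL;
    return n;
}

/* Size in bytes of the longest contiguous free run. */
static size_t largest_free_run(void)
{
    size_t best = 0, run = 0;
    for (size_t i = 0; i < NCELLS; i++) {
        run = used[i] ? 0 : run + CELL;
        if (run > best)
            best = run;
    }
    return best;
}
```

After `fill_then_shrink(128)`, `free_bytes()` reports most of the buffer as free, while `largest_free_run()` stays below `BLOCK`, so a new 4 KiB request cannot be satisfied under this model.
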
> >>  >   Resetting a jit_context could also have some "self healing" code,
> >>  > in case some _jit->*.{count,length} has grown too large. The one most
> >>  > likely to use the most memory is _jit->pool.ptr, if some very large
> >>  > jit code buffer was written (it always rounds up to 1024 free nodes
> >>  > when running out of nodes).
> >>  >
> >>  >>  Besides, I really don't understand why it was rounded up to a
> >>  >>  multiple of 4 KiB, as this is not a requirement for mmap().
> >>  >
> >>  >   Usually the extra memory from rounding up to 4096 will either not
> >>  > be accessible (and not used by any other mmap call) or the code will
> >>  > just refuse to write to it. It was done that way so that, in case the
> >>  > code size was miscalculated by a few bytes, there would be no need to
> >>  > mmap or mremap again.  If mremap is available, that logic is mostly
> >>  > pointless...
> >>
> >>  What I was wondering was why it uses a sum of "maximum opcode size"
> >>  as the code's buffer size. Wouldn't it be possible to have two
> >>  jit_emit() passes, the first one incrementing a byte counter, the
> >>  second one actually writing data?
> >
> >   This would usually not be cheap. It could use significant time
> > keeping track of registers live state, to just throw it away and
> > restart.
> >   For very small jit buffers that could work, as overhead would be
> > minimal.
> >   Usually the sum of "maximum opcode size" is very close to the
> > amount of bytes that will be used.
>
> Actually, if I remember my tests correctly, it was closer to 2-3x the
> amount of bytes emitted.

  Maybe instead of getting the maximum instruction size, the
special build that gets the size should use the average value.
  Debugging should be somewhat simple: just output what
it estimates as the code size and what is actually used, then,
based on this data, choose a better pattern.

  Usually JIT_INSTR_MAX might be very large if, during data
collection (special build with --devel-get-jit-size), a very large
prolog/epilog was generated, which is common for the several
stress tests in make check.
  During code generation, it always checks that there are at least
JIT_INSTR_MAX bytes remaining, to avoid writing out of bounds.
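
  To make the two points above concrete, here is a toy model of the estimate and of the bounds check. It is not Lightning code: `INSTR_MAX` is a hypothetical stand-in for JIT_INSTR_MAX, and `struct toy_insn`, `estimate_size()` and `emit_all()` are invented for this sketch. It only shows why a "count * maximum size" estimate can be several times the emitted size, and how checking for INSTR_MAX free bytes before each instruction keeps writes in bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INSTR_MAX 16u   /* hypothetical stand-in for JIT_INSTR_MAX */

/* Hypothetical pre-encoded instruction: its bytes and their length. */
struct toy_insn {
    const unsigned char *bytes;
    size_t               len;
};

/* Worst-case estimate: every instruction is assumed to need INSTR_MAX bytes. */
static size_t estimate_size(size_t count)
{
    return count * INSTR_MAX;
}

/* Copy the instructions into buf. Before each one, require at least
 * INSTR_MAX free bytes, so no write can run past the end of the buffer.
 * Returns the number of bytes written, or (size_t)-1 if the check fails. */
static size_t emit_all(const struct toy_insn *insns, size_t count,
                       unsigned char *buf, size_t cap)
{
    size_t pos = 0;
    for (size_t i = 0; i < count; i++) {
        assert(insns[i].len <= INSTR_MAX);
        if (cap - pos < INSTR_MAX)
            return (size_t)-1;            /* buffer too small: stop early */
        memcpy(buf + pos, insns[i].bytes, insns[i].len);
        pos += insns[i].len;
    }
    return pos;
}
```

  Printing `estimate_size(count)` next to the value returned by `emit_all()` for real workloads is the kind of comparison suggested above for choosing a better estimate.
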

> Cheers,
> -Paul

Thanks,
Paulo


