emacs-devel

Re: MPS: Please check if scratch/igc builds with native compilation


From: Andrea Corallo
Subject: Re: MPS: Please check if scratch/igc builds with native compilation
Date: Tue, 21 May 2024 14:17:19 -0400
User-agent: Gnus/5.13 (Gnus v5.13)

Gerd Möllmann <gerd.moellmann@gmail.com> writes:

> Andrea Corallo <acorallo@gnu.org> writes:
>
>> At least here the error seems reproducible.  Bootstrapping with -j1
>> makes native compiling leim/ja-dic/ja-dic.el always fail.
>>
>> And if I run it under gdb I see we get a SIGSEGV in
>> 'maybe_resize_hash_table' at fns.c:4987
>>
>> memcpy (key, h->key, old_size * sizeof *key);
>
> That's a new one for me. Maybe you are hitting a read/write barrier?

Ah right maybe, interesting!

> I think Eli & Helmut can help here with what to do for the signals in
> GDB. (On macOS, MPS is using Mach exceptions, not signals.)
>
>> with the following bt
>>
>> (gdb) bt
>> #0  maybe_resize_hash_table (h=0x7fffe7dabd48) at fns.c:4987
>> #1  hash_put (h=0x7fffe7dabd48, key=XIL(0x7fffe4fc297b), value=XIL(0x30), hash=1644298) at fns.c:5162
>> #2  0x0000555555817fc0 in Fputhash (key=XIL(0x7fffe4fc297b), value=XIL(0x30), table=<optimized out>) at fns.c:5993
>> #3  0x00007ffff14f6313 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #4  0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc010) at eval.c:3032
>> #5  0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #6  0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc0d0) at eval.c:3032
>> #7  0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #8  0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc190) at eval.c:3032
>> #9  0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #10 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc250) at eval.c:3032
>> #11 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #12 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc310) at eval.c:3032
>> #13 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #14 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc3d0) at eval.c:3032
>> #15 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #16 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc490) at eval.c:3032
>> #17 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #18 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc550) at eval.c:3032
>> #19 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #20 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc610) at eval.c:3032
>> #21 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #22 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc6d0) at eval.c:3032
>> #23 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #24 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc760) at eval.c:3032
>> #25 0x00007ffff14f692c in F627974652d72756e2d73747269702d73796d626f6c2d706f736974696f6e73_byte_run_strip_symbol_positions_0 ()
>> [...]
>>
>> Which is admittedly different from what I saw from the command line.
>>
>>> To debug this, I changed the check in igc.c to not assert, but print
>>> the PID, and enter an endless loop sleeping. This makes it possible to
>>> attach to the process with LLDB.
>>>
>>> In all cases I investigated in this way, I'm seeing a pattern: What is
>>> happening is that a function in the Emacs core is called from a
>>> native-compiled function. Things look like, simplified,
>>>
>>>   /* In some .eln */
>>>   Lisp_Object d_reloc[100];
>>>
>>>   Lisp_Object some_native_compiled_lisp_function ()
>>>   {
>>>     Lisp_Object frame[2];
>>>     frame[0] = d_reloc[17]; // some symbol
>>>     frame[1] = ...
>>>     f_reloc->funcall (2, frame);
>>>   }
>>>
>>> where f_reloc is a large struct with function pointer members for
>>> the functions called from the .eln. Doesn't matter. We then land in
>>> Ffuncall in the Emacs core, and the first element of its args vector,
>>> a symbol, is found to be forwarded which leads to the assertion.
>>>
>>> d_reloc in the .eln is scanned in igc.c, and it being on the control
>>> stack, in frame[], or in a register, should pin it, one would assume.
>>> So how come Ffuncall in Emacs receives an invalid symbol?
>>>
>>> I've checked that d_reloc is indeed scanned by fix_comp_unit. The
>>> check gives me reasonable confidence that this "should work". But as
>>> an alternative, I also made all the things like d_reloc in the .elns
>>> ambiguous roots, so that they cannot possibly be moved, if all works as
>>> expected.
>>>
>>> - No change, it still asserts in the same way.
>>>
>>> - Changing optimization levels - no change.
>>> - Changing from arm64 to x86_64 - no change.
>>
>> That's very bizarre, I have a hard time believing we are hitting such a bug :/
>> Hope we are missing something.
>
> Yes, bizarre is a good description. I'm out of ideas.

Do you think it is very difficult to debug MPS to understand why a certain
object is being moved (when it should not be)?  On GNU/Linux we could record
an rr trace (so that everything is reproducible) and step back and forth
to try to shed some light on this, maybe?
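
For reference, the "print the PID and sleep instead of asserting" trick
from the quoted text could look roughly like the sketch below.  This is
not the actual igc.c change: wait_for_debugger, debugger_attached,
CHECK_OR_WAIT and the max_seconds limit are hypothetical names added
here for illustration only.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical flag.  Once attached, a debugger can set it to 1
   (e.g. "set var debugger_attached = 1" in GDB) to leave the loop.  */
static volatile int debugger_attached = 0;

/* Hypothetical helper: print the PID, then sleep in one-second steps
   until the flag is set.  MAX_SECONDS == 0 means wait forever, which
   matches the "endless loop" described in the quoted text; a nonzero
   limit is an addition that makes the helper easy to exercise.
   Returns the number of one-second steps waited.  */
static unsigned
wait_for_debugger (unsigned max_seconds)
{
  unsigned waited = 0;

  fprintf (stderr, "check failed in PID %ld, waiting for a debugger\n",
	   (long) getpid ());
  while (!debugger_attached && (max_seconds == 0 || waited < max_seconds))
    {
      sleep (1);
      waited++;
    }
  return waited;
}

/* Hypothetical stand-in for an assertion: park the process instead of
   aborting, so its state can still be inspected.  */
#define CHECK_OR_WAIT(cond)			\
  do						\
    {						\
      if (!(cond))				\
	wait_for_debugger (0);			\
    }						\
  while (0)
```

With something like this in place of the assertion, the stuck process
can be attached to with "gdb -p PID" or "lldb -p PID", using the PID
printed on stderr.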

  Andrea


