bug#7728: 24.0.50; GDB backtrace from abort


From:	Eli Zaretskii
Subject:	bug#7728: 24.0.50; GDB backtrace from abort
Date:	Thu, 13 Jan 2011 21:40:35 -0500
> From: "Drew Adams" <drew.adams@oracle.com>
> Cc: <monnier@iro.umontreal.ca>, <7728@debbugs.gnu.org>
> Date: Thu, 13 Jan 2011 17:19:43 -0800
> 
> > > In this case the `save-window-excursion' should amount to a 
> > > no-op in the end. The source and target window and frame need
> > > not be the same in general, but they are the same in the
> > > crashes I reported.
> > 
> > I don't believe this to be true, at least not from Emacs's internals
> > POV.  The code that crashes clearly executes the branch where the
> > frame recorded by save-window-excursion is NOT the selected frame by
> > the time the body of save-window-excursion is done being evaluated.
> 
> As I said, I followed the _source_ code in the debugger.  And the source code
> does not cause a crash.  The source code lets us know what _should_ be
> happening here, not what is actually happening that provokes a crash.

Since you couldn't reproduce the crash under the Lisp debugger, the
evidence you collected during that debugging session is not really
admissible in the court of Emacs bugs ;-)

IOW, the backtrace you posted clearly shows that somehow,
save-window-excursion needed to switch frames, and its code that
restores the original window configuration therefore needed to select
a different frame.  That is a fact revealed by the C backtrace.  If we
want to make sure that this frame switch is real, I would suggest
looking at the values of sf and w->frame in this fragment from
select-window:

  sf = SELECTED_FRAME ();
  if (XFRAME (WINDOW_FRAME (w)) != sf)
    {
      XFRAME (WINDOW_FRAME (w))->selected_window = window;
      /* Use this rather than Fhandle_switch_frame
         so that FRAME_FOCUS_FRAME is moved appropriately as we
         move around in the state where a minibuffer in a separate
         frame is active.  */
      Fselect_frame (WINDOW_FRAME (w), norecord);

For that, you need to reproduce the crash, then go to the call-stack
frame where select-window (Fselect_window) invokes select-frame
(Fselect_frame).  In your original backtrace, this was frame #15:

 #12 0x01288ef3 in Fredirect_frame_focus (frame=93005829, focus_frame=93005829)
     at frame.c:2082
 #13 0x0127f4c8 in do_switch_frame (frame=93005829, track=1, for_deletion=0,
     norecord=49010714) at frame.c:847
 #14 0x01280733 in Fselect_frame (frame=93005829, norecord=49010714)
     at frame.c:899
 #15 0x01252702 in Fselect_window (window=93006853, norecord=49010714)
     at window.c:3581
 #16 0x0125e7c8 in Fset_window_configuration (configuration=99327941)
     at window.c:6148

So in that case, you would need to issue the following GDB commands:

 (gdb) frame 15
 (gdb) p sf->name
 (gdb) xstring
 (gdb) p w->frame
 (gdb) xframe

The 3rd and the 5th commands will display the names of the two frames,
the one that's selected at this point, and the one to which the window
w (from the configuration being restored) belongs, respectively.  We
could then try to understand how come Emacs thinks it needs to switch
frames, while your analysis of the Lisp code suggests these two should
have specified the same frame.

(Note that frame #15 could have a different number in a different
crash, so look for the frame whose description is the same as what is
shown above, i.e. a call from Fselect_window to Fselect_frame, and use
the number of that frame.)

> > > * Let me repeat that the _source code works fine_ - no 
> > > error, no crash, no bug.
> > > 
> > > * Let me repeat too that the byte-compiled code (no matter 
> > > which Emacs version it was compiled with) works fine in all
> > > Emacs versions except the current development code - no error,
> > > no crash, no bug.
> > 
> > I don't think this to be relevant, sorry.
> 
> Why?  The only thing new to the mix is the new Emacs dev version.  The source
> code and the byte-compiled code are the same as before.  The regression is not
> realized using the source code.  It happens only with the new dev version when
> it executes the byte code.  Why isn't that relevant?

Because we have the C backtrace (thanks to you).  And that backtrace
speaks for itself.  There's nothing in it that cannot be understood
without invoking some non-trivial bug in the compiled byte code.  So,
while it's certainly possible that byte compilation has some unwanted
effect here, it sounds extremely unlikely, certainly not the first
explanation we should try.

> > I'm inclined to think that it's some weird side effect of
> > Edebug, or maybe something else.
> 
> You think _what_ is a weird effect of edebug?

The fact that uncompiled code seems to avoid the crash.

> With the debugger there was no crash.  So it certainly cannot be
> some weird effect of the debugger that is causing the crash.

The way I see it, you had a crash with byte-compiled code without the
debugger, and you had no crash with uncompiled code under the
debugger.  Which of these two variables caused the difference in
behavior remains to be seen.

> > > This is a _regression_ due to some change in the development
> > > version that no longer plays well with the byte-compiled code.
> > 
> > That's a possibility, but I think it's a remote one.
> 
> Seems more like an inescapable conclusion, to me.  Substitute any other Emacs
> version and presto: no problem.  Substitute the source code for the byte code
> and presto: no problem.

To really convince me of this, you would have to run Emacs under GDB,
using the source Lisp code, step through all the functions involved in
the crash, and show that the crash is indeed avoided, and why.

If you give me a reproducible recipe for the crash, I might try doing
this myself.

> > The offending code
> 
> What offending code?

The one that sets to nil the internal variable which holds the
selected window.  That's what triggers the crash, because way down the
call-stack, Emacs tries to reference the mode-line face of the frame
held in that variable.

> What you see as offending code, if it was already in 21.1, did not present a
> problem - it wasn't offending anyone.

Ever heard of bugs that lurk and rear their ugly head years after they
were introduced?

> > has been in Emacs since v21.1, so the problem is not new in any way.
> 
> Of course the problem is new.  It's a _regression_.

Only if the issue is looked at phenomenologically.  From my POV, this
bug was there for years.

> There is no such crash in any prior Emacs version.

But you have never before used any Emacs binary compiled with
ENABLE_CHECKING, have you?  Only such a version will crash, because it
does extra checking.

> You and I have different views of what "the problem"
> is, I guess.  For me, the problem is the crash.  That's new.

We can never fix the crash unless we understand what code causes it,
and why.  I posted here many messages ago why it crashes, and what
I found does not need to invoke any mysterious changes introduced
by the byte compiler to explain the crash.  It is crystal clear that,
under specific and well-defined circumstances set-window-configuration
and any code that calls it, including save-window-excursion, can crash
in the same way, if the window configuration being restored was
recorded in a different frame.  _That_ is the problem I'm trying to
fix in this bug.

While the crash in your specific use-case could indeed be new (if it
is explained by something other than the fact you are for the first
time using a binary compiled with ENABLE_CHECKING), the defect in the
code that I found and described could cause crashes in any number of
other use-cases, which have nothing to do with byte-compiling.  I'm
trying to find a solution for all those use-cases, not just for yours.

> > I think you interpret the latest messages incorrectly.  No one is
> > arguing that your code is the culprit.  The correct way to fix this
> > bug was pointed out by Stefan several messages ago, and I will do just
> > that when I have time.
> 
> I did not understand that you have a solution.  I didn't get that impression
> from your asking me to check the selected window in the debugger etc.

I asked that to have more evidence to back up my analysis.  It's never
a bad idea to look for more evidence, because sometimes it can
contradict the best hypothesis and change the whole picture.