Re: Reboots?

bug-hurd

From:	Roland McGrath
Subject:	Re: Reboots?
Date:	Mon, 2 Apr 2001 21:15:10 -0400 (EDT)
> When in gdb after the crash, can I somehow figure out what the memory area
> looks like which contains the stack? 

Certainly.  The cthreads structure for the thread (struct cproc) stores the
bounds of the stack, so you know what memory region it lies in.  If it's
large, you can use vminfo (if you have a sub-hurd anyway, or maybe gdb has
a vminfo-like command?) to figure out which parts are unused zero-fill that
has never been touched (so you don't scan all that), or you can just use
gdb's commands (I've forgotten what the command is called) to search
through large amounts of memory and skip the zeros.
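
If you would rather get the stack region out of gdb and scan it offline,
skipping the zeros is simple enough.  The following is only a sketch: it
assumes the region has already been copied into an array of 32-bit words,
and the function names next_nonzero and end_of_run are made up for
illustration, not anything gdb or the Hurd provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first nonzero word in WORDS[START..N), or N
   if every remaining word is zero.  */
size_t
next_nonzero (const uint32_t *words, size_t n, size_t start)
{
  size_t i = start;
  assert (start <= n);
  while (i < n && words[i] == 0)
    i++;
  return i;
}

/* Return the index one past the run of nonzero words that begins at
   START, i.e. the index of the next zero word, or N if there is none.  */
size_t
end_of_run (const uint32_t *words, size_t n, size_t start)
{
  size_t i = start;
  assert (start <= n);
  while (i < n && words[i] != 0)
    i++;
  return i;
}
```

Calling next_nonzero and end_of_run alternately walks the dump one
interesting region at a time, so only the runs that hold data need to be
looked at by hand.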

> Maybe the amount and content of wrong data provides some data points
> (like, a memory page of zeros would give a different impression than a
> small sequence of small integers).

Certainly.  If the top of the stack (lowest addresses used) has hundreds or
thousands of bytes of data that's obviously not call frames, then we may
learn something about the clobberation from the contents of that data, but
it will be difficult or impossible to figure out what code was running
because the call frames were completely clobbered.

> I can figure out how to dump memory ranges, but I don't know how to get
> the address of the stack (and I wonder if it would be still valid after
> the crash. gdb info doc has something about a frame pointer register,
> which would likely be damaged. Maybe one can go the chain down from the
> main frame, or so? I really don't know).

You can indeed start from the base of the stack and try to work your way
up, but that can be pretty difficult.  The frames are stored in a way
intended to make it easy to go in the other direction, i.e. unwind.

You probably have a better chance of just guessing whereabouts the current
frame was on the stack from other register values and so forth, and then
unwinding.  It really depends on how much clobberation there is.

> I would like to get an idea of the amount of corruption going on at a very
> low (binary data) level, if feasible.

Well unless there is a really obvious data pattern (the ASCII values of
"Kilroy was here", for example ;) then you have a lot of work to do to
figure out what the stack is supposed to contain for comparison.

It seems like a quick lesson on stack frames is in order.  

The x86 has the canonical sort of call frames for non-RISC machines.  The
relevant special-purpose registers are the PC (%eip), stack pointer (%esp),
and frame pointer (%ebp).  

The stack grows down, so PUSH X means *--((int*)SP) = X
and POP X means X = *((int*)SP)++ in C syntax.
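
The one-liners above are shorthand rather than compilable C.  A toy model
that does compile might look like the following; it is just a sketch, with
the stack represented as a small array of 32-bit words and SP as an index
into it.  The names toy_stack, toy_init, toy_push and toy_pop are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 64

/* A toy downward-growing stack of 32-bit words.  SP is an index into
   MEM; an empty stack has SP one past the last element, just as an
   empty x86 stack has %esp just above the highest usable word.  */
struct toy_stack
{
  uint32_t mem[STACK_WORDS];
  size_t sp;
};

void
toy_init (struct toy_stack *s)
{
  s->sp = STACK_WORDS;
}

/* PUSH X: decrement SP first, then store at the new SP.  */
void
toy_push (struct toy_stack *s, uint32_t x)
{
  assert (s->sp > 0);
  s->mem[--s->sp] = x;
}

/* POP X: load from SP first, then increment SP.  */
uint32_t
toy_pop (struct toy_stack *s)
{
  assert (s->sp < STACK_WORDS);
  return s->mem[s->sp++];
}
```

Note that popping does not erase anything: the old value stays in memory
below SP until a later push overwrites it.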

A call instruction CALL FUNC is the same as:

        PUSH PC
        JMP FUNC

A return instruction is the same as:

        POP PC

(On the x86 and many other machines, you can't actually refer to the PC
register as a normal register like this, only by special call/return insns.)
Function arguments are pushed on the stack, so a call FUNC(A,B,C) does:

        PUSH C                  pushl %ecx
        PUSH B                  pushl %ebx
        PUSH A                  pushl %eax
        CALL FUNC               call FUNC
        POP [3 words]           addl $12,%esp

When a function wants to use the stack for anything (this means most every
function, except highly-optimized special cases or highly-optimized leaf
functions), it allocates a call frame on the stack and uses the frame
pointer to keep track of this.  Functions start with:

        PUSH FP                 pushl   %ebp
        FP = SP                 movl    %esp,%ebp

Then they save any callee-saves registers they are going to use:

                                pushl   %esi
                                pushl   %ebx

(On the x86, %eax,%ecx,%edx are the call-clobbered registers, which means
that a called function can just use them and leave garbage values when it
returns; actually the value left in %eax is the return value of the
function, and a combination of registers is used for long long or struct
return values.  The calling convention requires that a called function
preserve the starting values of all the other general registers when it
returns.)

Then it allocates stack space for local variables:

        SP -= N                 subl    $16,%esp  # say 4 words of locals

A function with a frame pointer is free to push more on the stack later and
not keep track of exactly how much.  The pathological example of this is
alloca; X = alloca(N) just does:

        SP -= N                 subl    %ecx,%esp
        X = SP                  movl    %esp,%eax

When the function needs to refer to its arguments, it just uses FP as a
pointer: FP+8 is the first argument, FP+12 the second, and so forth.
(The word at FP itself is the caller's saved FP value, and the word at FP+4
is the return address pushed by the call.)  Local variable
space is likewise referred to relative to FP: FP-16 is the first word of
local variable space, FP-12 the second, FP-4 the last.  For example,
{ int x,y; double f; foobar(&x,&y,&f); } might be:

        PUSH FP-16              # actually requires a few insns with an
        PUSH FP-8               # intermediate register
        PUSH FP-4
        CALL foobar             call foobar
        POP [3 words]           addl $12,%esp

(Note that in our example above, the function used the bytes at [FP-8,FP) 
to store the starting values of %esi and %ebx.  So the highest-addressed
word of local variable space in that function would actually be at FP-12.
The foobar example presumes no register save space.)

When a function returns, it unwinds its call frame with:

        SP = FP - saved         leal -8(%ebp),%esp
        POP saved registers     popl %ebx
                                popl %esi
        POP FP                  popl %ebp
        RETURN (i.e. POP PC)    ret

Because the SP is always restored from the FP, the function is free to move
the SP around (i.e. push things) within the function and not always clean
them up.  So in optimized code you will sometimes see the compiler omit the
pop of the arguments after a call, because they will be implicitly popped
along with the whole call frame on return (it doesn't make this
optimization for a call inside a loop, since the amount of stack space
pushed and not popped multiplies by the iterations).

What gdb does to give you a backtrace (or any other info like function
arguments or local variables) is start with the current PC and FP values to
see what code is running and what its call frame contains.  Without knowing
anything about the code in question, just going on the presumption that
it's using the normal frame pointer convention, gdb can look at *(FP+4) and
see the return address in the calling function (the address of the
instruction immediately after its "call" instruction); in *FP it can see
the caller's FP value, and iterate the procedure using those saved PC and
FP values found on the stack in place of the actual register values that
pertain to the innermost call frame.
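
That iteration can be sketched in C.  This is not gdb's code, just an
illustration under some assumptions: the stack has been captured as an
array of 32-bit words with a known base address, and every function on it
used the PUSH FP / FP = SP prologue shown earlier, so that the word at FP
is the caller's FP and the word just above it, at FP+4, is the return
address left by the call.  The names stack_dump, frame_in_dump, word_at
and walk_frames are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A snapshot of stack memory: WORDS[i] is the 32-bit word that lived
   at address BASE + 4*i.  */
struct stack_dump
{
  uint32_t base;
  const uint32_t *words;
  size_t nwords;
};

/* Nonzero if ADDR is word-aligned relative to BASE and the dump holds
   both the word at ADDR (saved FP) and the one above it (return PC).  */
int
frame_in_dump (const struct stack_dump *d, uint32_t addr)
{
  if (addr < d->base || (addr - d->base) % 4 != 0)
    return 0;
  return (addr - d->base) / 4 + 1 < d->nwords;
}

/* Fetch the word stored at ADDR, which must lie inside the dump.  */
uint32_t
word_at (const struct stack_dump *d, uint32_t addr)
{
  assert (addr >= d->base && (addr - d->base) / 4 < d->nwords);
  return d->words[(addr - d->base) / 4];
}

/* Follow the chain of saved frame pointers starting at FP, storing up
   to MAX return addresses in RET.  Returns the number stored.  */
size_t
walk_frames (const struct stack_dump *d, uint32_t fp,
             uint32_t *ret, size_t max)
{
  size_t n = 0;
  while (n < max && frame_in_dump (d, fp))
    {
      uint32_t caller_fp = word_at (d, fp);     /* *FP */
      ret[n++] = word_at (d, fp + 4);           /* *(FP+4) */
      if (caller_fp <= fp)
        break;          /* Callers must live at higher addresses.  */
      fp = caller_fp;
    }
  return n;
}
```

The check that each saved FP is higher than the last is what stops the
walk from looping forever when the chain has been clobbered.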

When there are debugging symbols associated with the PC value, these tell
gdb everything it needs to know about what that particular function puts
into its call frame.  It knows the names and types of the arguments, so it
knows what FP+n values to fetch and show you.  It knows which FP-n spots
hold which saved register values and which local variables.  The gdb
command "info frame" shows you where on the stack each item is stored.
When gdb shows you "frame #0", that means the actual contents of the thread
registers.  When you use the "up" command to look at the calling frame, it
is using this information to find the register values saved by the function
prologue of the frame #0 function, and populate its idea of what the
registers looked like in frame #1 before it made the call.  When you do
"up" again, it follows the debugging info for the function executing in
frame #1 to read the frame #1 stack space containing the values the
registers had in frame #2, and so on.  (That's why "info registers" at
frames other than #0 does a memory read.)


So, when you have a case like ours where the SP and PC have jumped off into
bonzoland, this whole procedure can't get bootstrapped.  But, if you can
manage to figure out somehow a pair of PC,FP values that corresponds to
some actual call frame, you can do (make sure you are at frame #0):

        (gdb) set $ebp = 123
        (gdb) set $pc = 456
        (gdb) info frame
        (gdb) bt

and gdb will do its thing.  The information for the PC=456,FP=123 frame
itself may be mostly garbage since they will be the thread's clobbered(?)
register values at the time it crashed, but examining the outer frames will
read saved values from the stack.

There are two approaches to figuring out likely PC and FP values.  First,
look at all the registers of the crashing thread.  If any of these are
valid pointers to text, data, or stack, they might give you a clue.
Obviously, PC, SP, and FP are the most useful ones, but we are here because
those ones are totally clobbered.  Any register value that is a valid text
address is a clue: a pointer to some function that was used somewhere; an
address in the middle of a function might be a return address of a recent
caller that was on the stack and got restored into a register during the
pathology.  While executing a function that's in a shared library (actually
anything compiled with PIC), %ebx contains a pointer to the library's GOT.
You can see from "info shared" what ranges of addresses belong to the
text/data of what libraries, and "info files" shows what ranges belong to
the executable's text/data.

Any register value that is a pointer into the thread's stack is likely to
be near where the top of the stack was, so we can look at the data around
such an address.  You can also just look at the whole stack area of the
thread and guess where you think the last-used top was.  Be wary of what
you see, since there might be areas of stack containing entirely valid call
frames of old calls that returned a long time ago, and the stack just never
got deep enough on later calls to reuse that space.

I would examine likely areas of the stack with "x/100a ADDR" and the like,
just looking for any valid text addresses anywhere that give me a clue
where to start hunting.
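
The same hunt can be mechanized a little.  The sketch below assumes you
have the stack as an array of 32-bit words and know the address range of
the text you care about (from "info files" or "info shared").  It flags
every spot where one word could be a saved FP and the next word up could
be a return address.  The function name find_frame_candidates and its
parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan a stack dump for spots that look like the base of a call frame.
   WORDS[i] is the word that lived at address BASE + 4*i.  A spot
   qualifies if the word there could be a saved FP (word-aligned and
   pointing higher up within the dump) and the word just above it could
   be a return address (in [TEXT_LO, TEXT_HI)).  Stores up to MAX
   candidate addresses in OUT and returns how many were stored.  */
size_t
find_frame_candidates (const uint32_t *words, size_t nwords, uint32_t base,
                       uint32_t text_lo, uint32_t text_hi,
                       uint32_t *out, size_t max)
{
  size_t n = 0;
  uint32_t end = base + 4 * (uint32_t) nwords;

  assert (text_lo <= text_hi);
  for (size_t i = 0; i + 1 < nwords && n < max; i++)
    {
      uint32_t here = base + 4 * (uint32_t) i;
      uint32_t saved_fp = words[i];
      uint32_t ret = words[i + 1];
      int fp_ok = (saved_fp > here && saved_fp < end
                   && (saved_fp - base) % 4 == 0);
      int ret_ok = ret >= text_lo && ret < text_hi;

      if (fp_ok && ret_ok)
        out[n++] = here;
    }
  return n;
}
```

Each address it reports is only a guess at an FP value to try with "set
$ebp".  It will also report stale frames left behind by calls that
returned long ago, and it will miss an outermost frame whose saved FP is
zero.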

Note that mach_msg_server_timeout allocates two big message buffer areas in
its call frame with alloca, so you'll need to skip over that if going from
the outermost frames in towards the demux function.  This is an assert that
certainly should be there:

RCS file: /cvs/glibc/libc/mach/msgserver.c,v
retrieving revision 1.5
diff -u -b -p -r1.5 msgserver.c
--- msgserver.c     1996/12/20 01:32:35 1.5
+++ msgserver.c     2001/04/03 01:06:45
@@ -116,6 +116,7 @@ __mach_msg_server_timeout (boolean_t (*d
        Pass it to DEMUX for processing.  */
 
          (void) (*demux) (&request->Head, &reply->Head);
+           assert (reply->Head.msgh_size <= max_size);
 
          switch (reply->RetCode)
              {

If the demux function wrote past the end of the reply buffer, that would
certainly corrupt the stack.  But from your traces we know that all the
message sizes were small, so that is not our problem here.