[lmi] Debugging a segfault [Was: product editor feedback]


From: Greg Chicares
Subject: [lmi] Debugging a segfault [Was: product editor feedback]
Date: Sat, 31 Mar 2007 12:15:59 +0000
User-agent: Thunderbird 1.5.0.4 (Windows/20060516)

On 2007-3-2 3:13 UTC, Greg Chicares wrote:
> 
>   "Access Violation...Reading from location 00000001"

Our users have scant patience with segfaults. When we observe one,
we need to track down and eliminate its cause.

> By loading another file first, then loading a database file,
> I can get different outcomes: sometimes everything seems to
> work, while other times I observe messages like this:
> 
>   Assertion 'a_idx[j] < axis_lengths[j]' failed.
>   [file C:/lmi/checkouts/product-editor_branch/lmi/ihs_dbvalue.cpp, line 324]
> 
> (after which I needed to terminate the program manually), or:
> 
>   Trying to index database item with key 0 past end of data.
>   [file C:/lmi/checkouts/product-editor_branch/lmi/ihs_dbvalue.cpp, line 337]
> 
> (after which it abended in a way that 'drmingw' didn't catch).

Often the hardest task is to find a reproducible test case.
Here, we were lucky: loading the program and opening a file
reproduces the problem, more or less reliably. Wendy and I both
have been able to reproduce it several times in a row, but then
have lost that ability, more than once, without recompiling. In
my case, burning a CD caused the problem to "go away" for a
while, presumably because doing so changed large areas of RAM.
Then it came back later.

Repeating those steps seems no longer to reproduce the problem if
I apply '20070316T1232Z_database_fix_crash.v03.patch', which I
have not yet done in HEAD. I don't know that it actually prevents
the underlying problem, because I don't know what the underlying
problem is. As in all such situations, it's conceivable that it
just masks the symptoms; that would be the worst possible outcome,
because we'd still have the problem and it would likely enough
reappear later in a way that we might find difficult to reproduce.
We need to search for the underlying cause.

> My hypothesis is that DatabaseTableAdapter::db_value_ is getting
> corrupted in some way that (when it crashes) eludes the sanity
> tests in TDBValue::operator[]().

Adding
  asm("int $3");
where assertions fired above, and rerunning under gdb...

Program received signal SIGTRAP, Trace/breakpoint trap.
0x011d7238 in TDBValue::operator[] () at shared_ptr.hpp:247
247             BOOST_ASSERT(px != 0);
Current language:  auto; currently c++
(gdb) bt
#0  0x011d7238 in TDBValue::operator[] () at shared_ptr.hpp:247
#1  0x00417f6b in DatabaseTableAdapter::DoGetValue (this=0x3262b08,
    address@hidden) at C:/lmi/src/lmi/database_view_editor.cpp:241
#2  0x0042fe85 in MultiDimGrid::GetValue (this=0x3269988, row=0, col=0)
    at multidimgrid_any.hpp:434
#3  0x105162fd in wxGrid::GetCellValue (this=0x326b9d8, row=0, col=0)
    at ../include/wx/generic/grid.h:1483

...tells us something about how the problem arises.
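The breakpoint trick used above can be wrapped in a small helper so that it is easy to add and remove. This is only a sketch: the names SKETCH_DEBUG_BREAK and break_unless are hypothetical, not part of lmi, and the fallback to std::abort() on other compilers or architectures is an assumption made for portability.

```cpp
#include <cstdlib>

// Hypothetical helper; these names are illustrative and are not lmi's.
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    // Same instruction as in the message: an x86 software breakpoint.
#   define SKETCH_DEBUG_BREAK() __asm__ volatile("int $3")
#else
    // Assumed fallback where the x86 instruction is unavailable.
#   define SKETCH_DEBUG_BREAK() std::abort()
#endif

// Stop in the debugger at the point where a condition first fails.
inline void break_unless(bool condition)
{
    if(!condition)
        {
        SKETCH_DEBUG_BREAK();
        }
}
```

Under gdb, a failed condition stops the program with SIGTRAP so that 'bt' can be used as shown; outside a debugger the same trap normally terminates the program, so such calls belong only in diagnostic builds.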

> I wouldn't claim that those [sanity]
> tests are comprehensive.

Therefore, it is sensible to make them more comprehensive. That
is, an implicit (not asserted) precondition is violated, and a
segfault ensues, so we need to assert some more preconditions.
The immediate goal is to replace the segfault with an assertion
failure from which users can recover, and which helps us find
the cause.
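A minimal sketch of what asserting more preconditions might look like, assuming the check throws an exception that the GUI can catch and report. The function name check_index_preconditions and the constant number_of_axes are hypothetical; the value 7 is taken from the diagnostics quoted in this message, and the per-axis test mirrors the failed assertion 'a_idx[j] < axis_lengths[j]'.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the real constant (reported as 7 in this message).
std::size_t const number_of_axes = 7;

// Throw a descriptive exception instead of indexing with bad arguments.
void check_index_preconditions
    (std::vector<int> const& axis_lengths
    ,std::vector<int> const& index
    )
{
    if(number_of_axes != axis_lengths.size() || number_of_axes != index.size())
        {
        std::ostringstream oss;
        oss
            << "Expected " << number_of_axes << " axes, but"
            << " axis_lengths.size() is " << axis_lengths.size()
            << " and index.size() is " << index.size() << "."
            ;
        throw std::runtime_error(oss.str());
        }
    for(std::size_t j = 0; j < number_of_axes; ++j)
        {
        // Each index must lie in [0, axis_lengths[j]).
        if(index[j] < 0 || axis_lengths[j] <= index[j])
            {
            std::ostringstream oss;
            oss
                << "Index " << index[j] << " on axis " << j
                << " is outside [0, " << axis_lengths[j] << ")."
                ;
            throw std::runtime_error(oss.str());
            }
        }
}
```

Checking the sizes before the per-axis loop matters: the loop is safe only once both vectors are known to have exactly number_of_axes elements.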

On 20070330T0013Z I added a verbose test to 'ihs_dbvalue.cpp'.
Here are some of the diagnostics it produces:

Trying to index database with key -755914244: e_number_of_axes is 7, and 
axis_lengths.size() is 3839906283, but those quantities must be equal.

Trying to index database with key 1: e_number_of_axes is 7, and 
axis_lengths.size() is 4281947065, but those quantities must be equal.

Trying to index database with key 0: e_number_of_axes is 7, and 
axis_lengths.size() is 0, but those quantities must be equal.

Trying to index database with key 0: e_number_of_axes is 7, and 
axis_lengths.size() is 32769, but those quantities must be equal.

The pending patch, AIUI, would avoid calling this function for
non-leaf nodes, such as keys 0 and 1 above. We have diagnostics
that refer to other keys, though. There exist only a few hundred
keys, numbered sequentially from zero, so -755914244 is not a
plausible key. If this function were called only on sane objects,
then we might conclude that the patch does not remove the entire
problem. But I don't reach that conclusion because the objects do
not look sane.
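The plausibility argument about keys can be written down as a range test. This is a sketch only: is_plausible_key and its key_count parameter are hypothetical, and the count of "a few hundred" would have to come from wherever the real list of keys is defined.

```cpp
#include <cstddef>

// Keys are described as numbered sequentially from zero, so a plausible
// key lies in [0, key_count). 'key_count' is a hypothetical parameter.
inline bool is_plausible_key(long key, std::size_t key_count)
{
    return 0 <= key && static_cast<unsigned long>(key) < key_count;
}
```

With any count in the hundreds, a key such as -755914244 fails this test, which is consistent with reading it as garbage rather than as a real key.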

The length of this vector
    std::vector<int>    axis_lengths;
is sometimes close to 0x07FFF or 0x0FFFFFFFF. The program was not
actually allocating anything like 0x0FFFFFFFF bytes of RAM: I was
monitoring memory usage and would certainly have noticed that.
This vector seems to be an insane object such as could arise from
use of uninitialized memory, or a bad cast from a stray pointer;
I don't believe it could have resulted from running a ctor.
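One mechanism that could produce such lengths without any large allocation is sketched below. It assumes a 32-bit build and a vector that stores begin and end pointers and computes size() from their difference; that layout is common, but it has not been verified for this toolchain. The function simulated_size is hypothetical and works on plain integers standing in for pointer bits, so it runs safely.

```cpp
#include <cstdint>

// Simulate size() for a vector<int> whose begin/end pointers hold
// arbitrary 32-bit patterns (assumed layout; not verified for lmi's build).
inline std::uint32_t simulated_size
    (std::uint32_t start_bits
    ,std::uint32_t finish_bits
    )
{
    // Pointer subtraction yields a signed byte difference...
    std::int32_t const byte_diff = static_cast<std::int32_t>
        (static_cast<std::uint32_t>(finish_bits - start_bits)
        );
    // ...divided by the element size (4 bytes assumed for int)...
    std::int32_t const element_diff = byte_diff / 4;
    // ...and then reinterpreted as an unsigned size.
    return static_cast<std::uint32_t>(element_diff);
}
```

Under these assumptions, a sane seven-element vector gives 7, while an "end" pointer that lies just four bytes before "begin" gives 4294967295. Garbage pointers would therefore tend to yield sizes that are either modest or just below 2^32, with no memory being allocated at all.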

It makes sense to rule out initialization error. Class TDBValue
has no fewer than seven ctors, two of which didn't initialize
axis_lengths, so on 20070330T1009Z and 20070330T1403Z I remedied
that (even though default initialization to a length of zero
would have taken place). The problem is still reproducible: I
observe diagnostics like these:

Trying to index database with key 52499088: e_number_of_axes is 7, and 
axis_lengths.size() is 4281842414, but those quantities must be equal.

Trying to index database with key 52499264: e_number_of_axes is 7, and 
axis_lengths.size() is 0, but those quantities must be equal.

while Wendy reports these:

Trying to index database with key 0: e_number_of_axes is 7, and 
axis_lengths.size() is 0, but those quantities must be equal.

Trying to index database with key 0: e_number_of_axes is 7, and 
axis_lengths.size() is 4026793984, but those quantities must be equal.

Trying to index database with key 0: e_number_of_axes is 7, and 
axis_lengths.size() is 4026793984, but those quantities must be equal.

Now, that's the only diagnostic we're seeing. It seems likely
that the problem arises from invalid objects. I don't yet know
where they're coming from.
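The parenthetical remark above about default initialization can be checked with a small stand-in class. The class sketch_value is hypothetical and is not TDBValue: one ctor omits the vector from its initializer list, the other names it explicitly, and both produce an empty vector.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical class used only to illustrate member initialization.
class sketch_value
{
  public:
    // axis_lengths_ is not mentioned here, so it is default-constructed.
    sketch_value() : key_(0) {}

    // axis_lengths_ is initialized explicitly; the result is the same.
    explicit sketch_value(int key) : key_(key), axis_lengths_() {}

    int key() const {return key_;}
    std::size_t axis_count() const {return axis_lengths_.size();}

  private:
    int              key_;
    std::vector<int> axis_lengths_;
};
```

Either way the vector starts with size zero, so a ctor that actually ran could not by itself leave behind a length like 4281842414. Built-in members such as the int above are different: they do need an explicit initializer.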

Evgeniy, do you have any insight into the cause? Making this
change to HEAD (database_view.cpp,v 1.12):

-    table_adapter().SetTDBValue(document().GetTDBValue(index));

     bool is_topic = tree.GetChildrenCount(event.GetItem());
+if(!is_topic)
+    table_adapter().SetTDBValue(document().GetTDBValue(index));
+else
+    table_adapter().SetTDBValue(NULL);

seems to prevent the problem, but why?
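If the adapter can now hold a null pointer for topic (non-leaf) nodes, its accessors presumably need to tolerate that. The following is a sketch of such a guard, not the actual DatabaseTableAdapter: sketch_entity and sketch_adapter are hypothetical, and std::shared_ptr is substituted here for the boost::shared_ptr that appears in the backtrace.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the object that the adapter displays.
struct sketch_entity
{
    std::vector<double> data;
};

// Hypothetical adapter that refuses to dereference an empty pointer.
class sketch_adapter
{
  public:
    // An empty pointer means "no leaf item is selected".
    void set_entity(std::shared_ptr<sketch_entity> const& p) {entity_ = p;}

    bool has_entity() const {return static_cast<bool>(entity_);}

    double value_at(std::size_t i) const
    {
        if(!entity_)
            {
            throw std::logic_error("No database entity is selected.");
            }
        return entity_->data.at(i); // at() also checks the index.
    }

  private:
    std::shared_ptr<sketch_entity> entity_;
};
```

A grid could call has_entity() before asking for cell values and show nothing for topic nodes; the exception then signals only genuine misuse. This illustrates the guard only and does not explain where the invalid objects came from.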



