Re: [Chicken-users] let-location
Tue, 13 Jun 2006 18:35:03 +0200
On 6/12/06, Kon Lovett <address@hidden> wrote:
There does seem to be a problem here. I can't see anything obvious in
the compiler C output, but then I have trouble at the best of times
deciphering it ;-).
I have attached a simple variant of Nico's example. Just uncommenting
one of the 'print' expressions will "remove" the problem. I am
guessing a minor gc is invoked due to the 'print'.
Oh boy, that was a tough one. I will push a fix to darcs/subversion in the next
few minutes. It turned out to be a GC/locative-handling bug. If you have a certain
perverse fascination for hard-core debugging war stories, read on...
So it seems that this was GC-related: in some situations apparently
pointers to locations or the locations
themselves were moved around during GC, with stale pointers pointing
to dead or invalid storage.
First I set a breakpoint in the C function that is the compiled
version of `##sys#signal-hook' (if you're interested -- it's
library.c:f_12708), in the hope
of going through the backtrace and finding some bug in the
location-setup code. My first idea was that the
code was wrong and that pointers to locations were passed around but
not saved properly in case GC
kicks in. But the code looks fine and the test program is small. All
locations and locatives are created
correctly and passed to continuations and foreign stubs properly.
If the code is correct, the runtime system must be broken. Next I
reviewed the locative-creation code
and the code that updates locatives after GC. Panic rises: I hadn't
looked at this code for years, I had
absolutely no idea how it works. It is executed during gc, in 3
different modes, in a situation
where the heap is in an inconsistent state, and updates locative
objects by adjusting pointer-values to
data that moved in the heap, or from the stack into the heap.
To get around looking at this code too closely, I saved the pointers
passed to the C function (the one
that modified the locations) in a global array to look at the
pointed-at storage: perhaps the bytevector
that holds the actual bytes of a location hasn't been moved. Running
the code again, breaking in
`##sys#signal-hook', moving upward in the call-chain to the point
where the results are checked and
comparing the saved pointers in the global array with the pointers
that are consulted during the
result-check (and which have been properly moved during GC) showed
that one pointer indeed was wrong:
it pointed to a stale instance of the original `long'-type location.
So the locative was not properly
updated: the pointed-at value moved during GC, and the pointer in the
locative, which should have been updated
to point to the new location (+ offset), was not adjusted.
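
The pointer-recording trick described above can be sketched roughly as follows. This is only an illustration, not the code used in the actual debugging session: `debug_seen`, `debug_record`, `set_values` and `debug_first_mismatch` are all hypothetical names, and `set_values` merely stands in for the C function that modified the locations.

```c
#include <stddef.h>

#define DEBUG_SLOTS 8

/* Hypothetical debugging aid: raw addresses seen by the foreign function. */
static void *debug_seen[DEBUG_SLOTS];
static size_t debug_seen_count = 0;

/* Remember one address; silently ignore anything beyond DEBUG_SLOTS. */
static void debug_record(void *p)
{
    if (debug_seen_count < DEBUG_SLOTS)
        debug_seen[debug_seen_count++] = p;
}

/* Stand-in (assumption) for the C function that writes through the locations. */
static void set_values(long *a, long *b)
{
    debug_record(a);
    debug_record(b);
    *a = 1;
    *b = 2;
}

/* Compare the recorded addresses with the addresses consulted later by the
   result check. Returns the index of the first mismatch, or -1 if none. */
static int debug_first_mismatch(void **checked, size_t n)
{
    for (size_t i = 0; i < n && i < debug_seen_count; i++)
        if (debug_seen[i] != checked[i])
            return (int)i;
    return -1;
}
```

A mismatch at some index means the function wrote through an address that the later check no longer uses, which is the symptom described above.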
Massive head-scratching and no idea how to go on.
Systematic assessment of the facts and symptoms didn't help at all.
Next day: poked around some more. By luck I saw that locative-table
entry #2600 held the locative that
later became invalid, so I checked the locative-update code again for
off-by-one errors, as #2600 was the
last entry in the table. But it was ok. (Well, I added a small
optimization during the bug-hunt, which
of course introduced the exact same off-by-one error I was looking
for...) But the original problem was
still unsolved, even though the locative table was fully processed.
Tried to set conditional breakpoints when locative-table entry #2600
is created or updated, but the program
then runs so slowly (and the bug appears after thousands of iterations
and 5 or 6 major GCs) that I would
have to wait too long. Miscellaneous attempts to reduce the number of
iterations or set the conditional
breakpoint at places that are not executed so often didn't help.
Then I set a watchpoint at locative-table entry #2600, enabled after 5
major GCs are run. It is still
relatively slow but it halts correctly at locative-creation and
locative-update. Continuing a few steps
(and repeating the whole thing multiple times) brought me to the point
where I understood the locative-
updating code well enough that I could pinpoint where #2600 is
updated: the code that checks the locative
for forwarding during GC (i.e. the move to the other semispace) and
the subsequent forwarding-check for
the pointed-at object.
In the end it was a "GC double forwarding" bug. Objects may be
forwarded twice during GC: once from
the stack to the fromspace (the used semispace) and, if fromspace is
full, a second time to tospace (the free semispace).
So the forwarding check (whether the object has its
C_GC_FORWARDING_BIT set) and the extraction of the
forwarding pointer to the new instance has to be done twice. I did
that for the locative itself, but not
for the pointed-at object. As a result the location to which the
locative pointed was stale (still
in the old fromspace), the pointer got passed to the C function and
the stale storage was modified.
The storage used for comparing the result was still live and pointed to
the proper location, but was
of course not modified.
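
To make the double-forwarding issue concrete, here is a minimal toy model in C. It is a sketch under stated assumptions, not the CHICKEN runtime code: the object layout, the use of the low header bit as the forwarding mark, and every `toy_`/`resolve_` name are invented for illustration (the real runtime uses C_GC_FORWARDING_BIT and its own header encoding).

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model only: the low bit of the header marks a forwarded object.
   This encoding is an assumption, not the real runtime's layout. */
#define TOY_FORWARDING_BIT ((uintptr_t)1)

typedef struct toy_object {
    uintptr_t header;   /* 0, or (address of new copy | TOY_FORWARDING_BIT) */
    long payload;
} toy_object;

/* Copy `old` into `new_copy` and leave a forwarding pointer behind. */
static void toy_forward(toy_object *old, toy_object *new_copy)
{
    new_copy->header = 0;
    new_copy->payload = old->payload;
    old->header = (uintptr_t)new_copy | TOY_FORWARDING_BIT;
}

static int toy_is_forwarded(const toy_object *o)
{
    return (o->header & TOY_FORWARDING_BIT) != 0;
}

static toy_object *toy_forwarding_target(const toy_object *o)
{
    return (toy_object *)(o->header & ~TOY_FORWARDING_BIT);
}

/* Buggy variant: follows at most one forwarding pointer, so an object
   moved stack -> fromspace -> tospace resolves to the stale fromspace copy. */
static toy_object *resolve_once(toy_object *o)
{
    if (toy_is_forwarded(o))
        o = toy_forwarding_target(o);
    return o;
}

/* Fixed variant: performs the forwarding check twice, once per possible move. */
static toy_object *resolve_twice(toy_object *o)
{
    if (toy_is_forwarded(o))
        o = toy_forwarding_target(o);
    if (toy_is_forwarded(o))
        o = toy_forwarding_target(o);
    return o;
}
```

If an object is forwarded from a stack slot to a fromspace copy and then again to a tospace copy, `resolve_once` on the stack slot yields the fromspace copy, while `resolve_twice` yields the tospace copy. That corresponds to the situation described above, where the check was done twice for the locative but only once for the pointed-at object.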