emacs-bug-tracker

From: GNU bug Tracking System
Subject: [debbugs-tracker] bug#22152: closed (fat_mutex owner corruption (fmoc) inside fat_mutex_unlock (guile-v2.0.11))
Date: Mon, 20 Jun 2016 20:06:02 +0000

Your message dated Mon, 20 Jun 2016 16:05:01 -0400
with message-id <address@hidden>
and subject line Re: bug#22152: fat_mutex owner corruption (fmoc) inside 
fat_mutex_unlock (guile-v2.0.11)
has caused the debbugs.gnu.org bug report #22152,
regarding fat_mutex owner corruption (fmoc) inside fat_mutex_unlock 
(guile-v2.0.11)
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
22152: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=22152
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: fat_mutex owner corruption (fmoc) inside fat_mutex_unlock (guile-v2.0.11)
Date: Sat, 12 Dec 2015 19:28:14 +0200

Hi

We sporadically get "mutex not locked" and "mutex not locked by current thread"
exceptions on Solaris 10u10 with guile-2.0.11.

The problem can be reproduced with the following Scheme scripts:



  guile-fmoc-test-mnl.scm (for "mutex not locked")
      Two threads: one (reader-000) waits on a condition variable that
      nothing will ever signal, and the other (writer-000) locks and unlocks
      the mutex used with the condition variable. This code raises the "mutex
      not locked" exception fairly consistently.

      Output for this is in guile-fmoc-test-mnl-problem_output.txt (referenced
      as mnl-problem_output.txt)


      * owner id for reader-000 is 12593872

      * owner id for writer-000 is 12595040



  guile-fmoc-test-mnlbct.scm (for "mutex not locked by current thread")
      Same as guile-fmoc-test-mnl.scm, except that here writer-000 signals the
      condition variable. This code causes both "mutex not locked" and "mutex
      not locked by current thread" errors.

      Output for this, showing "mutex not locked by current thread", is in
      guile-fmoc-test-mnlbct-problem_output.txt (referenced as
      mnlbct-problem_output.txt)


      * owner id for reader-000 is 14535648

      * owner id for writer-000 is 14536400



To track down this issue we added some debug printfs (see
guile-2.0.11-with-debug.patch). Because this changes the line numbers, I
reference original line numbers as o:file:line and line numbers with the patch
applied as d[:file]:line. These printfs also produced some irrelevant output
(for internal and verbose logging mutexes), which has been filtered out.

We have found various scenarios leading to these errors, but all are caused by
the same problem. A detailed analysis of the "mutex not locked by current
thread" scenario seen in mnlbct-problem_output.txt is included below; detailed
analysis of the other scenarios can be shared if required.

Scenario from mnlbct-problem_output.txt:


  1. [writer-000:14536400] unlocks fat_mutex[14512880] and queues
     reader-000:14535648

     at o:threads.c:1664 - d:1681

     before mnlbct-problem_output.txt:4079

  2. [reader-000:14535648] locks fat_mutex[14512880] with fat_mutex_lock

     at o:threads.c:1394 - d:1401

     before mnlbct-problem_output.txt:4080

  3. [reader-000:14535648] enters wait-condition-variable and changes
     fat_mutex[14512880].owner to writer-000:14536400

     at o:threads.c:1616 - d:1631

     before mnlbct-problem_output.txt:4083

  4. [reader-000:14535648] goes into block_self and starts waiting on the
     pthread condition variable inside fat_mutex_unlock > block_self. This
     unlocks fat_mutex[14512880].mutex, allowing some other thread to lock
     fat_mutex[14512880]

     at o:threads.c:452 - d:456

     before mnlbct-problem_output.txt:4083

  5. [writer-000:14536400] locks fat_mutex[14512880] with fat_mutex_lock

     at o:threads.c:1394 - d:1401

     before mnlbct-problem_output.txt:4082

  6. [reader-000:14535648] a spurious wake-up occurs on the condition variable,
     which causes block_self to return EINTR to fat_mutex_unlock

     at o:threads.c:1621 - d:1636

     before mnlbct-problem_output.txt:4084

  7. [reader-000:14535648] loops and sets fat_mutex[14512880].owner=4 (i.e. not
     locked) while writer-000:14536400 should still be the owner of
     fat_mutex[14512880]. Since it was a spurious wake-up, reader-000:14535648
     resumes waiting for the condition to be notified.

     at o:threads.c:1616 - d:1631

     before mnlbct-problem_output.txt:4086

  8. [writer-000:14536400] completes signal-condition-variable

     before mnlbct-problem_output.txt:4088

  9. [reader-000:14535648] now gets the actual notification and block_self
     returns 0. This causes fat_mutex[14512880] to be locked again, which
     works because fat_mutex[14512880].owner is 4. This changes
     fat_mutex[14512880].owner from 4 to 14535648.

     at o:threads.c:1643 - d:1660

     before mnlbct-problem_output.txt:4086

 10. [writer-000:14536400] tries to unlock the mutex. This fails because
     reader-000:14535648 now owns the mutex, resulting in a "mutex not locked
     by current thread" exception.

     at o:threads.c:1599 - d:1614

     before mnlbct-problem_output.txt:4089


Briefly, for mnl-problem_output.txt:


  1. reader-000 locks fat_mutex and unlocks it again as it starts waiting for
     the condition to be notified.

  2. writer-000 locks the mutex.

  3. A spurious wake-up occurs for reader-000, which causes the reader to
     change fat_mutex.owner from the writer-000 id to 4 and then resume
     waiting on the condition variable.

  4. The writer tries to unlock fat_mutex, but the owner is now 4, which
     results in a "mutex not locked" exception.
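
The two scenarios above can be replayed on a small single-threaded model. This
is only a sketch under stated assumptions: model_mutex, model_lock,
model_wait_loop_pass and the status names are hypothetical stand-ins invented
for illustration, only the owner field is modelled, and none of this is guile's
actual code. The owner ids and the value 4 for "not locked" are taken from the
report.

```c
/* The report's debug output shows owner == 4 for an unlocked mutex. */
#define MODEL_UNLOCKED 4L

typedef enum {
    UNLOCK_OK,
    ERR_MUTEX_NOT_LOCKED,             /* "mutex not locked" */
    ERR_NOT_LOCKED_BY_CURRENT_THREAD, /* "mutex not locked by current thread" */
    REPLAY_DIVERGED                   /* the model could not follow the trace */
} unlock_status;

/* Hypothetical stand-in for fat_mutex: only the owner field is modelled. */
typedef struct { long owner; } model_mutex;

/* Ownership check made when a thread tries to unlock. */
static unlock_status model_check_unlock(const model_mutex *m, long caller)
{
    if (m->owner == caller)
        return UNLOCK_OK;
    if (m->owner == MODEL_UNLOCKED)
        return ERR_MUTEX_NOT_LOCKED;
    return ERR_NOT_LOCKED_BY_CURRENT_THREAD;
}

/* Locking succeeds if the mutex is free or was already handed to the caller. */
static int model_lock(model_mutex *m, long caller)
{
    if (m->owner != MODEL_UNLOCKED && m->owner != caller)
        return 0;
    m->owner = caller;
    return 1;
}

/* One pass of the wait loop as the report describes it: the owner is
   reassigned on every pass, to the next queued thread or to "unlocked". */
static void model_wait_loop_pass(model_mutex *m, long next_queued)
{
    m->owner = next_queued;
}

/* Scenario from mnlbct-problem_output.txt (step numbers from the list). */
static unlock_status replay_mnlbct(void)
{
    const long reader = 14535648L, writer = 14536400L;
    model_mutex m = { MODEL_UNLOCKED };

    if (!model_lock(&m, reader)) return REPLAY_DIVERGED; /* step 2 */
    model_wait_loop_pass(&m, writer);                    /* step 3 */
    if (!model_lock(&m, writer)) return REPLAY_DIVERGED; /* step 5 */
    model_wait_loop_pass(&m, MODEL_UNLOCKED);            /* steps 6-7 */
    if (!model_lock(&m, reader)) return REPLAY_DIVERGED; /* step 9 */
    return model_check_unlock(&m, writer);               /* step 10 */
}

/* Scenario from mnl-problem_output.txt. */
static unlock_status replay_mnl(void)
{
    const long reader = 12593872L, writer = 12595040L;
    model_mutex m = { MODEL_UNLOCKED };

    if (!model_lock(&m, reader)) return REPLAY_DIVERGED; /* step 1: lock... */
    model_wait_loop_pass(&m, MODEL_UNLOCKED);            /* ...release, wait */
    if (!model_lock(&m, writer)) return REPLAY_DIVERGED; /* step 2 */
    model_wait_loop_pass(&m, MODEL_UNLOCKED);            /* step 3: spurious */
    return model_check_unlock(&m, writer);               /* step 4 */
}
```

In this model replay_mnlbct() ends with the writer being told the mutex is
locked by another thread, and replay_mnl() ends with the writer being told the
mutex is not locked at all, matching the two exceptions in the report.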


The cause of these problems seems to be that fat_mutex_unlock changes
fat_mutex.owner inside the while loop that is intended for checking the
condition variable predicate, which is problematic if spurious wake-ups from
pthread_cond_wait occur. Spurious wake-ups from pthread_cond_wait seem to be
less common on Linux, which is why we have only observed the issue on Solaris.
It does, however, look like this problem will occur on any platform whenever a
spurious wake-up occurs.
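
For reference, the usual way to tolerate spurious wake-ups with POSIX condition
variables is a loop that does nothing except re-check a predicate. The sketch
below is a generic illustration and is not taken from guile: the gate type and
all function names are hypothetical, and a broadcast sent before the predicate
is set stands in for a spurious wake-up. It may need to be built with -pthread.

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            ready;   /* the predicate the waiter cares about */
    bool            passed;  /* set by the waiter once it gets through */
} gate;

/* Block until g->ready is true, tolerating any number of early wake-ups. */
static void gate_wait(gate *g)
{
    pthread_mutex_lock(&g->lock);
    /* Bookkeeping that must happen exactly once belongs here, before the
       loop. The loop body only waits and re-checks the predicate. */
    while (!g->ready)
        pthread_cond_wait(&g->cond, &g->lock);
    g->passed = g->ready;
    pthread_mutex_unlock(&g->lock);
}

/* Set the predicate and wake the waiter. */
static void gate_open(gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->ready = true;
    pthread_cond_signal(&g->cond);
    pthread_mutex_unlock(&g->lock);
}

static void *waiter_main(void *arg)
{
    gate_wait((gate *)arg);
    return NULL;
}

/* Returns 1 if the waiter got through with the predicate true, 0 if it
   did not, and -1 if the waiter thread could not be created. */
static int run_gate_demo(void)
{
    gate g;
    pthread_t waiter;
    int result;

    pthread_mutex_init(&g.lock, NULL);
    pthread_cond_init(&g.cond, NULL);
    g.ready = false;
    g.passed = false;

    if (pthread_create(&waiter, NULL, waiter_main, &g) != 0) {
        pthread_cond_destroy(&g.cond);
        pthread_mutex_destroy(&g.lock);
        return -1;
    }

    /* Wake-up without the predicate being set: must not release the waiter. */
    pthread_cond_broadcast(&g.cond);
    gate_open(&g);
    pthread_join(waiter, NULL);

    result = g.passed ? 1 : 0;
    pthread_cond_destroy(&g.cond);
    pthread_mutex_destroy(&g.lock);
    return result;
}
```

Because the waiter checks the predicate while holding the lock, the result is
the same whether the early broadcast arrives before or after the waiter blocks.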

As far as we can tell there is no reason for the fat_mutex.owner assignment to
happen inside the loop. It seems more appropriate for it to happen only once,
before the loop. To this end we moved the owner reassignment out of the loop,
and this seems to have resolved our issues. The patch for this is in
guile-2.0.11-with-fmoc_fix.patch.
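
The effect of moving the reassignment can be illustrated with a small
deterministic model that compares the two loop shapes. This is a sketch under
assumptions, not the contents of guile-2.0.11-with-fmoc_fix.patch: the
loop_shape enum, owner_while_writer_holds and the READER and WRITER ids are
hypothetical, and the model assumes the writer takes the mutex while the
reader is blocked for the first time.

```c
/* UNLOCKED = 4 follows the report; READER and WRITER are arbitrary ids. */
enum { UNLOCKED = 4, READER = 100, WRITER = 200 };

typedef enum {
    RELEASE_INSIDE_LOOP,      /* shape described as problematic in the report */
    RELEASE_ONCE_BEFORE_LOOP  /* shape proposed in the report */
} loop_shape;

/* Returns the owner value the writer would find after the waiting reader
   has gone through the given number of spurious wake-ups, assuming the
   writer locked the mutex during the reader's first block and has not
   unlocked it yet. */
static long owner_while_writer_holds(loop_shape shape, int spurious_wakeups)
{
    long owner = READER;  /* the reader holds the mutex and starts to wait */
    int pass;

    if (shape == RELEASE_ONCE_BEFORE_LOOP)
        owner = UNLOCKED;               /* ownership given up exactly once */

    for (pass = 0; pass <= spurious_wakeups; pass++) {
        if (shape == RELEASE_INSIDE_LOOP)
            owner = UNLOCKED;           /* ownership reset on every pass */

        /* The reader blocks here. During the first block the writer
           locks the mutex. */
        if (pass == 0)
            owner = WRITER;

        /* Every pass except the last ends with a spurious wake-up,
           which sends the reader around the loop again. */
    }
    return owner;
}
```

With no spurious wake-ups both shapes leave the writer as owner. After one or
more spurious wake-ups, only the shape that releases once before the loop still
reports the writer as owner; the other reports 4.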

We have run the test suite with this patch on Linux and everything passes.
There are, however, other issues with the test suite on Solaris that prevent it
from completing (both with and without the patch); these need further
investigation.


* All files related to this can be found at
https://gitlab.com/concurrent-systems/osp-issues-1512/tree/master/guile-fmoc

* Source with guile-2.0.11-with-debug.patch can be found at
https://gitlab.com/concurrent-systems/guile/tree/v2.0.11-with-debug

* Source with guile-2.0.11-with-fmoc_fix.patch can be found at
https://gitlab.com/concurrent-systems/guile/tree/v2.0.11-with-fmoc_fix


Regards
-- 
Iwan Aucamp

Attachment: guile-2.0.11-with-debug.patch
Description: Text Data

Attachment: guile-2.0.11-with-fmoc_fix.patch
Description: Text Data

Attachment: guile-fmoc-test-mnl.scm
Description: Text Data

Attachment: guile-fmoc-test-mnlbct.scm
Description: Text Data

Attachment: guile-fmoc-test-mnlbct-problem_output.txt.bz2
Description: BZip2 compressed data

Attachment: guile-fmoc-test-mnl-problem_output.txt.bz2
Description: BZip2 compressed data


--- End Message ---
--- Begin Message ---
Subject: Re: bug#22152: fat_mutex owner corruption (fmoc) inside fat_mutex_unlock (guile-v2.0.11)
Date: Mon, 20 Jun 2016 16:05:01 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.95 (gnu/linux)

Mark H Weaver <address@hidden> writes:

> Iwan Aucamp <address@hidden> writes:
>> We sporadically get "mutex not locked" and "mutex not locked by current
>> thread" exceptions on Solaris 10u10 with guile-2.0.11.
>
> Thanks very much for your detailed analysis and proposed fix.
>
> I've attached a patch that hopefully fixes this bug and also refactors
> the code to hopefully be somewhat more clear.  Can you please test it on
> Solaris and verify that it works for your use cases?

I went ahead and pushed commit 1e86dc32a42af549fc9e4721ad48cdd7d296c042
to stable-2.0, which will soon become guile-2.0.12.  I hope it fixes the
issue, although unfortunately my patch was never tested on Solaris.  I'm
going to close this bug, but feel free to reopen it if there are still
issues.

     Thanks,
       Mark


--- End Message ---
