[lwip-devel] [bug #46321] Synchronization bug around lwip

lwip-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[lwip-devel] [bug #46321] Synchronization bug around lwip_select() and t

From:	Mikhail Pankov
Subject:	[lwip-devel] [bug #46321] Synchronization bug around lwip_select() and tcpip_thread() with thread-local semaphores
Date:	Thu, 29 Oct 2015 10:07:15 +0000
User-agent:	Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0

URL:
  <http://savannah.nongnu.org/bugs/?46321>

                 Summary: Synchronization bug around lwip_select() and
tcpip_thread() with thread-local semaphores
                 Project: lwIP - A Lightweight TCP/IP stack
            Submitted by: pankov_m
            Submitted on: Thu 29 Oct 2015 10:07:14 AM GMT
                Category: sockets/netconn
                Severity: 3 - Normal
              Item Group: Crash Error
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
         Planned Release: 
            lwIP version: git head

    _______________________________________________________

Details:

We've hit a pretty obscure synchronization bug in LwIP.

The summary follows.

lwip_select() may leave thread-local semaphore in inconsistent state if new
packets arrive between the 2 calls to lwip_selscan() in the function. Thread
switch may happen in the meantime and semaphore will be signaled before it's
being waited on. Moreover, it will never be waited on as second lwip_selscan()
will detect new packets and choose not to wait on semaphore at all. This same
semaphore is then used in many other places, for example, in tcpip_apimsg.
That causes corruption of stack-allocated API message. This happens because
the user thread calling the LwIP thread that does tcpip_thread() should wait
on that same thread-local semaphore, but won't, as it's already signaled.

The observed behavior is with LWIP_NETCONN_SEM_PER_THREAD defined.

We have actually observed the crash resulting from described behavior. It
depends on scheduling of threads a lot, but it did happen after hours of work
in multi-threaded environment. When I write "we switch from thread 1 to thread
2" below, I mean it happens _sometimes_, very rarely actually.

Please bear with me as detailed description is non-trivial.

Let's look at lwip_select:


#if LWIP_NETCONN_SEM_PER_THREAD
    select_cb.sem = LWIP_NETCONN_THREAD_SEM_GET();
#else /* LWIP_NETCONN_SEM_PER_THREAD */
    if (sys_sem_new(&select_cb.sem, 0) != ERR_OK) {
      /* failed to create semaphore */
      set_errno(ENOMEM);
      return -1;
    }
#endif /* LWIP_NETCONN_SEM_PER_THREAD */

...

#if !LWIP_NETCONN_SEM_PER_THREAD
    sys_sem_free(&select_cb.sem);
#endif /* LWIP_NETCONN_SEM_PER_THREAD */



With LWIP_NETCONN_SEM_PER_THREAD defined, we use thread-local semaphores.

Now, consider a situation. We performed the selscan in the beginning of the
function:


nready = lwip_selscan(maxfdp1, readset, writeset, exceptset, &lreadset,
&lwriteset, &lexceptset);


and added ourselves to the list of waiting threads.

Then, while we are checking the validity of FDSETs


    for (i = LWIP_SOCKET_OFFSET; i < maxfdp1; i++) {
    ...


we receieve the packet and that causes a thread switch to the thread doing
event_callback().

We are now on the list, and the packet is for the socket that's being waited
on by user's select(). We go ahead and signal the semaphore:


        sys_sem_signal(SELECT_SEM_PTR(scb->sem));


Then we switch back to thread in lwip_select(), that had nothing ready (thus
nready == 0):


    if (nready >= 0) {
      /* Call lwip_selscan again: there could have been events between
         the last scan (without us on the list) and putting us on the list!
*/
      nready = lwip_selscan(maxfdp1, readset, writeset, exceptset, &lreadset,
&lwriteset, &lexceptset);


We do lwip_selscan again and discover we have packets for our socket that
arrived after the previous scan and before this one. Our semaphore is already
signalled!

Now we get nready > 0 and don't go in this block:


      if (!nready) {
        ...

        waitres = sys_arch_sem_wait(SELECT_SEM_PTR(select_cb.sem),
msectimeout);
      }


So we don't wait on the semaphore in this select() call and leave. The
semaphore, being thread-local, is left signaled. When we enter any other
function from user thread, that has to wait on thread-local semaphore, it's
already signaled, and thread continues immediately, causing havoc.

In our particular example, the next function user thread happened to enter was
tcpip_apimsg().


err_t
tcpip_apimsg(struct api_msg *apimsg)
{
  TCPIP_MSG_VAR_DECLARE(msg);
#ifdef LWIP_DEBUG
  /* catch functions that don't set err */
  apimsg->msg.err = ERR_VAL;
#endif
 
  if (sys_mbox_valid_val(mbox)) {
    TCPIP_MSG_VAR_ALLOC(msg);
    TCPIP_MSG_VAR_REF(msg).type = TCPIP_MSG_API;
    TCPIP_MSG_VAR_REF(msg).msg.apimsg = apimsg;
#if LWIP_NETCONN_SEM_PER_THREAD
    apimsg->msg.op_completed_sem = LWIP_NETCONN_THREAD_SEM_GET();
    LWIP_ASSERT("netconn semaphore not initialized",
      apimsg->msg.op_completed_sem != SYS_SEM_NULL);
#endif
    sys_mbox_post(&mbox, &TCPIP_MSG_VAR_REF(msg));
    sys_arch_sem_wait(LWIP_API_MSG_SEM(&apimsg->msg), 0);
    TCPIP_MSG_VAR_FREE(msg);
    return apimsg->msg.err;
  }
  return ERR_VAL;
}


As you can see, we fetch thread-local semaphore


    apimsg->msg.op_completed_sem = LWIP_NETCONN_THREAD_SEM_GET();


that is already signaled, as you remember.

Then the thread posts the mbox and, instead of waiting, immediately continues
and exits the function, thus trashing stack-allocated msg.

The thread that processes the mbox tries to dispatch the message


static void
tcpip_thread(void *arg)
{
    ...
    switch (msg->type) {
       ...


and falls through as type is garbage


    default:
      LWIP_DEBUGF(TCPIP_DEBUG, ("tcpip_thread: invalid message: %d\n",
msg->type));
      LWIP_ASSERT("tcpip_thread: invalid message", 0);
      break;
    }


So, this sequence of events leads to hard crash of LwIP due to assert.

We have worked around this problem by not using thread-local semaphore in
lwip_select() and using usual function-local stack semaphore instead. See the
below patch (it's also attached):

diff --git a/contrib/src/api/sockets.c b/contrib/src/api/sockets.c
index 165fbba..f595fad 100644
--- a/contrib/src/api/sockets.c
+++ b/contrib/src/api/sockets.c
@@ -211,13 +211,13 @@ struct lwip_sock {
   SELWAIT_T select_waiting;
 };
 
-#if LWIP_NETCONN_SEM_PER_THREAD
-#define SELECT_SEM_T        sys_sem_t*
-#define SELECT_SEM_PTR(sem) (sem)
-#else /* LWIP_NETCONN_SEM_PER_THREAD */
 #define SELECT_SEM_T        sys_sem_t
 #define SELECT_SEM_PTR(sem) (&(sem))
-#endif /* LWIP_NETCONN_SEM_PER_THREAD */
 
 /** Description for a task waiting in select */
 struct lwip_select_cb {
@@ -1226,15 +1226,15 @@ lwip_select(int maxfdp1, fd_set *readset, fd_set
*writeset, fd_set *exceptset,
     select_cb.writeset = writeset;
     select_cb.exceptset = exceptset;
     select_cb.sem_signalled = 0;
-#if LWIP_NETCONN_SEM_PER_THREAD
-    select_cb.sem = LWIP_NETCONN_THREAD_SEM_GET();
-#else /* LWIP_NETCONN_SEM_PER_THREAD */
     if (sys_sem_new(&select_cb.sem, 0) != ERR_OK) {
       /* failed to create semaphore */
       set_errno(ENOMEM);
       return -1;
     }
-#endif /* LWIP_NETCONN_SEM_PER_THREAD */
 
     /* Protect the select_cb_list */
     SYS_ARCH_PROTECT(lev);
@@ -1334,9 +1334,9 @@ lwip_select(int maxfdp1, fd_set *readset, fd_set
*writeset, fd_set *exceptset,
     select_cb_ctr++;
     SYS_ARCH_UNPROTECT(lev);
 
-#if !LWIP_NETCONN_SEM_PER_THREAD
     sys_sem_free(&select_cb.sem);
-#endif /* LWIP_NETCONN_SEM_PER_THREAD */
 
     if (nready < 0) {
       /* This happens when a socket got closed while waiting */


With these changes, LwIP works for us. Whether this is a proper fix is still
questionable, however.



    _______________________________________________________

File Attachments:


-------------------------------------------------------
Date: Thu 29 Oct 2015 10:07:14 AM GMT  Name: lwip-assert-tls-sem.patch  Size:
3kB   By: pankov_m
patch to work around the problem
<http://savannah.nongnu.org/bugs/download.php?file_id=35329>

    _______________________________________________________

Reply to this item at:

  <http://savannah.nongnu.org/bugs/?46321>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.nongnu.org/

[Prev in Thread]

Current Thread

[Next in Thread]

[lwip-devel] [bug #46321] Synchronization bug around lwip_select() and tcpip_thread() with thread-local semaphores, Mikhail Pankov <=
- [lwip-devel] [bug #46321] Synchronization bug around lwip_select() and tcpip_thread() with thread-local semaphores, Simon Goldschmidt, 2015/10/29
  - [lwip-devel] [bug #46321] Synchronization bug around lwip_select() and tcpip_thread() with thread-local semaphores, Simon Goldschmidt, 2015/10/29
    - [lwip-devel] [bug #46321] Synchronization bug around lwip_select() and tcpip_thread() with thread-local semaphores, Joel Cunningham, 2015/10/29

Prev by Date: [lwip-devel] [task #13780] lcp_echo_interval maximum value
Next by Date: [lwip-devel] [PATCH] httpd_raw: fix possible NULL pointer dereference
Previous by thread: [lwip-devel] [bug #46290] Optimize passing contiguous buffers to tcp_write
Next by thread: [lwip-devel] [bug #46321] Synchronization bug around lwip_select() and tcpip_thread() with thread-local semaphores
Index(es):
- Date
- Thread