[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[lwip-devel] [bug #46321] Synchronization bug around lwip_select() and t
From: |
Mikhail Pankov |
Subject: |
[lwip-devel] [bug #46321] Synchronization bug around lwip_select() and tcpip_thread() with thread-local semaphores |
Date: |
Thu, 29 Oct 2015 10:07:15 +0000 |
User-agent: |
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0 |
URL:
<http://savannah.nongnu.org/bugs/?46321>
Summary: Synchronization bug around lwip_select() and
tcpip_thread() with thread-local semaphores
Project: lwIP - A Lightweight TCP/IP stack
Submitted by: pankov_m
Submitted on: Thu 29 Oct 2015 10:07:14 AM GMT
Category: sockets/netconn
Severity: 3 - Normal
Item Group: Crash Error
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Any
Planned Release:
lwIP version: git head
_______________________________________________________
Details:
We've hit a pretty obscure synchronization bug in LwIP.
The summary follows.
lwip_select() may leave thread-local semaphore in inconsistent state if new
packets arrive between the 2 calls to lwip_selscan() in the function. Thread
switch may happen in the meantime and semaphore will be signaled before it's
being waited on. Moreover, it will never be waited on as second lwip_selscan()
will detect new packets and choose not to wait on semaphore at all. This same
semaphore is then used in many other places, for example, in tcpip_apimsg.
That causes corruption of stack-allocated API message. This happens because
the user thread calling the LwIP thread that does tcpip_thread() should wait
on that same thread-local semaphore, but won't, as it's already signaled.
The observed behavior is with LWIP_NETCONN_SEM_PER_THREAD defined.
We have actually observed the crash resulting from described behavior. It
depends on scheduling of threads a lot, but it did happen after hours of work
in multi-threaded environment. When I write "we switch from thread 1 to thread
2" below, I mean it happens _sometimes_, very rarely actually.
Please bear with me as detailed description is non-trivial.
Let's look at lwip_select:
#if LWIP_NETCONN_SEM_PER_THREAD
select_cb.sem = LWIP_NETCONN_THREAD_SEM_GET();
#else /* LWIP_NETCONN_SEM_PER_THREAD */
if (sys_sem_new(&select_cb.sem, 0) != ERR_OK) {
/* failed to create semaphore */
set_errno(ENOMEM);
return -1;
}
#endif /* LWIP_NETCONN_SEM_PER_THREAD */
...
#if !LWIP_NETCONN_SEM_PER_THREAD
sys_sem_free(&select_cb.sem);
#endif /* LWIP_NETCONN_SEM_PER_THREAD */
With LWIP_NETCONN_SEM_PER_THREAD defined, we use thread-local semaphores.
Now, consider a situation. We performed the selscan in the beginning of the
function:
nready = lwip_selscan(maxfdp1, readset, writeset, exceptset, &lreadset,
&lwriteset, &lexceptset);
and added ourselves to the list of waiting threads.
Then, while we are checking the validity of FDSETs
for (i = LWIP_SOCKET_OFFSET; i < maxfdp1; i++) {
...
we receieve the packet and that causes a thread switch to the thread doing
event_callback().
We are now on the list, and the packet is for the socket that's being waited
on by user's select(). We go ahead and signal the semaphore:
sys_sem_signal(SELECT_SEM_PTR(scb->sem));
Then we switch back to thread in lwip_select(), that had nothing ready (thus
nready == 0):
if (nready >= 0) {
/* Call lwip_selscan again: there could have been events between
the last scan (without us on the list) and putting us on the list!
*/
nready = lwip_selscan(maxfdp1, readset, writeset, exceptset, &lreadset,
&lwriteset, &lexceptset);
We do lwip_selscan again and discover we have packets for our socket that
arrived after the previous scan and before this one. Our semaphore is already
signalled!
Now we get nready > 0 and don't go in this block:
if (!nready) {
...
waitres = sys_arch_sem_wait(SELECT_SEM_PTR(select_cb.sem),
msectimeout);
}
So we don't wait on the semaphore in this select() call and leave. The
semaphore, being thread-local, is left signaled. When we enter any other
function from user thread, that has to wait on thread-local semaphore, it's
already signaled, and thread continues immediately, causing havoc.
In our particular example, the next function user thread happened to enter was
tcpip_apimsg().
err_t
tcpip_apimsg(struct api_msg *apimsg)
{
TCPIP_MSG_VAR_DECLARE(msg);
#ifdef LWIP_DEBUG
/* catch functions that don't set err */
apimsg->msg.err = ERR_VAL;
#endif
if (sys_mbox_valid_val(mbox)) {
TCPIP_MSG_VAR_ALLOC(msg);
TCPIP_MSG_VAR_REF(msg).type = TCPIP_MSG_API;
TCPIP_MSG_VAR_REF(msg).msg.apimsg = apimsg;
#if LWIP_NETCONN_SEM_PER_THREAD
apimsg->msg.op_completed_sem = LWIP_NETCONN_THREAD_SEM_GET();
LWIP_ASSERT("netconn semaphore not initialized",
apimsg->msg.op_completed_sem != SYS_SEM_NULL);
#endif
sys_mbox_post(&mbox, &TCPIP_MSG_VAR_REF(msg));
sys_arch_sem_wait(LWIP_API_MSG_SEM(&apimsg->msg), 0);
TCPIP_MSG_VAR_FREE(msg);
return apimsg->msg.err;
}
return ERR_VAL;
}
As you can see, we fetch thread-local semaphore
apimsg->msg.op_completed_sem = LWIP_NETCONN_THREAD_SEM_GET();
that is already signaled, as you remember.
Then the thread posts the mbox and, instead of waiting, immediately continues
and exits the function, thus trashing stack-allocated msg.
The thread that processes the mbox tries to dispatch the message
static void
tcpip_thread(void *arg)
{
...
switch (msg->type) {
...
and falls through as type is garbage
default:
LWIP_DEBUGF(TCPIP_DEBUG, ("tcpip_thread: invalid message: %d\n",
msg->type));
LWIP_ASSERT("tcpip_thread: invalid message", 0);
break;
}
So, this sequence of events leads to hard crash of LwIP due to assert.
We have worked around this problem by not using thread-local semaphore in
lwip_select() and using usual function-local stack semaphore instead. See the
below patch (it's also attached):
diff --git a/contrib/src/api/sockets.c b/contrib/src/api/sockets.c
index 165fbba..f595fad 100644
--- a/contrib/src/api/sockets.c
+++ b/contrib/src/api/sockets.c
@@ -211,13 +211,13 @@ struct lwip_sock {
SELWAIT_T select_waiting;
};
-#if LWIP_NETCONN_SEM_PER_THREAD
-#define SELECT_SEM_T sys_sem_t*
-#define SELECT_SEM_PTR(sem) (sem)
-#else /* LWIP_NETCONN_SEM_PER_THREAD */
#define SELECT_SEM_T sys_sem_t
#define SELECT_SEM_PTR(sem) (&(sem))
-#endif /* LWIP_NETCONN_SEM_PER_THREAD */
/** Description for a task waiting in select */
struct lwip_select_cb {
@@ -1226,15 +1226,15 @@ lwip_select(int maxfdp1, fd_set *readset, fd_set
*writeset, fd_set *exceptset,
select_cb.writeset = writeset;
select_cb.exceptset = exceptset;
select_cb.sem_signalled = 0;
-#if LWIP_NETCONN_SEM_PER_THREAD
- select_cb.sem = LWIP_NETCONN_THREAD_SEM_GET();
-#else /* LWIP_NETCONN_SEM_PER_THREAD */
if (sys_sem_new(&select_cb.sem, 0) != ERR_OK) {
/* failed to create semaphore */
set_errno(ENOMEM);
return -1;
}
-#endif /* LWIP_NETCONN_SEM_PER_THREAD */
/* Protect the select_cb_list */
SYS_ARCH_PROTECT(lev);
@@ -1334,9 +1334,9 @@ lwip_select(int maxfdp1, fd_set *readset, fd_set
*writeset, fd_set *exceptset,
select_cb_ctr++;
SYS_ARCH_UNPROTECT(lev);
-#if !LWIP_NETCONN_SEM_PER_THREAD
sys_sem_free(&select_cb.sem);
-#endif /* LWIP_NETCONN_SEM_PER_THREAD */
if (nready < 0) {
/* This happens when a socket got closed while waiting */
With these changes, LwIP works for us. Whether this is a proper fix is still
questionable, however.
_______________________________________________________
File Attachments:
-------------------------------------------------------
Date: Thu 29 Oct 2015 10:07:14 AM GMT Name: lwip-assert-tls-sem.patch Size:
3kB By: pankov_m
patch to work around the problem
<http://savannah.nongnu.org/bugs/download.php?file_id=35329>
_______________________________________________________
Reply to this item at:
<http://savannah.nongnu.org/bugs/?46321>
_______________________________________________
Message sent via/by Savannah
http://savannah.nongnu.org/
- [lwip-devel] [bug #46321] Synchronization bug around lwip_select() and tcpip_thread() with thread-local semaphores,
Mikhail Pankov <=