
[lwip-users] 1.4 rc1 non-blocking issues


From: Yoav Nissim
Subject: [lwip-users] 1.4 rc1 non-blocking issues
Date: Tue, 30 Nov 2010 18:15:23 +0200


Hi all.


We are in the process of testing 1.4 rc1 with sockets in non-blocking
mode and have been experiencing some problems.

Looking at the code we've found some issues which we would like to raise
here and hopefully get some feedback on.


1. ERR_WOULDBLOCK is treated as a FATAL error. It seems that someone
forgot to update the ERR_IS_FATAL macro when the error code was added. A
non-blocking operation that sets the conn error to WOULDBLOCK (e.g.
send() or recv()) renders the socket unusable. Our workaround was to
use ERR_WOULDBLOCK in the ERR_IS_FATAL macro instead of ERR_VAL.
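
To make point 1 concrete, here is a minimal sketch of the threshold
problem. The DEMO_ names, the numeric values and the exact form of the
comparison are assumptions made for illustration; they are not copied
from lwIP's err.h.

```c
/* Illustrative error codes. Names and values are assumptions for this
 * sketch, not lwIP's definitions. More severe codes are more negative. */
enum {
    DEMO_ERR_OK         =  0,
    DEMO_ERR_INPROGRESS = -5,
    DEMO_ERR_VAL        = -6,
    DEMO_ERR_WOULDBLOCK = -7,  /* newer code, placed below the old threshold */
    DEMO_ERR_ABRT       = -8
};

/* Assumed shape of the check as we found it: everything below ERR_VAL
 * counts as fatal, which sweeps in WOULDBLOCK. */
#define DEMO_IS_FATAL_OLD(e) ((e) < DEMO_ERR_VAL)

/* Our workaround: compare against WOULDBLOCK instead of ERR_VAL. */
#define DEMO_IS_FATAL_NEW(e) ((e) < DEMO_ERR_WOULDBLOCK)
```

With the old check a would-block result is classified as fatal; with the
workaround it is not, while an abort still is.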

2. As far as we know, EMSGSIZE is not a valid return code for send() on
a STREAM socket. netconn_write does not return the number of bytes
processed and cannot perform partial sends. As a result, an application
that uses select runs in a tight loop: select reports the socket as
writable, but send (working on an all-or-nothing assumption) returns an
error (EWOULDBLOCK).

3. connect has several problems:

     a. connect sets sock->err to EINPROGRESS. When select returns
writable, getsockopt(SO_ERROR) never tells us what happened (i.e. there
is no access to conn->err), since getsockopt(SO_ERROR) does not return
the error value when sock->err is not 0 (it is set to EINPROGRESS). It
seems to me that the non-blocking path does not propagate the connect
result to sock->err (which does happen when using a blocking call).

     b. getsockopt(SO_ERROR): according to POSIX, it should return and
clear the _pending_ error for the socket (if one exists). Instead,
getsockopt returns the last socket call error once. If additional calls
are made, netconn's last error is returned repeatedly.

     c. If connect is called again while a previous non-blocking connect
is being processed, ERR_ISCONN is assigned to conn->err (which, by the
way, translates to an errno of -1). Now, if the connection succeeds,
do_connected will not be able to set conn->err to ERR_OK, since it
checks for ERR_INPROGRESS. To make things worse, ERR_ISCONN is treated
as a FATAL error and will therefore render the socket unusable.
According to POSIX, EALREADY should be returned while a connect is in
progress, and EISCONN should be returned when a socket is connected.
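
The behaviour we would expect for 3a to 3c can be summarised with a
small model. This is a hypothetical sketch, not lwIP code: the
demo_sock struct and the demo_ functions are invented for illustration,
and each function returns an errno value directly instead of setting
errno.

```c
#include <errno.h>

/* Hypothetical model of one socket's connect state; not lwIP code. */
enum conn_state { ST_IDLE, ST_CONNECTING, ST_CONNECTED };

struct demo_sock {
    enum conn_state state;
    int pending_err;        /* asynchronous result waiting for SO_ERROR */
};

/* What a non-blocking connect() call should report.
 * A repeated call must not disturb the attempt already in progress. */
static int demo_connect(struct demo_sock *s)
{
    switch (s->state) {
    case ST_CONNECTING:
        return EALREADY;    /* previous attempt still running */
    case ST_CONNECTED:
        return EISCONN;     /* already connected */
    default:
        s->state = ST_CONNECTING;
        s->pending_err = 0;
        return EINPROGRESS;
    }
}

/* Called when the attempt finishes; err == 0 means success.
 * The result is always recorded, whatever calls were made meanwhile. */
static void demo_connect_done(struct demo_sock *s, int err)
{
    s->state = (err == 0) ? ST_CONNECTED : ST_IDLE;
    s->pending_err = err;
}

/* getsockopt(SO_ERROR): return the pending error and clear it. */
static int demo_so_error(struct demo_sock *s)
{
    int err = s->pending_err;
    s->pending_err = 0;
    return err;
}
```

In this model a second connect during the handshake yields EALREADY
without touching the stored state, so the completion can still record
the outcome, and SO_ERROR reports that outcome exactly once.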

4. lwip_select seems to be susceptible to race conditions; it has
triggered many ASSERTs and has also crashed.

     a. Closing a socket on which select is waiting will ASSERT when
select wakes up to find that tryget_socket returns NULL. The ASSERT
statements seem to indicate that this could never happen. Are we missing
something? This definitely crashes on our setup. Our workaround for this
is to ignore the missing socket and continue the for loop.

     b. Closing a socket should wake lwip_select. If I am not mistaken,
the socket should then be placed in the exceptset, but I'm not sure.
Simon, Kieran, are there plans to implement this sometime in the future?

     c. Considering (b) above, when a socket is closed and another is
created while select is asleep, alloc_socket() tends to create it at the
same index. The result is a zeroed-out select_waiting counter on a brand
new socket. When select wakes up, it does not know the socket has been
replaced; it will decrement the counter and ASSERT since select_waiting
is now negative. Our workaround for this is to zero out the
select_waiting member when closing a socket and to add a condition
before decrementing the counter. Alternatively, alloc_socket() can be
modified to avoid allocating sockets where select_waiting is not 0, or
to allocate sockets using an incrementing index.

     d. Another crash has been occurring which we still do not fully
understand. It seems to happen when select wakes up while event_callback
is being processed - specifically when select_cb is removed from the
list exactly after event_callback leaves its critical section and
reenters it.
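
To illustrate 4a and 4c, here is a small model of the alternative
mentioned in (c): the allocator skips any slot that still has a select
waiter, and the wake-up path tolerates a socket that was closed in the
meantime. The demo_ names and the slot layout are assumptions for this
sketch, not lwIP's structures, and real code would need to run these
steps inside the appropriate critical sections.

```c
#define DEMO_NUM_SOCKETS 4

/* Hypothetical socket slot; the field name select_waiting echoes the
 * discussion above, the rest is invented for this sketch. */
struct demo_slot {
    int in_use;
    int select_waiting;     /* number of select calls sleeping on the slot */
};

static struct demo_slot slots[DEMO_NUM_SOCKETS];

/* Allocate the first free slot that no select call is still waiting on.
 * Returns the slot index, or -1 if none is available. */
static int demo_alloc(void)
{
    for (int i = 0; i < DEMO_NUM_SOCKETS; i++) {
        if (!slots[i].in_use && slots[i].select_waiting == 0) {
            slots[i].in_use = 1;
            return i;
        }
    }
    return -1;
}

/* Close leaves select_waiting alone so the slot stays reserved
 * until every sleeper has woken up. */
static void demo_close(int i)
{
    slots[i].in_use = 0;
}

/* select goes to sleep on slot i. */
static void demo_select_enter(int i)
{
    slots[i].select_waiting++;
}

/* select wakes up. The decrement is guarded so the counter can never
 * go negative. Returns 1 if the socket is still open, 0 if it was
 * closed while we slept (the caller just skips it instead of asserting). */
static int demo_select_leave(int i)
{
    if (slots[i].select_waiting > 0)
        slots[i].select_waiting--;
    return slots[i].in_use;
}
```

Because a closed slot is not reused while a waiter remains, the waking
select always decrements the counter it incremented, never one that
belongs to a newer socket.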


We know this is a handful, but would be grateful for any feedback on
these issues.
We will post any other findings if and when they become available.

Thanks,
Yoav, Tal & Aviad.



