Re: [lwip-users] Potential bug in tcp retransmission handling causes dea

From:	Kieran Mansley
Subject:	Re: [lwip-users] Potential bug in tcp retransmission handling causes deadlock
Date:	Fri, 05 Sep 2008 17:12:03 +0100

On Fri, 2008-09-05 at 17:56 +0200, address@hidden wrote:
> Hi there
> 
> Kieran asked me to do further investigations considering the topic
> "Deadlocked tcp_retransmit due to exceeded pcb->cwnd" (see
> http://lists.gnu.org/archive/html/lwip-users/2008-07/msg00098.html).

Thanks for taking the time to produce such a detailed and helpful
analysis.

> 9. We get the third dupack for 10085. According to RFC2581 we shall
> start a fast retransmission now.
> 10. For fast retransmission, tcp_process() calls tcp_receive(), which
> calls tcp_rexmit(), which calls tcp_output().
> 11. Because tcp_output() was invoked from an initial tcp_input() call,
> it bails out on
>     if (tcp_input_pcb == pcb)
>     ==> !!! This violates RFC2581 IMHO !!!

Yes, that's a problem.  We'll need to fix that somehow.

> 12. But tcp_rexmit() has already tinkered with our queues by moving
> the first unacked segment to the
>     unsent queue.
>     unsent->10085
>     unacked->11453->12818->14183
> 13. The next few output attempts bail out in tcp_output() due to the
> Nagle algorithm
>     (tcp_do_output_nagle()). Thus nothing more happens until a
> retransmission timeout occurs.
> 14. tcp_slowtmr() requires a retransmission (pcb->rtime >= pcb->rto).
> This shrinks the
>     congestion window to the maximum segment size (1390 in my case).
>     BTW: The retransmission is triggered by segment 14183 and not by
> 10085 in this case,
>     which is an aftereffect of the underlying bug IMHO.
> 15. tcp_slowtmr() calls tcp_rexmit_rto(). The rto function moves all
> unacked segments to the head
>     of the unsent queue. This is the final step causing the deadlock in
> tcp_output(), because the
>     smallest sequence number is now at the end of the queue.
>     unsent->11453->12818->14183->10085

And that looks to be the fundamental cause of this bug.  
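
To make the queue states in steps 12 and 15 easier to follow, here is a
toy model of the two moves as described above. It is only a sketch: the
types and names (toy_seg, toy_pcb, model_rexmit, model_rexmit_rto,
model_scenario) are made up for illustration and are not the lwIP
structures or functions, and it models the queue manipulation only, not
timers, windows or actual transmission.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for illustration only; not lwIP's tcp_seg / tcp_pcb. */
struct toy_seg {
    uint32_t seqno;
    struct toy_seg *next;
};

struct toy_pcb {
    struct toy_seg *unsent;
    struct toy_seg *unacked;
};

/* Step 12: move the first unacked segment to the head of unsent. */
void model_rexmit(struct toy_pcb *pcb)
{
    struct toy_seg *seg = pcb->unacked;
    if (seg == NULL) {
        return;
    }
    pcb->unacked = seg->next;
    seg->next = pcb->unsent;
    pcb->unsent = seg;
}

/* Step 15: put the whole unacked queue in front of the unsent queue. */
void model_rexmit_rto(struct toy_pcb *pcb)
{
    struct toy_seg *last = pcb->unacked;
    if (last == NULL) {
        return;
    }
    while (last->next != NULL) {
        last = last->next;
    }
    last->next = pcb->unsent;
    pcb->unsent = pcb->unacked;
    pcb->unacked = NULL;
}

/* Replays steps 12 and 15 with the sequence numbers from the report and
 * copies the resulting unsent order into out[]. Returns the number of
 * segments written. */
size_t model_scenario(uint32_t *out, size_t cap)
{
    struct toy_seg segs[4] = {
        { 10085, NULL }, { 11453, NULL }, { 12818, NULL }, { 14183, NULL }
    };
    struct toy_pcb pcb = { NULL, &segs[0] };
    const struct toy_seg *s;
    size_t n = 0;

    assert(out != NULL);
    segs[0].next = &segs[1];
    segs[1].next = &segs[2];
    segs[2].next = &segs[3];

    model_rexmit(&pcb);      /* fast retransmit; output itself is skipped */
    model_rexmit_rto(&pcb);  /* later retransmission timeout */

    for (s = pcb.unsent; s != NULL && n < cap; s = s->next) {
        out[n++] = s->seqno;
    }
    return n;
}
```

Calling model_scenario() with a four-element buffer leaves 10085 in the
last slot, behind the three higher sequence numbers, which is the
ordering reported in step 15.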

> I needed a quick fix for our project and therefore I reordered the
> queue in tcp_output() before its
> while loop. However, this is just a quick fix to fight
> the symptoms. Therefore I ask
> for other suggestions or perhaps a patch.

Your solution isn't that bad, other than the amount of CPU required to
sort the queue.  I'll try and find some time to look at this though and
come up with something better.  Other people's suggestions are always
welcome though as time is often in short supply!
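
As one possible direction, and purely as a sketch rather than a proposed
patch: if both queues are individually in sequence order at the time of
the timeout (as they are in the step 12 state above), they could be
merged in a single linear pass instead of being sorted afterwards. The
names below (mseg, seq_lt, merge_ordered) are hypothetical and not part
of lwIP; the comparison uses signed 32-bit difference arithmetic so that
it stays correct across sequence number wraparound.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment type for illustration; not lwIP's tcp_seg. */
struct mseg {
    uint32_t seqno;
    struct mseg *next;
};

/* Wrap-safe "a comes before b" test on 32-bit sequence numbers. */
int seq_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Merge two queues that are each already in sequence order into one
 * ordered queue. Runs in time linear in the total number of segments.
 * Returns the head of the merged queue. */
struct mseg *merge_ordered(struct mseg *a, struct mseg *b)
{
    struct mseg *head = NULL;
    struct mseg **tail = &head;

    while (a != NULL && b != NULL) {
        if (seq_lt(b->seqno, a->seqno)) {
            *tail = b;
            b = b->next;
        } else {
            *tail = a;
            a = a->next;
        }
        tail = &(*tail)->next;
    }
    /* Append whatever remains of the non-empty queue. */
    *tail = (a != NULL) ? a : b;
    return head;
}
```

Merging unsent (10085) with unacked (11453, 12818, 14183) this way keeps
10085 at the head. Whether this is cheap enough, or whether the ordering
assumption always holds in the real stack, would need checking against
the actual code.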

If there's a bug open for this, could you add these details to the bug?
If there's no open bug, could you open one?

Thanks

Kieran
