Re: [lwip-users] Assert after dropped TX packet

From: Dittrich, Matthew
Subject: Re: [lwip-users] Assert after dropped TX packet
Date: Wed, 26 Oct 2011 11:51:33 -0500

I have verified that my MAC is dropping the TX frames during the "dead time".
Today I put some debug output into my low_level_output(), and lwIP is passing
the frames in; they are just not making it to the wire. So that's definitely on
me (or the silicon! :)

Yesterday I was fortunate to get a really clean capture. Sorry to spam the list
with pcaps when it's a driver issue, but I think the attached pcap highlights
"something odd".  This capture is much easier to digest than the previous ones.

The problem is between frames #29 and #30, debug log line 463.  There are no
issues immediately leading up to frame #29 (no Nagle-ish activity like in my
previous pcaps).  Then my driver drops a frame twice in the "dead time"; there
are a few tcp_write()s after the first TX failure, after which the congestion
window shrinks to MSS. All these tcp_write()s seem to complete successfully.  A
new PC app-level request comes in after its timeout with frame #30, and lwIP
attempts to send the response data, but the "ntohl(seg->tcphdr->seqno) -
pcb->lastack + seg->len <= wnd" test of the while() loop in tcp_output() at
line 962 of tcp_out.c keeps any data from sending. This is the "my_wnd" line
621 of the debug dump. An ACK is sent instead (frame #31). lwIP makes one more
attempt to send the unacked segment (which my driver drops again... and only
that specific frame!) at line 770, and lwIP never calls tcp_output_segment()
again.

So, does this seem normal?  Is everything just "plugged up" waiting for the PC
to ACK the segment that keeps getting dropped?  Would the next
tcp_output_segment() contain the same 795985:796699 segment that I keep
dropping, but the PC app gives up before it would be sent (exponential
backoff?)? After the first drop at 134447 ms, the retries are made ~4 sec
later at 138733 ms, then ~10 sec after that at 148037 ms. The PC closes the
connection ~16 sec after that.

Now I am off to find why my MAC is refusing to send a single unique frame
amongst hundreds of MB of data!

Thanks again,

-----Original Message-----
From: address@hidden [mailto:address@hidden On Behalf Of Dittrich, Matthew
Sent: Tuesday, October 25, 2011 1:43 PM
To: Mailing list for lwIP users
Subject: Re: [lwip-users] Assert after dropped TX packet

Thanks for the reply Kieran,

I am glad to hear I am not (willfully) breaking the pbuf API rules. Hope you 
guys are ready for another long one! :)

> My guess is that you've hit the
> common fault of failing to ensure that there is only a single thread 
> active in the lwIP core at any one time.

I can assure you I am not in the stack from different threads.  I am using the
raw API in order to avoid creating a task per connection.  My Ethernet task
loop consists of a "handle_packets()" (see below); netif->input is
ethernet_input(), and low_level_input() returns a PBUF_RAW from PBUF_POOL.
Then sys_check_timeouts() is run, then my app-level "monitor_events()" is run;
this watches a FreeRTOS queue that the other app tasks put messages into for
the Ethernet task. monitor_events() only makes tcp_write()s (no eth input
handling here), and only when we are not currently sending something else.
Then a FreeRTOS vTaskDelay(1) pauses the task until the next tick, and the
loop repeats. No interrupts are involved.

I have been narrowing in on the failure, and it appears to happen only when
the machine is running "cycles".  I mentioned that our protocol is mostly
request/response, but there are "unsolicited" responses sent from
monitor_events(): one happens at the beginning of a cycle, one at the end.  It
also appears that these unsolicited responses must happen at exactly the wrong
time for things to fall apart. monitor_events() also maintains a timer that
fires off the udp_sendto() for the broadcast, though that has been disabled
lately to rule it out as a cause.

Attached is a pcap and some debug output.  I prepended the value of sys_now()
(units of milliseconds) and a hyphen in my LWIP_PLATFORM_DIAG() definition; my
(novice) comments are preceded by # and were added after the fact.  The debug
output starts at packet #167 of the capture; I can provide debug output before
that corresponding to this capture if more "good" transactions would help the
analysis.  I also modified the two identical "tcp_output: snd_wnd " debug
output lines to differentiate them.  Here is my Reader's Digest version of the
events:
1. Packet #195 is the last "good" segment sent from lwIP.
2. lwIP sends no data after this.
3. There is a 10-second pause between #196 and #197; this is a PC app-level
timeout, causing a "fallback" "poll" message request from the PC.
   During this "dead time" lwIP thinks it sent 202 bytes, but that packet is
not seen in Wireshark (timestamp 433878 in the debug output).
   There is a "cycle start" (at 433930) unsolicited response sent; tcp_write()
returns OK, but it is never seen on the wire.  After this the congestion
window reduces to MSS.
   A "cycle stop" (at 434852) message has the same fate... the actual wall
time between start and stop events in the system is ~1.42 sec with this setup.
   lwIP attempts to send the same 2267210:2267412 segment, with nothing on the
wire during the dead time.
4. Packet #197, containing a new request, breaks the cycle; the lwIP app code
services the request and tcp_write()s without any indication of failure (see
the " -mt 17 ERR_OK!!!" app-level debug output).
   However, the test involving tcp_do_output_nagle() fails, and nothing is
sent.  I "blew up" the macro into its own function for easier debugging; the
result of its output and the flag test are in the debug dump.
5. #198 appears to be a retransmission of #197 (the only difference is the IP
ID field) from the PC; the lwIP app is not informed of it (correctly, I think).
6. tcp_do_output_nagle() keeps any data from sending for the rest of the
capture.
7. ACKs are not sent until WinXP does a cumulative retransmission: packets
#202, #207, #214.
8. In the end, the PC app has "so many" timed-out requests that it gives up
and closes the connection.  Somewhere along the line the assert() "went away";
it has not happened since last Friday.

I tried dropping back to released 1.4.0 with no difference.  I have some
captures where the while() loop "ntohl(seg->tcphdr->seqno) - pcb->lastack +
seg->len <= wnd" test fails for a while, then the tcp_do_output_nagle() test
fails in tcp_output(). I think it is a matter of the timing of the unsolicited
responses in relation to some dropped packet; I get different results
depending upon how much debug output I have enabled.  Again, it only seems to
happen when the machine is running cycles, creating the unsolicited messages,
but the API returns "all good" every time.  These captures are from a direct
PC-to-device physical connection, no routers/switches, and lwIP is handling a
single TCP connection.  The fact that lwIP thinks it is sending something
during the "dead time" but nothing appears on the wire makes me question my
driver again... but shouldn't lwIP handle lost packets correctly? And why does
it ACK the cumulative retransmissions only?  Those always make it out.  Is
there anything that can fail after the "tcp_output_segment: " debug output?

Thanks again,


void ethernetif_handlepackets(struct netif *netif) {
  struct eth_hdr* ethhdr;
  struct pbuf* p;

  /* move received packet into a new pbuf */
  p = low_level_input(netif);

  while (p != NULL) {
    /* points to packet payload, which starts with an Ethernet header */
    ethhdr = p->payload;

    switch (htons(ethhdr->type)) {
    case ETHTYPE_IP:
    case ETHTYPE_ARP:
      /* hand the full packet to the stack; ethernet_input() takes
         ownership on success */
      if (netif->input(p, netif) != ERR_OK) {
        pbuf_free(p);
      }
      p = NULL;
      break;
    default:
      /* unknown EtherType: drop the frame */
      pbuf_free(p);
      p = NULL;
      break;
    }

    /* check for another queued frame */
    p = low_level_input(netif);
  }
}


Attachment: 20111025e_wnd_short.pcap
Description: 20111025e_wnd_short.pcap

Attachment: 20111025e_wnd_short.txt
Description: 20111025e_wnd_short.txt
