lwip-users
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [lwip-users] Assert after dropped TX packet


From: Dittrich, Matthew
Subject: Re: [lwip-users] Assert after dropped TX packet
Date: Tue, 25 Oct 2011 13:42:51 -0500

Thanks for the reply Kieran,

I am glad to hear I am not (willfully) breaking the pbuf API rules. Hope you 
guys are ready for another long one! :)

> My guess is that you've hit the
> common fault of failing to ensure that there is only a single thread
> active in the lwIP core at any one time.  

I can assure you I am not in the stack from different threads.  I am using the 
raw API in order to avoid creating a task per connection.  My Ethernet task 
loop consists of a "handle_packets()", see below, netif->input is 
ethernet_input().  low_level_input() returns a PBUF_RAW from PBUF_POOL. Then 
sys_check_timeouts() is run, then my app level "monitor_events()" is run, this 
watches a FreeRTOS queue that the other app tasks put messages into for the 
ethernet task. The monitor_events() only makes tcp_writes() (no eth input 
handling here) and only when we are not currently sending something else.  Then 
a FreeRTOS vTaskDelay(1) will pause the task until the next tick. Then repeat 
the loop. No interrupts are involved.

I have been narrowing in on the failure, and it appears to happen only when the 
machine is running "cycles".  I mentioned that our protocol is mostly 
request/response, but there are "unsolicited" responses sent from 
monitor_events(), one happens at the beginning of a cycle, one at the end.  It 
also appears that these unsolicited responses must happen at the exact wrong 
time for things to fall apart. monitor_events() also maintains a timer that 
fires off the udp_sendto() for the broadcast, though that has been disabled 
lately to rule it out as a cause.

Attached is a pcap, and some debug output.  I prepended the value of sys_now() 
(units of milliseconds) and a hypen in my LWIP_PLATFORM_DIAG() definition, my 
(novice) comments are preceded by # and were added after the fact.  The debug 
output starts at packet #167 of the capture, I can provide debug output before 
that corresponding to this capture if more "good" transactions would help the 
analysis.  I also modified the two identical "tcp_output: snd_wnd " debug 
output lines to differentiate them.  Here is my reader's digest version of the 
cap/debug:

1. Packet #195 is the last "good" segment sent from lwip (192.168.10.6).
2. Lwip sends no data after this
3. There is a 10 second pause between 196 and 197, this is an PC app level 
timeout, causing a "fallback" "poll" message request from the PC (192.168.10.5).
   During this "dead time" lwip thinks it sent 202 bytes, but that packet is 
not seen in wireshark (timestamp 433878 in debug)
   There is an "cycle start" (at 433930) unsolicited response sent, tcp_write() 
returns OK, never seen on wire.  After this the congestion window reduces to 
MSS.
   A "cycle stop" (at 434852) message has the same fate... the actual wall time 
between start and stop events in the system is ~1.42 sec with this setup.
   Lwip attempts to send the same 2267210:2267412 with nothing on the wire 
during the dead time.
4. Packet #197 containing a new request breaks the cycle, the lwip app code 
services the request, tcp_write()'s without indication of failure (see " -mt 17 
ERR_OK!!!" app level debug output).
   However the test involving tcp_do_output_nagle() fails, and nothing is sent. 
 I "blew up" the macro into its own function for easier debug, 
   The result of its output and the flag test are in the debug dump as 
"my_do_nagle"
5. #198 appears to be a retransmission of 197 (only difference is ip ID field) 
from the PC, lwip app is not informed of it (correctly, I think)
6. tcp_do_output_nagle() keeps any data from sending for the rest of the 
capture.
7. acks are not sent until winXP does a cumulative retransmission, packets 
#202, #207, #214
8. In the end, the PC app has "so many" timed-out request that it gives up and 
closes the connection.  Somewhere along the line the assert() "went away", it 
has not happened since last Friday.

I tried dropping back to released 1.4.0 with no difference.  I have some 
captures where the while() loop "ntohl(seg->tcphdr->seqno) - pcb->lastack + 
seg->len <= wnd" test fails for awhile, then the tcp_do_output_nagle() fails in 
tcp_output(). I think it is a matter of timing of the unsolicited responses in 
relation to some dropped packet, I get different results depending upon how 
much debug output I have enabled.  Again, it only seems to happen when the 
machine is running cycles, creating the unsolicited messages, but the API 
returns "all good" every time.  These captures are from a direct PC-to-device 
physical connection, no routers/switches, lwip is handling a single TCP 
connection.  The fact that lwip thinks it is sending something during the "dead 
time" but nothing appears on the wire makes me question my driver again... but 
shouldn't lwip handle lost packets correctly? And why does it ACK the 
cumulative retransmissions only?  Those always make it out.  Is there anything 
that can fail after the "tcp_output_segment: " debug output?

Thanks again,
MD

-------------------------------------------------------

void ethernetif_handlepackets(struct netif *netif)
{
  struct eth_hdr* ethhdr;
  struct pbuf* p;

  /* move received packet into a new pbuf */
  p = low_level_input(netif);

  while (p != NULL) {

    eth_dbg_hp_cnt++;

    /* points to packet payload, which starts with an Ethernet header */
    ethhdr = p->payload;

#if LINK_STATS
    lwip_stats.link.recv++;
#endif /* LINK_STATS */

    switch (htons(ethhdr->type)) {
    case ETHTYPE_IP:
    case ETHTYPE_ARP:

      if (netif->input(p, netif) != ERR_OK) {
        pbuf_free(p);
        p = NULL;
      }
      break;

    default:
      pbuf_free(p);
      p = NULL;
      break;
    }

    p = low_level_input(netif);
  }
}

-------------------------------------------------------



Attachment: 20111025b_no_recovery.pcap
Description: 20111025b_no_recovery.pcap

Attachment: 20111025b_no_recovery_short.txt
Description: 20111025b_no_recovery_short.txt


reply via email to

[Prev in Thread] Current Thread [Next in Thread]