[lwip-users] Assert after dropped TX packet

From:	Dittrich, Matthew
Subject:	[lwip-users] Assert after dropped TX packet
Date:	Thu, 20 Oct 2011 15:14:19 -0500

Hello list,

I am experiencing a repeatable, but infrequent, assert after receipt of a RST.  
This happens in tcp_pcbs_sane() with some debugging turned on, and tcp_input() 
("tcp_input: TIME-WAIT pcb->state == TIME-WAIT") with it turned off... but that 
is just a symptom, not the problem.  This assert does not fire when closing the 
connection normally before the problem shows up.

Background:

I am on git master tip (upgraded from a "post 1.4.0" CVS version once this
issue came up).  TCP_OVERSIZE is set to 0; I changed it from the default
(TCP_MSS) once this issue came up and I read that the feature should not
really be trusted, although to my knowledge it is working fine in a different
product of ours.  I have TCP_DEBUG, TCP_FR_DEBUG, TCP_RTO_DEBUG,
TCP_CWND_DEBUG, TCP_OUTPUT_DEBUG, and TCP_QLEN_DEBUG enabled.  I have plenty
of memory resources.  I am running the RAW API in a single FreeRTOS task,
polling the MAC (NXP LPC2468).  All tcp_write() calls are made with
TCP_WRITE_FLAG_COPY after checking tcp_sndqueuelen(), and each is limited to
LWIP_MIN(tcp_sndbuf(), 2 * tcp_mss()) bytes (the result of some
"monkey-see-monkey-do" coding after looking into some of the example code).

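As a rough illustration of the write-size limit described above, the clamp can
be written as a small pure function.  This is only a sketch: next_chunk_len()
and its parameter names are hypothetical and are not part of lwIP or of the
application described here.  In a real build, sndbuf and mss would come from
tcp_sndbuf(pcb) and tcp_mss(pcb).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an lwIP function): number of bytes to hand to a
 * single tcp_write() call.  sndbuf and mss stand in for tcp_sndbuf(pcb) and
 * tcp_mss(pcb); remaining is what the application still has to send. */
uint16_t next_chunk_len(size_t remaining, uint16_t sndbuf, uint16_t mss)
{
  uint32_t limit = 2u * (uint32_t)mss;   /* widen so 2 * mss cannot wrap */

  if (limit > sndbuf) {
    limit = sndbuf;                      /* never exceed free send buffer */
  }
  if (limit > remaining) {
    limit = (uint32_t)remaining;         /* never exceed what is left */
  }
  assert(limit <= UINT16_MAX);           /* holds because limit <= sndbuf */
  return (uint16_t)limit;
}
```

With an MSS of 1460 and 4096 bytes of free send buffer, a 5000-byte response
would go out in chunks of at most 2920 bytes per tcp_write() call.
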
Our app protocol is (mostly) simple request/response; the lwIP app only parses
new input when it is not currently responding to a previous request.  Most
responses are short enough to be sent in full immediately (lwIP has enough
memory to buffer them), so the "is currently responding" conditional path is
rarely used.

My opaque "arg" struct contains a pbuf pointer "rx_pbuf".  If it is NULL, the
recv callback sets it to the passed-in *p (and sets "rx_pbuf_idx" to zero).  If
rx_pbuf is not NULL, I pbuf_cat(rx_pbuf, p).  Then, still within the receive
callback, I check my tx state, and if it is idle I call my "process input"
routine, which runs over rx_pbuf byte by byte parsing the request (see below).
Otherwise, rx_pbuf sits around until the tx state machine is idle; "process
input" is also conditionally called from the poll callback.

uint8_t iqe_process_input(iqe_state_t * iqe)
{
  uint8_t  request_valid = FALSE;
  
  while(iqe->tx_state    == IQE_TX_STATE_IDLE &&
        iqe->rx_pbuf     != NULL &&
        iqe->rx_pbuf_idx  < iqe->rx_pbuf->tot_len)
  {

    /* feed parse_incoming_msg() state machine with
       pbuf_get_at(iqe->rx_pbuf_idx++) */
    request_valid = parse_incoming_msg(iqe,
                      pbuf_get_at(iqe->rx_pbuf, iqe->rx_pbuf_idx++));

    /* cleanup our pbuf chain as we "eat"... */
    if(iqe->rx_pbuf_idx >= iqe->rx_pbuf->len) {
      /* we have exhausted the data in the first pbuf of the chain */
      /* let the stack know it can expand the TCP receive window */
      tcp_recved(iqe->pcb, iqe->rx_pbuf->len);

      /* update our "into the chain index" */
      iqe->rx_pbuf_idx -= iqe->rx_pbuf->len;

      /* and free the memory */
      iqe->rx_pbuf = pbuf_unchain(iqe->rx_pbuf);
    }

    if(request_valid){
    
      /* do work here, generates output, possibly moves iqe->tx_state from
         IDLE */

      /* reset the rx packet state machine (in parse_incoming_msg()) to
         handle the next request */
      iqe->rx_msg_state = IQE_RX_STATE_ENV;
      request_valid = FALSE;
    }

  }

  return 0;
}

and pbuf_unchain() (derived from pbuf_dechain()):

struct pbuf *
pbuf_unchain(struct pbuf *p)
{
  struct pbuf *q;
  /* tail */
  q = p->next;
  /* pbuf has successor in chain? */
  if (q != NULL) {
    /* assert tot_len invariant:
       (p->tot_len == p->len + (p->next ? p->next->tot_len : 0)) */
    LWIP_ASSERT("p->tot_len == p->len + q->tot_len",
                q->tot_len == p->tot_len - p->len);
    /* enforce invariant if assertion is disabled */
    q->tot_len = p->tot_len - p->len;
  }

  /* decouple pbuf from remainder */
  p->next = NULL;
  /* total length of pbuf p is its own length only */
  p->tot_len = p->len;
  /* check the invariant now, while p is still valid; p must not be
     dereferenced after pbuf_free() */
  LWIP_ASSERT("p->tot_len == p->len", p->tot_len == p->len);

  /* q is no longer referenced by p, free p */
  if (pbuf_free(p) > 0) {
    IQE_DEBUG_PRINT((VT100_YELLOW_TEXT("-pbuf_unchain: deallocated %p\r\n"),
                     (void *)p));
  } else {
    IQE_DEBUG_PRINT((VT100_RED_TEXT("-pbuf_unchain: DID NOT deallocate %p!!!\r\n"),
                     (void *)p));
  }

  /* return remaining tail, or NULL if p was the last pbuf */
  return q;
}
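
To look at the index and tot_len bookkeeping in isolation from the stack, the
two routines above can be modelled with a stand-alone sketch.  Everything here
is a hypothetical stand-in: struct mini_pbuf, mini_alloc(), mini_cat(),
mini_unchain() and mini_consume_all() are not lwIP's struct pbuf or API, and
reference counting and memory pools are deliberately left out.  It assumes
every buffer holds at least one byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pbuf: only the fields the logic needs. */
struct mini_pbuf {
  struct mini_pbuf *next;
  uint16_t len;      /* bytes in this buffer */
  uint16_t tot_len;  /* bytes in this buffer plus all following buffers */
};

struct mini_pbuf *mini_alloc(uint16_t len)
{
  struct mini_pbuf *p = malloc(sizeof *p);
  assert(p != NULL);
  p->next = NULL;
  p->len = len;
  p->tot_len = len;
  return p;
}

/* Append chain t to chain h, adding t's length to every buffer of h. */
void mini_cat(struct mini_pbuf *h, struct mini_pbuf *t)
{
  struct mini_pbuf *p;
  for (p = h; p->next != NULL; p = p->next) {
    p->tot_len += t->tot_len;
  }
  p->tot_len += t->tot_len;   /* last buffer of h */
  p->next = t;
}

/* Free the head buffer and return the rest of the chain (NULL if none). */
struct mini_pbuf *mini_unchain(struct mini_pbuf *p)
{
  struct mini_pbuf *q = p->next;
  /* check the invariant before freeing, while p is still valid */
  assert(p->tot_len == p->len + (q != NULL ? q->tot_len : 0));
  free(p);
  return q;
}

/* Consume the chain one byte at a time, as the parser loop does.  Returns the
 * byte count that would have been reported through tcp_recved(). */
uint32_t mini_consume_all(struct mini_pbuf *head)
{
  uint32_t acked = 0;
  uint16_t idx = 0;
  while (head != NULL && idx < head->tot_len) {
    idx++;                      /* one byte handed to the parser */
    if (idx >= head->len) {     /* head buffer exhausted */
      acked += head->len;
      idx -= head->len;
      head = mini_unchain(head);
    }
  }
  return acked;
}
```

In this model, a chain built from buffers of 3, 5 and 2 bytes has a tot_len of
10 at the head, and consuming it releases all three buffers and reports 10
bytes in total.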

Issue:

This worked great until we started stress testing: the PC app (WinXP SP3 in
this case) was changed to make back-to-back "graph data" requests indefinitely
(each response only amounts to a couple of MSS-sized packets of BASE64, but
it's the biggest response our app generates).  Everything runs for minutes or
hours without incident, then lwIP seems to drop a TX segment and cannot
recover.  The PC app times out waiting for the graph data and falls back to
our "keepalive/poll" message request (line 454 of the debug dump, packet #51
of the pcap).  lwIP ACKs each of these requests at the TCP level (so I am not
questioning my MAC driver), but the ACKs carry no app data.  The receive
callback gets the data, the app generates a response, and tcp_write() returns
OK (until I run out of sndqueuelen, at which point I stop calling it).
Everything (that I know of) that lwIP tells the app side of the code looks
good.  After a number of unanswered requests, the PC app closes the
connection, and that's when lwIP asserts.  This issue only seems to happen
under a heavy load of our "graph data" requests, and even then only
"randomly".  I have also tried explicitly calling tcp_output(), with no change.

I have read all the warnings about how "pbuf queues" (versus chains) are not
supported by some of the API functions.  Am I breaking that rule with my
"rx_pbuf" mechanism?  The chain vs. queue distinction is kinda lost on me... Is
the *pbuf passed to the receive callback a chain or a queue?  Am I mistakenly
creating a queue instead of a chain?  If I am "doing it wrong", what is the
recommended way to keep track of pbufs that have already been tcp_recved()'d
but not yet serviced by the app?  Or does this have nothing to do with the TCP
re-TX issue?
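
For reference, the comments in lwIP's pbuf.c describe the distinction in terms
of tot_len: the last pbuf of a packet has tot_len == len, and if that pbuf
still has a non-NULL next pointer, further packets follow on a queue.  The
sketch below applies that rule to a simplified stand-in structure.  struct buf
and count_packets() are hypothetical names used only for illustration; they
are not lwIP types or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pbuf: only the fields the rule needs. */
struct buf {
  struct buf *next;
  uint16_t len;      /* bytes in this buffer */
  uint16_t tot_len;  /* bytes from here to the end of the current packet */
};

/* Count the packets in a linked list of buffers.  A result of 1 means the
 * list is a single chain; anything larger means it is a queue. */
unsigned count_packets(const struct buf *p)
{
  unsigned n = 0;
  while (p != NULL) {
    /* walk to the last buffer of this packet: the one with tot_len == len */
    while (p->tot_len != p->len) {
      assert(p->next != NULL);   /* otherwise tot_len is inconsistent */
      p = p->next;
    }
    n++;
    p = p->next;                 /* non-NULL here means another packet */
  }
  return n;
}
```

Under this rule, linking a 3-byte buffer to a 5-byte buffer gives a chain if
the head's tot_len is updated to 8, and a two-packet queue if the head's
tot_len is left at 3.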

Attached are a pcap and a text file with the debug output.  Lines preceded by
a '-' are my app debug (many "proven ok" app-level debug messages are removed
from this build to help speed up the networking).  192.168.10.5 is the WinXP
machine, 192.168.10.6 is lwIP.  Not included in the pcap is our UDP broadcast
(a udp_sendto()'d PBUF_REF pointing to a statically allocated array, sent every
500 ms), which is part of our discovery mechanism and runs all the time.  The
broadcasts are not affected until the assert().

Any hints would be greatly appreciated... Sorry for the long post, I hope you 
made it through the whole thing! This ended up much longer than I intended.

Thanks,
MD

20111020c_lwip_assert_10.6_tcp.txt
Description: 20111020c_lwip_assert_10.6_tcp.txt

20111020c_lwip_assert_10.6_tcp2.pcap
Description: 20111020c_lwip_assert_10.6_tcp2.pcap
