I was in error to suggest this problem. At the time that I saw this
problem, the folks in question were running 0.6.3. In that version,
the user was responsible for the timer, and the usual implementation
just left it running, whether needed or not.
I can see what you mean about the use of the timer currently. It
should get launched from the tcpip thread when needed, and that should
preclude problems. Sorry about the confusion.
One other issue around that time was a set of data cache coherency
problems related to the Ethernet DMA. We eventually turned
off their data cache to avoid the confusion. Any chance that you have
such a problem?
Tom C. Barker wrote:
Jim,
Not barging in at all Jim. On the contrary,
thanks for the response. I can confirm
I am using lightweight protection and I
will take a look at the timer call. The call
to the tcp timer is made only when the
timer is needed, though. What would be the
significance of the initial call to
sys_timeout if there is no tcp connection, and so no need
for a tcp timer at startup? It would seem
that a call to the tcp timer would result in
it firing once, finding no need to fire
again and never reschedule.
Thanks again,
Tom
Pardon me again for barging in. Kieran's analysis, particularly
regarding an unmotivated retransmit, sounded very familiar. I had a
problem like this at one of my clients. We changed two things and it
then went away.
First, we found and fixed a problem with the tcp_tmr. It was running
in the wrong task context. It must run in the tcpip thread. The usual
method for doing this is to make the initial call to sys_timeout from
within the callback function that executes when tcpip initialization is
done.
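A minimal sketch of that arrangement, for the older setups where the application owns the timer. The calls sys_timeout(), tcp_tmr() and tcpip_init() are lwIP's; the stub bodies, the recorder variables, the 250 ms value and the names tcp_timer_cb, tcpip_init_done and start_stack are assumptions added here so the fragment compiles without lwIP. Check the signatures against the headers of the version in use.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch builds without lwIP. In a real build these come
 * from the lwIP headers; the bodies below are hypothetical and only record
 * what was requested. */
typedef unsigned int u32_t;
typedef void (*sys_timeout_handler)(void *arg);

#define TCP_TMR_INTERVAL 250    /* ms; assumed value, use the one from tcp.h */

sys_timeout_handler pending_handler;   /* last handler scheduled (stub state) */
u32_t pending_msecs;                   /* its delay (stub state) */
int tcp_tmr_calls;                     /* how often tcp_tmr() ran (stub state) */

static void sys_timeout(u32_t msecs, sys_timeout_handler h, void *arg)
{
    (void)arg;
    assert(h != NULL);
    pending_msecs = msecs;
    pending_handler = h;
}

static void tcp_tmr(void)
{
    tcp_tmr_calls++;
}

static void tcpip_init(void (*initfunc)(void *), void *arg)
{
    /* Real lwIP starts the tcpip thread and calls initfunc from it. */
    initfunc(arg);
}

/* The pattern itself. The handler runs in the tcpip thread because it was
 * registered from that thread, and it re-arms itself on every tick. */
static void tcp_timer_cb(void *arg)
{
    tcp_tmr();
    sys_timeout(TCP_TMR_INTERVAL, tcp_timer_cb, arg);
}

/* Called by the stack once tcpip initialization is done. */
static void tcpip_init_done(void *arg)
{
    sys_timeout(TCP_TMR_INTERVAL, tcp_timer_cb, arg);
}

void start_stack(void)
{
    tcpip_init(tcpip_init_done, NULL);
}
```

The point of the design is that no other task ever calls tcp_tmr() directly, so the TCP state is only touched from one context.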
Second, we found that we weren't using the lightweight protection
option that I mentioned to you earlier.
I think it was actually the first thing that was causing the retransmit
problem, but we never found out for sure. It's really difficult to
track down resource conflicts. When the problem went away, we stopped
working on it.
Tom C. Barker wrote:
Thanks for your analysis Kieran. Forgive my assessment of
what ACKs are what: I was speaking of the multiple ACKs
the client sends back. ".65", the problem node, is in fact
the lwIP ftp server.
I have all my DEBUG statements on and find that I never get
a tcp_enqueue of the missing packet. It just skips over it.
My only priority is this issue right now, so if you or anyone
has any ideas of what I can watch for, I am open to ideas. Meanwhile
I'm crafting a bit-patterned file to help identify where the
problem is occurring.
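One way such a bit-patterned file might be built, as a rough sketch: every 4-byte word stores its own file offset in big-endian order, so any captured payload says where in the file it came from. The layout and the names fill_pattern and write_pattern_file are assumptions for illustration, not the file actually used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fill buf with the pattern for file positions file_offset .. file_offset+len-1.
 * Each aligned 4-byte word holds its own offset, most significant byte first. */
void fill_pattern(unsigned char *buf, size_t len, uint32_t file_offset)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        uint32_t pos = file_offset + (uint32_t)i;
        uint32_t word = pos & ~(uint32_t)3;        /* offset of containing word */
        unsigned shift = 8u * (3u - (pos & 3u));   /* which byte of that word */
        buf[i] = (unsigned char)(word >> shift);
    }
}

/* Write a pattern file of the given size. Returns 0 on success, -1 on error. */
int write_pattern_file(const char *path, uint32_t size)
{
    unsigned char chunk[512];
    uint32_t done = 0;
    FILE *f = fopen(path, "wb");

    if (f == NULL) {
        return -1;
    }
    while (done < size) {
        size_t n = (size - done < sizeof chunk) ? (size - done) : sizeof chunk;
        fill_pattern(chunk, n, done);
        if (fwrite(chunk, 1, n, f) != n) {
            fclose(f);
            return -1;
        }
        done += (uint32_t)n;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

With a file like this, the first aligned word of a captured segment can be read straight off the hex dump and compared with the offset implied by the segment's sequence number.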
Tom
-----Original Message-----
From: address@hidden [mailto:address@hidden] On Behalf Of Kieran Mansley
Sent: Friday, March 04, 2005 1:29 AM
To: Mailing list for lwIP users
Subject: Re: [lwip-users] FTP-DATA exchange: TCP issues
On Thu, 2005-03-03 at 09:54 -0800, Tom C. Barker wrote:
Hello,
Maybe to short-circuit this issue, I am working with
0.7.2 and am in the process of moving to 1.1.0 so if
the following problem resembles a bug prior to 1.1.0,
please let me know.
In testing an ftp implementation where I will occasionally
successfully transfer a 400k file, I have come across a
consistently reproducible issue where my lwIP ftp server
seems to have dropped an ACK in that according to the
attached (truncated-packets) ethereal file, the packet on
line 249 should have ACK'd 264364, but instead ACKs 267284.
The rest of the (doomed) transaction is spent trying to
shoehorn in a few packets to the client's unacked queue.
Your description doesn't seem to match the trace that you've attached.
There is no packet there that ACKs 267284.
However, there is clearly something going wrong in that data transfer.
The problem seems to me to start with packet 245, which (i) is a
retransmission (of packet 242) when none seems necessary and (ii)
doesn't have the same payload as the earlier transmission of the same
data. Looks to me like packet 245 has got the wrong sequence number on
it, and it is in fact the payload of the next in-order packet.
Something similar happens with packet 244 and 247: 247 is a
retransmission of 244, but would not seem to be necessary, and this time
they both have the same payload.
What's more worrying is that the ".65" node then fails to retransmit the
correct data when it should: it gets many duplicate acknowledgements for
264364, which should lead it to retransmit that packet, but it refuses.
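For reference, the sender-side behaviour expected here can be sketched as a small counter. This is a simplified illustration of fast retransmit with the standard threshold of three duplicate ACKs, not lwIP's code; struct sender, on_ack and the has_unacked flag are hypothetical names, and the full duplicate-ACK test (no payload, unchanged window) is left out.

```c
#include <assert.h>
#include <stdint.h>

#define DUPACK_THRESHOLD 3   /* standard fast retransmit threshold */

struct sender {
    uint32_t lastack;   /* highest cumulative ACK seen so far */
    unsigned dupacks;   /* consecutive duplicates of lastack */
};

/* Process one incoming ACK number. Returns 1 when the segment starting at
 * s->lastack should be retransmitted now, 0 otherwise. */
int on_ack(struct sender *s, uint32_t ackno, int has_unacked)
{
    assert(s != 0);
    if (ackno == s->lastack) {
        if (!has_unacked) {
            return 0;                 /* nothing outstanding, not a loss signal */
        }
        s->dupacks++;
        return s->dupacks == DUPACK_THRESHOLD;
    }
    if ((int32_t)(ackno - s->lastack) > 0) {   /* wrap-safe "ackno is newer" */
        s->lastack = ackno;
        s->dupacks = 0;
    }
    return 0;
}
```

Fed the repeated acknowledgements for 264364 from the trace, a sender following this rule would resend that segment on the third duplicate.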
I can't explain this in full, but hopefully that will give you some
clues about what might be wrong. You could compare the captured
payloads against the file that is being transferred to check my theory
about 245 having the wrong sequence number.
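That comparison could look roughly like the sketch below. It assumes the whole file is in memory and that the sequence number of the first data byte is known (1 when using relative sequence numbers from the capture tool); seq_to_file_offset and payload_matches are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* File offset of the byte carried at sequence number seq, given the sequence
 * number of the first data byte. Unsigned arithmetic handles wraparound. */
uint32_t seq_to_file_offset(uint32_t seq, uint32_t first_data_seq)
{
    return seq - first_data_seq;
}

/* Returns 1 if the captured payload equals the file contents at offset,
 * 0 if it differs or would run past the end of the file. */
int payload_matches(const unsigned char *file, size_t file_len,
                    uint32_t offset,
                    const unsigned char *payload, size_t len)
{
    assert(file != NULL && payload != NULL);
    if (offset > file_len || len > file_len - offset) {
        return 0;
    }
    return memcmp(file + offset, payload, len) == 0;
}
```

If a packet fails the check at its own offset but passes at that offset plus one segment length, it is carrying the next in-order payload under the wrong sequence number.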
Kieran
_______________________________________________
lwip-users mailing list
address@hidden
http://lists.nongnu.org/mailman/listinfo/lwip-users
--
Jim Gibbons
address@hidden
Gibbons and Associates, Inc.
TEL: (408) 984-1441
900 Lafayette, Suite 704, Santa Clara, CA
FAX: (408) 247-6395