I am working
on a data acquisition system using an Analog Devices' Blackfin BF537, which has
a 100Mb/s MAC and utilizes a port of lwip. The lwip port appears to
be derived from STABLE-0_6_3. My application requires high throughput on
the Ethernet interface (~20 Mb/s), so I have been creating very simple
applications to run on the embedded processor with lwip to test the throughput
and reliability of the setup. The sample application on the
BF537 simply creates, binds, and listens on a socket, and then in an
infinite loop accepts a single connection and then while that connection is
open sends large packets (1460 bytes) on the connection. I have a simple
LabVIEW application that receives the data, and I have also been using the
Wireshark analyzer to look at the transfers. In this configuration, I am
seeing the following behaviors, and I would really appreciate some insight on them:
1) When lwip
is configured to use DHCP, it is very difficult to maintain a high
throughput. In fact, the connection very frequently times out after
transferring just a few packets. I don't see much other traffic related
to having the DHCP server on the LAN, and I use a switch to isolate the
transmitting device and the receiving PC.
[TT] This could be a function of
the configuration of your DHCP server, and the length of lease that is granted
during the initial dhcp negotiation.
I will confirm this. I have
attached a log file showing a case that timed out after a few transfers (070323
DHCP Startup Failed, some data.pcap), and one that failed with no data
transferred (070323 DHCP Startup Failed.pcap).
2) When not
using DHCP, in general the connection is more reliable. However, there
appears to be a "cold start" issue, where when the devices on the LAN
(transmitter, switch, and receiving PC) are powered on for the first time the
connection has trouble establishing itself. A few packets will transfer
successfully, followed by a dropped packet with no successful retransmissions
over 30 seconds.
[TT] This is pretty hard to
diagnose. To my mind, it sounds like it could be a problem with the
application's design at system startup. To diagnose this more
closely, sniffer logs would be needed.
I have attached a log file showing this failure
(070323 Startup Failure.pcap). Do you have a recommendation for the way
the system should startup?
[TT] Not really. I might try to isolate the
problem by trying various ordered startup procedures, then maybe a fix would
Is a delay between accepting the connection and
transmitting data likely to improve this issue? There is already a
considerable delay between when I power the switch and when I make the
[TT] It could. If you try to
send before the link has been established, there could be some problems
with dropped packets. DHCP can work up to some fairly long waits, which
would delay establishment of any connections. If your port incorrectly (as
I just found mine does) marks the netif as ‘up’ at interface open
time when dhcp is enabled, then there could be some issues. The dhcp
framework marks the interface as ‘up’, via the netif_set_up() once
the dhcp bind occurs.
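
One way to act on the point above is to hold off on accepting connections and sending data until the interface actually reports up. The following is only a minimal sketch in C, assuming the port offers some way to query the interface state (lwIP versions that have netif_set_up() typically have a matching netif_is_up(), but check your port). The names wait_for_netif_up, is_up_fn, sleep_ms_fn and demo_up_on_third_poll are hypothetical and are not part of lwIP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback types supplied by the application.
 * is_up would typically wrap netif_is_up(netif) on ports that provide it;
 * sleep_ms would wrap the RTOS delay call. Neither is an lwIP API. */
typedef int  (*is_up_fn)(void *ctx);
typedef void (*sleep_ms_fn)(unsigned ms);

/* Poll until the interface reports up or roughly timeout_ms has elapsed.
 * Returns 1 if the interface came up, 0 on timeout. */
static int wait_for_netif_up(is_up_fn is_up, void *ctx,
                             sleep_ms_fn sleep_ms,
                             unsigned poll_ms, unsigned timeout_ms)
{
    unsigned waited = 0;

    assert(is_up != NULL && poll_ms > 0);
    for (;;) {
        if (is_up(ctx)) {
            return 1;               /* interface is up: safe to start sending */
        }
        if (waited >= timeout_ms) {
            return 0;               /* gave up waiting */
        }
        if (sleep_ms != NULL) {
            sleep_ms(poll_ms);      /* yield to the rest of the system */
        }
        waited += poll_ms;
    }
}

/* Demo stub (illustration only): reports "up" on the third poll.
 * ctx points at an int call counter owned by the caller. */
static int demo_up_on_third_poll(void *ctx)
{
    int *calls = (int *)ctx;
    return ++(*calls) >= 3;
}
```

In an application this would be called once after the stack is initialised and before the listen/accept loop, with a timeout long enough to cover the DHCP negotiation.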
without DHCP, I can observe stalls in the transmitted data stream.
Normally, packets are transmitted more than once a millisecond (up to 8 or 10
per millisecond), but occasionally there are periods of ~150ms where no data is
transmitted. The receive window has not closed, and there is no indication
of dropped packets or retransmission in the log file.
[TT] It could be that the
transmit window (assuming TCP) is full. It could also be something to do
with the multitude of #defines that tune the performance/space in opt.h.
Some sniffer logs may shed some light on the issue. What window size does
the remote end advertise?
The remote end advertises a 64k window size.
I wasn't clear on a lot of the #defines - I've attached my option header file,
could you comment? Is there somewhere I have limited my transmit window to just
a few segments?
[TT] Yes, in the sniffer log I
see your transmit window is limited to 8192, which is pretty small. This
is governed via the TCP_WND #define in your lwipopts.h.
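
To put the 8192-byte figure in perspective, the usual bound is that a sender can have at most one window of data in flight per round trip, so throughput cannot exceed window / RTT. The sketch below just does that arithmetic in C; the function names are made up for illustration. Note also that lwIP's options include TCP_SND_BUF, which limits how much data the sender may queue; which of the two settings is the binding limit in this setup is not established in this thread, so treat that as something to check in your lwipopts.h.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on throughput, in bits per second, when at most
 * window_bytes can be outstanding per round trip of rtt_us microseconds. */
static uint64_t window_limited_bps(uint32_t window_bytes, uint32_t rtt_us)
{
    assert(rtt_us > 0);
    return ((uint64_t)window_bytes * 8u * 1000000u) / rtt_us;
}

/* Number of full-size segments that fit in the window. */
static uint32_t segments_in_window(uint32_t window_bytes, uint32_t mss)
{
    assert(mss > 0);
    return window_bytes / mss;
}
```

With an 8192-byte window and 1460-byte segments, only 5 full segments fit in the window. At a 1 ms round trip the bound is about 65 Mb/s, so the window alone would not prevent 20 Mb/s, but the effective round trip (including any delayed ACK on the receiving side) has to stay under roughly 3.3 ms to sustain that rate.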
without DHCP, I observe ~2s stalls. These appear to be caused by >1
dropped packet, which results in the first dropped packet being resent by fast
retransmission, and all other packets being resent by the retransmission
[TT] This sounds like
half-duplex Ethernet operation to me. Make sure you don’t have any
half-duplex hubs floating around on your network. These will cause random
wait times on the order you mentioned.
I confirmed that the 3 devices comprising my LAN
(embedded device, HP switch, and IBM laptop) are all at least 10/100 auto
negotiate half/full duplex, and the IBM laptop is a 1Gb device. Other
than forcing the devices to 100Mb Full duplex, is there a way to confirm that
nobody is operating at half duplex?
[TT] Not without some access to the driver
statistics, or a LAN analyzer. If you have access to some driver
statistics, and you see any collisions, then you know there’s a
half-duplex device on that segment.
Can you clarify why a half-duplex hub would
cause random waits?
[TT] It’s due to the collision
handling protocols of the CSMA/CD thing. I’m having trouble viewing
the 802.3 standard at the moment, but the basic operation is as follows.
If a node starts to send an Ethernet frame, but detects a collision, it backs
off for a random interval, which, if I recall correctly, can range upwards of a
second, before it attempts a retransmit.
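
For reference, the half-duplex backoff in 802.3 is a truncated binary exponential backoff: after the n-th collision on a frame, the station waits a random number of slot times between 0 and 2^k - 1, where k = min(n, 10), and the frame is discarded after 16 failed attempts. A slot time is 512 bit times at 10 and 100 Mb/s. The sketch below computes the resulting upper bounds in C; the function and macro names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_BIT_TIMES 512u  /* slot time for 10 and 100 Mb/s Ethernet */
#define BACKOFF_LIMIT  10u   /* exponent stops growing after 10 collisions */
#define MAX_RETRIES    15u   /* the 16th failed attempt discards the frame */

/* Largest possible backoff, in slot times, after the given
 * number of collisions on one frame (1..15). */
static uint32_t max_backoff_slots(unsigned collisions)
{
    unsigned k;

    assert(collisions >= 1 && collisions <= MAX_RETRIES);
    k = collisions < BACKOFF_LIMIT ? collisions : BACKOFF_LIMIT;
    return (1u << k) - 1u;
}

/* Same bound in nanoseconds for a link speed given in Mb/s
 * (one bit time is 1000 / mbps nanoseconds). */
static uint64_t max_backoff_ns(unsigned collisions, unsigned mbps)
{
    assert(mbps > 0);
    return (uint64_t)max_backoff_slots(collisions) * SLOT_BIT_TIMES * 1000u / mbps;
}

/* Sum of the largest possible backoffs over all retries of one frame. */
static uint32_t worst_case_total_slots(void)
{
    uint32_t total = 0;
    unsigned n;

    for (n = 1; n <= MAX_RETRIES; n++) {
        total += max_backoff_slots(n);
    }
    return total;
}
```

By this arithmetic a single backoff is at most about 52 ms at 10 Mb/s and about 5.2 ms at 100 Mb/s, and even the worst case summed over every retry of one frame is about 0.37 s at 10 Mb/s or 37 ms at 100 Mb/s. Multi-second stalls are therefore more likely to come from TCP retransmission timers reacting to frames lost in collisions than from the backoff delay itself.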
confirm that any or all of these behaviors is unexpected in a LAN environment
(RTT normally <1ms)? Although I'm new to this, it seems
surprising that my little LAN with <15' CAT5 cable segments is so likely to
have corrupted or lost packets.
[TT] An old hub or faulty
connector can cause all sorts of issues. I’d revert back to as
simple a network as possible, and proceed from there, adding segments until
some bad behavior is exhibited.
I can try this with just a crossover cable, but
there's not much room to go simpler. For the DHCP problems, can you recommend a
simple way to add a DHCP server without connecting into my full office network?
[TT] I loaded up one of my
targets with an Ubuntu install, then installed the dhcpd3 server. This
gives me additional visibility into what’s going on with the DHCP negotiation,
and I can try out various options, etc.
give me some guidance on what to expect regarding lost packets?
[TT] An analysis I did some time
back for an avionics platform concluded that I could expect that the phy, at a
minimum, would cause one lost/corrupt packet per 24 hour period on a 3 in. long
peer to peer link. It seems to me that a dozen a day on a small network
would not be unusual.
A dozen a day doesn't sound unreasonable. I'm
currently able to generate what I assume are lost/corrupt packets within a 20
or 30 second log file.
recovery processes I've observed correct behavior? Should only a single
packet be resent using fast retransmission? Is there anything inherent in
the stack that could cause brief pauses in the data stream? Why does
using DHCP apparently make it so difficult to establish and maintain a
high-throughput connection, particularly since there doesn't seem to be any
other traffic on the LAN?
for the multiple questions, but I needed to start somewhere, and I've already
reached the limit of what the Analog Devices' support engineers can help
with. I can provide the log files from Wireshark if that would be
helpful, but some are very large (tens of megabytes). I'd also be
interested if anyone can suggest other resources to further my understanding of
networking and TCP/IP issues.
[TT] You’d start by
locating the portions of the capture logs that show aberrant behavior.
I'll follow up with those logfiles shortly. Is
there an easier way to cut them down to size than using the editcap
[TT] I sometimes use the GUI,
highlight the sections I want, then save the selection to a file. I’ve
never tried the editcap, but it sounds painful.