lwip-users

Re: [lwip-users] TCP bandwidth limited by rate of ACKs


From: Bill Auerbach
Subject: Re: [lwip-users] TCP bandwidth limited by rate of ACKs
Date: Tue, 11 Oct 2011 15:56:29 -0400

>Bill Auerbach wrote:
>
>> Mason wrote:
>>
>>> I was expecting to reach 80+ Mbit/s, so I captured the conversation
>>> with Wireshark, and I noticed that the sender is being throttled
>>> because the receiver (the STB running lwip) is not sending ACKs
>>> fast enough.
>>>
>>> cf. STB_TCP_RX_2.pcap (6 MB -- I truncated payloads to 100 bytes)
>>> http://dl.free.fr/tKDBuYTrc
>>>
>>> Can someone nudge me in the right direction to optimize my
>>> build of lwip?
>>
>> Some things to optimize are the internet checksum,
>
>As a first-order approximation, I simply disabled CHECKSUM_CHECK_TCP.
>I don't think the IPv4 checksum check (20 bytes) has much of an impact
>on performance, do you?
>
>> use zero-copy receive and transmit,
>
>I wish I could, but I've no idea how to accomplish that.
>cf. my previous thread "Custom memory management"
>http://lists.gnu.org/archive/html/lwip-users/2011-10/msg00008.html

For DMA receive (which also gives you zero-copy), there was a post here a
long time ago on how to do it.  I needed the same thing for the PowerPC and
implemented it based on the idea described.  You keep a queue of pbuf
pointers and set up DMA into the payload of the first pbuf.  The Eth RX ISR
increments the RX count and sets up DMA for the next pbuf in the list
(wrapping at the end).  If applicable, it posts to a waiting task that a
packet (or more) is ready.  I used this RX count in a background polling
loop.  In the main app or the task waiting for packets, remove the oldest
packet from the queue, allocate a new pbuf, and replace the old pbuf pointer
with the new one.  Nothing is copied; you just swap pointers.  Pass the old
pbuf to ethernetif_input.  I process all the packets that are waiting in one
shot.  lwIP frees the pbuf when done.
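
Roughly, the pointer-swap scheme above could look like the sketch below.  It
is not my PowerPC driver: struct rx_buf and rx_buf_alloc() are stand-ins for
lwIP's struct pbuf and pbuf_alloc(), dma_set_target() is a hypothetical
hardware hook, rx_deliver() stands in for the hand-off to ethernetif_input,
and the ring size and buffer length are made-up values.  Overrun handling
(the ISR lapping the consumer) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RX_RING_SIZE 8u      /* assumed; power of two */
#define RX_BUF_LEN   1536u   /* assumed; room for one Ethernet frame */

/* Stand-in for lwIP's struct pbuf (assumption, not the real layout). */
struct rx_buf {
    uint16_t len;
    uint8_t  payload[RX_BUF_LEN];
};

struct rx_buf *rx_ring[RX_RING_SIZE];  /* queue of buffer pointers */
volatile unsigned rx_count;            /* frames received, bumped by the ISR */
unsigned rx_done;                      /* frames already handed to the stack */
unsigned long rx_bytes_seen;           /* only here so the demo is observable */
uint8_t *dma_target;                   /* where the (pretend) DMA writes next */

/* Hypothetical hardware hook: point the MAC's RX DMA at a buffer. */
void dma_set_target(uint8_t *p) { dma_target = p; }

/* Stand-in for pbuf_alloc(). */
struct rx_buf *rx_buf_alloc(void) { return calloc(1, sizeof(struct rx_buf)); }

/* Stand-in for passing the buffer to the stack, which frees it when done. */
void rx_deliver(struct rx_buf *b)
{
    rx_bytes_seen += b->len;
    free(b);
}

int rx_ring_init(void)
{
    for (unsigned i = 0; i < RX_RING_SIZE; i++) {
        rx_ring[i] = rx_buf_alloc();
        if (rx_ring[i] == NULL)
            return -1;
    }
    rx_count = 0;
    rx_done = 0;
    rx_bytes_seen = 0;
    dma_set_target(rx_ring[0]->payload);  /* DMA into the first buffer */
    return 0;
}

/* Called from the Ethernet RX interrupt once a frame has landed. */
void rx_isr(uint16_t frame_len)
{
    rx_ring[rx_count % RX_RING_SIZE]->len = frame_len;
    rx_count++;
    /* Arm DMA for the next buffer, wrapping at the end of the ring. */
    dma_set_target(rx_ring[rx_count % RX_RING_SIZE]->payload);
}

/* Called from the main loop or RX task: drain everything that is waiting. */
unsigned rx_poll(void)
{
    unsigned handled = 0;

    while (rx_done != rx_count) {
        unsigned slot = rx_done % RX_RING_SIZE;
        struct rx_buf *fresh = rx_buf_alloc();
        if (fresh == NULL)
            break;                 /* out of buffers: retry on the next poll */
        struct rx_buf *full = rx_ring[slot];
        rx_ring[slot] = fresh;     /* swap pointers, no payload copy */
        rx_done++;
        rx_deliver(full);
        handled++;
    }
    return handled;
}
```

With real hardware you would also need whatever cache invalidation and
descriptor bookkeeping your MAC requires before touching the payload.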

>> do SMEMCPY more efficiently,
>
>I'm not sure I can do it better than my platform's memcpy.

You definitely can.  I wrote a better memcpy than the one supplied with GCC,
using some assembly with unrolled indexed addressing and dword copies when
the data was aligned, and it improved most copies by 50%.
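
To give the flavour in portable C rather than assembly, here is a sketch of
the same idea: unrolled 32-bit copies when both pointers are aligned, bytes
otherwise.  The name copy_words is made up, this is not the routine I
shipped, and whether it beats your platform's memcpy is something only
profiling will tell you.  lwIP lets you point SMEMCPY at a routine like this
from lwipopts.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy n bytes from src to dst (regions must not overlap).
 * Note: the uint32_t accesses assume the buffers may legally be accessed
 * as words; with GCC you may want -fno-strict-aliasing for a routine
 * like this.
 */
void *copy_words(void *dst, const void *src, size_t n)
{
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;

    /* Fast path only when both pointers are 4-byte aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & 3u) == 0) {
        uint32_t *dw = (uint32_t *)(void *)d;
        const uint32_t *sw = (const uint32_t *)(const void *)s;

        /* Unrolled: four dwords (16 bytes) per iteration. */
        while (n >= 16) {
            dw[0] = sw[0];
            dw[1] = sw[1];
            dw[2] = sw[2];
            dw[3] = sw[3];
            dw += 4;
            sw += 4;
            n -= 16;
        }
        /* Remaining whole dwords. */
        while (n >= 4) {
            *dw++ = *sw++;
            n -= 4;
        }
        d = (uint8_t *)dw;
        s = (const uint8_t *)sw;
    }

    /* Unaligned buffers, or the last few tail bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}

/* In lwipopts.h (SMEMCPY is lwIP's overridable copy macro):
 *   #define SMEMCPY(dst, src, len) copy_words(dst, src, len)
 */
```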

>I will try profiling to see if we're indeed spending most of
>the time copying data.

It doesn't have to be most of the time to pick up 20 Mbps of speed.
Unfortunately, it is harder to receive than to transmit.  We use lwIP in
real-time products where either RX is critical (so we used a PowerPC there)
or TX is critical (so we used UDP, since it was a 100 MHz FPGA-based
processor).  Currently on the 100 MHz product, RX is important but we only
need 40 Mbps.  I found that forcing the link to 100 Mbps was better than
1000 Mbps, because at the higher rate the processor dropped packets it could
not handle fast enough.

>> and use RAW API.
>
>I can't. I need to implement the BSD sockets API.

This requirement may be the show-stopper.

Bill



