[lwip-devel] Re: [task #7040] Work on tcp_enqueue


From: Jakob Stoklund Olesen
Subject: [lwip-devel] Re: [task #7040] Work on tcp_enqueue
Date: Sat, 31 Jan 2009 14:59:15 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (gnu/linux)

OK, back on lwip-devel...

Jonathan Larmour <address@hidden> writes:
> Follow-up Comment #11, task #7040 (project lwip):
>
> Re comment #8: I think that this is leaning even further towards the argument
> of shaking up the raw TCP API by allowing data to be written as pbufs rather
> than char * + lengths.

I don't think I understand why this would be a good idea. Could you
elaborate, please?

For UDP I get it: UDP protocols are often very simple. You allocate a
pbuf, cast p->payload to a struct pointer, fill out the fields and send
it. Very easy.
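
A minimal sketch of that UDP pattern, for illustration only. To keep it
self-contained it uses a stand-in `mini_pbuf` type and allocator rather
than lwIP's real pbuf API, and `sensor_msg` is a made-up message layout;
both are assumptions, not part of lwIP.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for a single, unchained packet buffer (NOT lwIP's struct pbuf). */
struct mini_pbuf {
    void    *payload;  /* start of the data area */
    uint16_t len;      /* size of the data area in bytes */
};

/* Hypothetical fixed-layout application message. */
struct sensor_msg {
    uint16_t id;
    uint16_t value;
    uint32_t timestamp;
};

/* Allocate header and data area in one block; release with free(). */
static struct mini_pbuf *mini_pbuf_alloc(uint16_t len)
{
    assert(len > 0);
    struct mini_pbuf *p = malloc(sizeof *p + len);
    if (p == NULL) {
        return NULL;
    }
    p->payload = p + 1;  /* data area directly follows the header */
    p->len = len;
    return p;
}

/* Allocate a buffer, cast the payload to the message struct, fill it in.
 * A real protocol would also convert the fields to network byte order. */
static struct mini_pbuf *build_sensor_msg(uint16_t id, uint16_t value,
                                          uint32_t timestamp)
{
    struct mini_pbuf *p = mini_pbuf_alloc((uint16_t)sizeof(struct sensor_msg));
    if (p == NULL) {
        return NULL;
    }
    struct sensor_msg *m = (struct sensor_msg *)p->payload;
    m->id = id;
    m->value = value;
    m->timestamp = timestamp;
    return p;
}
```

The whole message lives in one contiguous buffer, which is why the cast
is safe here and why the application never has to look at chain fields.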

TCP is more difficult: You want to send a 1500-byte message. You call
tcp_pbuf_alloc and get back a pbuf chain with 17+1460+23 bytes because
that layout gives the best TCP segmentation. Now you have to deal with
next pointers, len/tot_len, and alignment issues. You give up, malloc
1500 bytes, format your message, and copy it into the pbuf chain.

Annoying example, I know, but not unthinkable.
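
To make the fallback concrete, here is a rough sketch of the "format in
a flat buffer, then copy into the chain" step. The `mini_pbuf` type and
both helpers are illustrative stand-ins that only mirror the next, len
and tot_len fields mentioned above; they are not lwIP definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a chained packet buffer (NOT lwIP's struct pbuf). */
struct mini_pbuf {
    struct mini_pbuf *next;     /* next buffer in the chain, or NULL */
    uint8_t          *payload;  /* this buffer's data area */
    uint16_t          tot_len;  /* this buffer plus all following ones */
    uint16_t          len;      /* this buffer only */
};

/* Attach storage to p and link it in front of next (which may be NULL). */
static void chain_link(struct mini_pbuf *p, uint8_t *buf, uint16_t len,
                       struct mini_pbuf *next)
{
    p->next = next;
    p->payload = buf;
    p->len = len;
    p->tot_len = (uint16_t)(len + (next != NULL ? next->tot_len : 0));
}

/* Copy a flat message into the chain, walking the next pointers.
 * Returns the number of bytes copied, which is less than n if the
 * chain is too short. */
static size_t chain_fill(struct mini_pbuf *head, const uint8_t *src, size_t n)
{
    assert(head != NULL && src != NULL);
    size_t copied = 0;
    for (struct mini_pbuf *q = head; q != NULL && copied < n; q = q->next) {
        size_t chunk = n - copied;
        if (chunk > q->len) {
            chunk = q->len;
        }
        memcpy(q->payload, src + copied, chunk);
        copied += chunk;
    }
    return copied;
}
```

With a 17+1460+23 chain, a 1500-byte message ends up split across three
separate data areas, so any struct that straddles byte 17 or byte 1477
cannot be filled in through a simple pointer cast.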

I went back and read some of the old discussions about this issue, in
particular task #6735. I don't think they covered how TCP segmentation
interacts with application code behaviour.

I think we need to consider how lwIP is used, looking at the full
software stack.

In a small system with very limited resources, you would write
specialized application code directly on top of the raw API:

  App code -> Raw API

A larger system with more complex software would provide some form of
abstraction:

  App code -> Adaptation layer -> Raw API

The adaptation layer could be netconn for threads, sockets for legacy
code, or something different. (I use C++-based asynchronous message
passing.)

In the first scenario you want the raw API to be relatively easy to
use. There should not be too many special cases, or you will get obscure
bugs. In the second scenario, the app code should be oblivious to issues
like segmentation and scatter-gather DMA. The adaptation layer should be
able to do a decent job with different traffic patterns (within reason).

Typical traffic patterns include:

Small writes: Syslog over TCP. Each write is 20-120 bytes, no alignment,
multiple writes must be combined into one segment for proper throughput.

Medium-sized writes: CORBA IIOP, iSCSI. It would be reasonable to send
each write in its own segment most of the time. The application might
even set TCP_NODELAY to encourage that.

Large writes: FTP, HTTP. Throughput is important. Large chunks of
contiguous data are available, so full-sized zero-copy segments should
be possible.

In the case of small writes we probably cannot expect zero-copy
transmission, but single-copy would not be unreasonable. In my system,
small writes are copied twice: Once in tcp_enqueue, and once in the
driver because it cannot handle the long irregular pbuf chains.
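
One way to get small writes down to a single copy is to append each
write directly to the tail of a contiguous, not-yet-sent segment buffer.
The sketch below is only an illustration of that idea; `seg_buf`,
`seg_append` and the 1460-byte `SEG_MSS` are assumed names and values,
not anything tcp_enqueue does today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_MSS 1460u  /* assumed maximum segment payload */

/* One contiguous, not-yet-sent segment that small writes accumulate in. */
struct seg_buf {
    uint8_t data[SEG_MSS];
    size_t  used;  /* bytes already queued in data[] */
};

/* Append as much of the write as fits in the segment.
 * Returns the number of bytes taken; the caller starts a new segment
 * for whatever is left over. */
static size_t seg_append(struct seg_buf *s, const void *src, size_t n)
{
    assert(s != NULL && src != NULL);
    assert(s->used <= SEG_MSS);
    size_t room = SEG_MSS - s->used;
    size_t take = (n < room) ? n : room;
    memcpy(s->data + s->used, src, take);
    s->used += take;
    return take;
}
```

Because the segment stays contiguous, the driver sees one regular buffer
instead of a long irregular chain, which would avoid the second copy
described above.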

Medium-sized writes should be zero-copy if the app code delivers aligned
data. Of course this depends on the particulars of the driver.

Large writes should always be zero-copy if at all possible.

When the driver requires data in a special region of memory, we should
aim for single-copy transmission. Zero-copy would require bad layering
violations.

It would be possible to make a 100% zero-copy API, but in reality you
would just be moving the copying into the app code.

My point is this: I would prefer a clean API that allows single-copy to
a complicated one that supports zero-copy in every case. When we copy
data, we should make the most of it: Make sure that it only happens
once, and calculate a checksum while we are at it.
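
As a sketch of "copy once and checksum while we are at it", the function
below copies a block and accumulates the 16-bit ones' complement
Internet checksum (as described in RFC 1071) in the same pass. The name
`copy_and_checksum` is made up for this example, and the result covers
only the copied bytes; a real TCP checksum also includes the TCP header
and pseudo-header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes from src to dst and return the Internet checksum of the
 * copied bytes, treating them as big-endian 16-bit words. An odd
 * trailing byte is padded with a zero byte for the sum only. */
static uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    assert(n <= 0xffffu);  /* keeps the 32-bit accumulator from overflowing */

    uint32_t sum = 0;
    size_t i = 0;

    /* Copy and sum two bytes at a time. */
    for (; i + 1 < n; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }

    /* Odd trailing byte. */
    if (i < n) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xffffu) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

The data is read once and written once, so the checksum costs no extra
pass over memory.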

/stoklund





