With the coprocessor being so much slower than the host, I'm really
concerned about the overall effect upon latencies, and perhaps even
bandwidth. You could end up reducing TCP/IP performance by adding
coprocessor functionality. I would again urge you to look at the
fraction of time your host is spending in the TCP/IP stack, if at all
possible. If you are bound by stack performance, that may devolve to
determining the amount of time you are spending in the kernel as
opposed to your app(s). If that fraction is small, then it may not be
worth your while to try to reduce it. For example, if you are spending
90% of your time in your app and 10% of your time in TCP/IP, then
cutting the TCP/IP time in half would only net you a small change in
overall performance.

If your protocol is heavily acknowledged and you find yourself bound by
the performance of the protocol, any additions to latencies will end up
making you slower, not faster. All that is
speculation on my part, of course. You could be compute bound with a
streaming TCP/IP output, in which case additions to latencies wouldn't
have any effect at all.
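
To put rough numbers on the 90%/10% example above, here is a minimal sketch of the usual Amdahl's-law arithmetic in C. The name overall_speedup() is only an illustration and has nothing to do with lwIP, and the sketch assumes that app time and stack time simply add up, with no overlap between them.

```c
#include <assert.h>

/* Overall speedup when only part of the run time is accelerated
 * (Amdahl's law). Hypothetical helper, for illustration only.
 *   fraction: share of total time spent in the part being sped up (0..1)
 *   factor:   how many times faster that part becomes (> 0)
 */
double overall_speedup(double fraction, double factor)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(factor > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}
```

With fraction = 0.10 and factor = 2.0 this works out to 1/0.95, or roughly a 5% gain. Even an infinitely fast stack would cap the gain at 1/0.90, about 11%.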

As for the RTOS question, you can find some surprisingly small ones.
We have used uC/OS-II without being horrified by its size. Depending
on the CPU you are using in the coprocessor, you may find that you have
some pretty good options.

I do believe that it would probably be easiest to put on a top layer as
you describe, but I also think that it would be feasible to transport
the messages to the tcp thread as you originally described. As you
note, there are some difficulties, and it is possible that the message
contents will have to be augmented to deal with some of the existing
data references. In either event, you will almost certainly find
yourself tinkering with the stack in one way or another. The good news
is that with a small open source project like this, it is definitely
feasible to do this. The bad news is that it can still be a fair
amount of work.

I'm really a bit conflicted about this. On the one hand, it does sound
like a really interesting thing to do technically. On the other, it
may actually end up costing you in system performance. I hope you'll
be able to make a good analysis of the likely outcome before you commit.

Curt McDowell wrote:

Thanks for the input, Jim.

> As for the performance
> improvement, that's a very significant question. First, I think that
> it is important to ask what kind of performance improvement you seek.
> If you are just seeking to offload the host, so that it can go on to do
> some other task faster, then you stand a reasonable chance of seeing
> that happen. If you are ultimately seeking to increase TCP/IP
> throughput, that will be a more difficult road.

In our case, the host processor would be about 4 times as powerful as
the coprocessor. The coprocessor has some spare cycles, and it'll be
there regardless of whether it ends up doing TOE. The goal is simply
to reduce CPU consumption on the host processor with no reduction in
throughput. The MAC has no checksum acceleration, so that's actually
one of the most important things to off-load.
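
For reference, the checksum work in question is the 16-bit one's-complement Internet checksum described in RFC 1071, which has to touch every byte of every segment when the MAC cannot do it in hardware. The following is a minimal, unoptimized sketch in C. The name inet_checksum() is hypothetical and this is not lwIP's own routine (lwIP ships its own checksum code). It assumes buffers of at most 64 KB so that the 32-bit accumulator cannot overflow, and it leaves out the pseudo-header that a real TCP checksum also covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement Internet checksum (RFC 1071) over a byte buffer.
 * Bytes are summed as big-endian 16-bit words; the result is the
 * complemented sum as a host-order integer.
 * Assumption: len <= 65535, so the 32-bit accumulator cannot overflow.
 */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(data != NULL || len == 0);

    /* Add up whole 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }

    /* An odd trailing byte is padded with a zero low byte. */
    if (len == 1) {
        sum += (uint32_t)data[0] << 8;
    }

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }

    return (uint16_t)~sum;
}
```

Running this over the example bytes from RFC 1071 (00 01 f2 03 f4 f5 f6 f7) gives 0x220d. The per-byte loop is the cost that would move off the host if the coprocessor took over checksumming.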

> I feel that
> your assessment of feasibility is sound and that your list of problems
> and their resolution is reasonably complete. Something always shows up
> in implementation, and I'm sure that your project will be no exception,
> but I do think that your design is solid.

I'm finding that splitting the modules in the
manner depicted is not so easy after all. E.g., for efficiency reasons
the top layer routine netconn_write() calls tcp_sndbuf(),
which peeks in the bottom layer data structure. It's tempting to just
add a top layer to RPC the whole sockets API (but unfortunately, the
tiny RTOS on the TOE processor would then need to support threads).

Curt McDowell wrote:

I'm looking into using lwIP as the basis for a TOE (TCP/IP offload
engine). If I understand correctly, the lwIP environment is
implemented as one thread for the IP stack, and one thread for each
application:

  APPLICATION THREAD                       IP STACK
  Sockets <-> API-mux <------------> API-demux <-> Stack <-> netif

This architecture appears to lend itself fairly well to the following
TOE implementation (actually, SOE, as it would be a full sockets
offload engine):

            PROCESSOR                        TOE ADAPTER W/ EMBEDDED CPU
+-------------+   +--------------+            +-------+   +----------+
| App using   |---| lwIP library |------------| lwIP  |---| Network  |
| sockets API |   | Sockets API  |  Hardware  | stack |   | hardware |
+-------------+   +--------------+    bus     +-------+   +----------+

- Does this assessment sound correct?
- Could a significant performance improvement be realized, compared
to using a host-native IP stack?
- Is anyone else interested in this type of application?

The only problems that I see are with the mbox transport mechanism, in
that it assumes a shared address space.
- It would need to send the data, instead of pointers to the data.
- It would need to send messages for event notifications instead of
- Message reception on either side of the hardware bus would
be signaled through interrupts.
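
The list above amounts to marshalling: anything that crosses the hardware bus has to carry its data by value, in a byte layout both CPUs agree on. Here is a minimal sketch in C under stated assumptions. The toe_msg_* names, the 6-byte header, the two message types and the 1460-byte payload limit are all hypothetical; none of them come from lwIP or its mbox API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOE_HDR_SIZE        6u     /* hypothetical: type, pad, conn_id, len */
#define TOE_MSG_MAX_PAYLOAD 1460u  /* hypothetical per-message limit */

enum toe_msg_type { TOE_MSG_SEND = 1, TOE_MSG_EVENT = 2 };  /* hypothetical */

/* Encode one message into out[], copying the payload bytes themselves
 * rather than a pointer to them. Multi-byte fields are written
 * big-endian so both CPUs agree regardless of their native byte order.
 * Returns the number of bytes written, or 0 if the message does not fit.
 */
size_t toe_msg_encode(uint8_t *out, size_t out_size, uint8_t type,
                      uint16_t conn_id, const void *payload, uint16_t len)
{
    assert(out != NULL);
    assert(payload != NULL || len == 0);

    if (len > TOE_MSG_MAX_PAYLOAD || out_size < TOE_HDR_SIZE + (size_t)len) {
        return 0;
    }
    out[0] = type;
    out[1] = 0;                        /* reserved */
    out[2] = (uint8_t)(conn_id >> 8);
    out[3] = (uint8_t)(conn_id & 0xFFu);
    out[4] = (uint8_t)(len >> 8);
    out[5] = (uint8_t)(len & 0xFFu);
    if (len > 0) {
        memcpy(out + TOE_HDR_SIZE, payload, len);
    }
    return TOE_HDR_SIZE + (size_t)len;
}

/* Decode a received message. On success returns 1 and points *payload
 * at the data inside in[]; returns 0 if the buffer is short or the
 * length field is out of range.
 */
int toe_msg_decode(const uint8_t *in, size_t in_size, uint8_t *type,
                   uint16_t *conn_id, const uint8_t **payload, uint16_t *len)
{
    uint16_t n;

    assert(in != NULL && type != NULL && conn_id != NULL);
    assert(payload != NULL && len != NULL);

    if (in_size < TOE_HDR_SIZE) {
        return 0;
    }
    n = (uint16_t)((in[4] << 8) | in[5]);
    if (n > TOE_MSG_MAX_PAYLOAD || in_size < TOE_HDR_SIZE + (size_t)n) {
        return 0;
    }
    *type = in[0];
    *conn_id = (uint16_t)((in[2] << 8) | in[3]);
    *payload = in + TOE_HDR_SIZE;
    *len = n;
    return 1;
}
```

In such a scheme the sender would encode into a bus buffer and raise an interrupt on the other side, and the receiving interrupt handler would decode and hand the message to the local mbox. An event notification would simply be a TOE_MSG_EVENT message with a short or empty payload.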
|Gibbons and Associates, Inc.
|TEL: (408) 984-1441
|900 Lafayette, Suite 704, Santa Clara, CA
|FAX: (408) 247-6395