From: Jörg F. Wittenberger
Subject: Re: [Chicken-hackers] [PATCH] Avoid context switch during TCP errno reporting
Date: 20 Mar 2013 14:27:41 +0100
Hi all,
I'm not yet convinced that this patch will fix everything screwed
up by use of the tcp implementation.
Over the past few days I wrote a replacement for my own use. (It is
a bit incomplete wrt. API compatibility with the tcp unit, and I put
it into the module that drives the SSL/TLS implementation I have
been using for the past couple of years; hence that module contains
this code too, plus some SOCKS4a setup code for use with Tor. If
anybody is interested, I'll forward the code or post it here,
whichever you guys prefer.)
During the development I learned that Peter is *absolutely
correct* about the "strange error message" I needed help to
interpret a few days back. (The logged error indicated that a
type test, which happened to read (struct <uri>), failed even
though the failing type was the same as the required one.)
This is obviously due to some stack corruption.
The same appears to apply to several other errors I observed
all of a sudden, most prominent among them "heap full while resizing".
With the tcp replacement code, all those frequent errors are
suddenly gone. The code now runs as stably as before.
However, I found a way to reliably trigger the problem anyway.
(Just run PLT's DNS resolver code against a BIND server
AND close the port underneath it. That is, run a slightly modified
version which re-uses the tcp connection.)
I'll explore this over the next few days. For the time being I'm not
re-using the tcp connection.
The code I wrote however is a major deviation from the existing
tcp code internally.
1.) No timeout parameters. (At least not at the lowest level.)
Why?
The Askemos/BALL code implements replication of sqlite3 databases
and files in a way similar to bittorrent. This type of p2p
network application is subversive. You have to deal with all
sorts of failures in the network, plus hostile clients.
a) In such a context it's little fun to maintain the timeout
on a call-by-call basis.
b) You want to have all sorts of different timeouts. E.g.,
wait for the HTTP keep-alive time for the next request line,
wait a reasonably short amount of time for the next chunk
in chunked encoding, even less for the next chunk header line.
c) Almost all timeouts never kick in. Thus the overhead of
inserting them into the timeout queue just to remove them
a fraction of a second later turns out to be expensive
and a huge slowdown for the overall i/o throughput.
This is even true with the scheduler improvements I posted
here (or on chicken-users?) before, which would replace
the linear list for timeouts with an LLRB tree.
Therefore I'm using a different timeout handling, where
timeouts are inserted into a mailbox and the entry is
kept at the call site. Instead of removing the timeout
from the full list, the entry is invalidated. Once a second
the timeout queue/mailbox is replaced with a fresh one,
and in the next run those timeouts which were not yet
invalidated are actually made active. Rather complicated
to describe, but much, much faster to execute.
2.) The low-level code structure is kept more akin to the
way it's handled in RScheme, because this avoids those
tricks to distinguish ports by their port-data to
eventually figure out the tcp-addresses.
3.) Avoid passing DNS names to tcp-connect. It depends on the
obsolete (as per the Linux manual at least) gethostbyname,
which could block threading for too long.
Do a DNS host lookup instead.
4.) Don't duplicate code from library.scm ##sys#thread-yield!
to "yield". Use srfi-18 thread-yield! instead.
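The timeout handling described under 1.) can be sketched roughly as
follows. This is only an illustration in Python, not the Scheme code
discussed in this mail: the names TimeoutEntry, LazyTimeouts, add,
sweep and fire_due are hypothetical, and the binary heap used for the
active timeouts is an assumption standing in for whatever ordered
structure the scheduler really uses. The point it tries to show is that
adding and cancelling a timeout are both O(1), and only timeouts that
survive until the periodic sweep pay for an ordered insert.

```python
import heapq
import itertools


class TimeoutEntry:
    """Handle kept at the call site; cancelling only flips a flag."""

    __slots__ = ("deadline", "callback", "valid")

    def __init__(self, deadline, callback):
        self.deadline = deadline
        self.callback = callback
        self.valid = True

    def cancel(self):
        # O(1): the entry is invalidated, not searched for and removed.
        self.valid = False


class LazyTimeouts:
    def __init__(self):
        self._mailbox = []              # cheap append-only inbox
        self._active = []               # heap of (deadline, seq, entry)
        self._seq = itertools.count()   # tie-breaker for equal deadlines

    def add(self, deadline, callback):
        """Register a timeout; returns the entry for later cancel()."""
        entry = TimeoutEntry(deadline, callback)
        self._mailbox.append(entry)
        return entry

    def sweep(self):
        """Meant to run about once a second: swap in a fresh mailbox and
        activate only the entries that were not invalidated meanwhile.
        Returns the number of currently active timeouts."""
        pending, self._mailbox = self._mailbox, []
        for entry in pending:
            if entry.valid:
                heapq.heappush(
                    self._active, (entry.deadline, next(self._seq), entry))
        return len(self._active)

    def fire_due(self, now):
        """Run callbacks of active timeouts whose deadline has passed.
        Returns how many callbacks were actually called."""
        fired = 0
        while self._active and self._active[0][0] <= now:
            _, _, entry = heapq.heappop(self._active)
            if entry.valid:             # may have been cancelled after sweep
                entry.valid = False
                entry.callback()
                fired += 1
        return fired
```

A caller would keep the value returned by add and call cancel on it as
soon as the i/o completes. The price of this scheme is precision: a
timeout can fire up to one sweep interval late, which is acceptable
when almost all timeouts never kick in.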
Best
/Jörg
PS/BTW: in "extras" read-line there is a local definition
"fixup", which is unused.
On Mar 18 2013, Jim Ursetto wrote:
Here's a full patch that avoids context switches screwing up the error
message reported to the user and also consolidates much of the error
handling.
I think this patch is sufficient because the only actual issue, as I
understand it, is that under high load you will occasionally get an
incorrect error message (typically, "operation in progress") instead of
the real error message; an exception will still fire regardless.
Disabling interrupts instead is probably overkill, unless you know that
won't cause hangs.
Also the patch doesn't do any harm and cleans up the code a bit, so you
can still apply a different fix on top of it.
Jim