
Re: Getting Started with Hurd-L4


From: Marcus Brinkmann
Subject: Re: Getting Started with Hurd-L4
Date: Mon, 25 Oct 2004 22:35:47 +0200
User-agent: Wanderlust/2.10.1 (Watching The Wheels) SEMI/1.14.6 (Maruoka) FLIM/1.14.6 (Marutamachi) APEL/10.6 Emacs/21.3 (i386-pc-linux-gnu) MULE/5.0 (SAKAKI)

At Mon, 25 Oct 2004 18:30:31 +0100,
Sam Mason <address@hidden> wrote:
> >Both questions are central to the Hurd design decisions.  You should
> >find out quickly that your simple idea will do neither of them, and
> >something more complex is needed.
> 
> I'm of the opinion that you can model very complex abstractions on top
> of my simple idea - from POSIX strangeness to the sorts of algorithms
> that would be needed to route data efficiently between the modules in a
> micro-kernel based operating system.
> 
> I'm interested to hear how your "Containers" work first though.  It
> sounds as though you've spent quite a bit of time thinking about their
> design.

There is some description in hurd-on-l4.tex, but it's still pretty
rough.  The goal of containers is to allow sharing memory between
tasks that do not trust each other, without requiring cooperation
(even a malicious partner can't mess things up too badly).  In
particular, the client can revoke access to a container at any time to
reclaim the resources, and the server has to take into account that
this might happen.

Containers will serve many purposes; in fact, as I see them, they are
the only basic object for everything concerning memory management (so
they can be used in several different ways).  But their main purpose
is to allow sharing memory.

Containers will also make it possible to keep track of who pays for a
resource.

It's probably best to start by considering a couple of common
operations to get a grip on it.  It would be useful to express such
common operations in terms of your alternative model if you want to
suggest it as an option.

For example, the basic operation will be to allocate some frames into
a container, and then to map them into your address space.  This
sounds simple enough, right?  But there are issues.  For example,
physmem must be allowed to unmap the frames whenever it wants, to move
their physical location (to make space for DMA operations or super
pages, for example).
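
Roughly, in C, it could look like the sketch below.  None of these
names are the real physmem interface (that is still being worked out
in hurd-on-l4.tex); they are invented just to illustrate the two
steps:

    /* Hypothetical client-side sketch; the RPC stubs and types are
       invented for illustration, not the real physmem interface.  */

    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t cap_t;      /* capability handle (assumed)  */
    typedef int error_t;

    /* Assumed physmem RPC stubs.  */
    error_t physmem_container_create (cap_t physmem, size_t nframes,
                                      cap_t *container);
    error_t physmem_container_map (cap_t container, uintptr_t vaddr,
                                   size_t nframes);

    static error_t
    setup_anonymous_memory (cap_t physmem, uintptr_t vaddr,
                            size_t nframes, cap_t *container)
    {
      /* Step 1: ask physmem to allocate frames into a new container.  */
      error_t err = physmem_container_create (physmem, nframes, container);
      if (err)
        return err;

      /* Step 2: map the frames into our own address space.  physmem
         may unmap them again at any time (e.g. to relocate the frames
         for DMA or super pages), so a later fault has to go back to
         physmem and re-map.  */
      return physmem_container_map (*container, vaddr, nframes);
    }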

Then you may want to share memory.  If the other task is trusted, you
can just pass a capability to a container containing some frames to
the other task, and it can map the memory as well.  However, if the
task possesses a capability to the real object, it could also start to
allocate memory at your expense, etc.

If you want to share memory with a more or less untrusted task, for
example a filesystem, then you don't pass a cap to the real container,
but to some proxy object that gives limited access to the container.
For example, the proxy object does not allow allocating more frames
than are already in the container (or only a limited amount).  The
access right may be revoked at any time (imagine root doing a kill -9
on the task owning the container!), although normally this won't
happen.  The filesystem can then pass on access to the container to a
device driver, for example for a DMA operation.  A privileged
operation can wire the physical addresses of the container's memory
for the DMA operation.  All this can be done while, for a limited
time, denying the original owner of the container access to the
memory, to keep it from interfering with the DMA operation.  In
addition, Neal was considering some flags that show how many
operations have been performed on the container, to make sure that the
original owner doesn't tamper with it while the filesystem server is
using it.  This is a security measure, although I am not sure that it
is still expected to be needed in recent designs (I can't really think
of a need for it, but maybe I am missing something).
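
Again as a sketch in C, with invented names (physmem_container_restrict,
physmem_cap_revoke and fs_io_read are not real interfaces), the
untrusted case could look like this:

    /* Hypothetical sketch of the untrusted-sharing path; all names
       are invented for illustration.  */

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    typedef uint32_t cap_t;
    typedef int error_t;

    /* Assumed RPC stubs.  */
    error_t physmem_container_restrict (cap_t container,
                                        size_t extra_frames_allowed,
                                        cap_t *proxy);
    error_t physmem_cap_revoke (cap_t proxy);
    error_t fs_io_read (cap_t fs, cap_t container, off_t offset,
                        size_t len);

    static error_t
    read_via_container (cap_t fs, cap_t container, off_t offset,
                        size_t len)
    {
      cap_t proxy;

      /* Hand the filesystem a proxy that cannot allocate new frames
         on our account; at worst it can touch what is already there.  */
      error_t err = physmem_container_restrict (container, 0, &proxy);
      if (err)
        return err;

      /* The filesystem fills the container (possibly passing the
         proxy on to a device driver for DMA) and replies.  */
      err = fs_io_read (fs, proxy, offset, len);

      /* We can revoke the proxy at any time to reclaim the resources;
         here we simply do it once the operation is finished.  */
      physmem_cap_revoke (proxy);
      return err;
    }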

Memory can be mapped privately, or shared.  In the former case, the
question is whether the user sees changes made to the file after the
instantiation of the mapping.  If he should not, then the system must
make a snapshot of the file at the time the mmap is done, and the
obvious question is who pays for the cost of the snapshot?  The user
should, but that's expensive.  If he should see later changes to pages
which have not yet been copied due to copy-on-write, which seems to be
allowed by POSIX, then that might be weird behaviour that programs are
currently not expecting.  If the mapping is not private, changes must
be written back to the filesystem.  The question is when and where to
do that.  It must be the user's responsibility, which means that such
changes can get lost if a kill -9 happens.  All of these are
interesting implications of our design, which seem to be allowed by
POSIX (or not, we are not 100% sure).
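
For the private-mapping corner case, plain POSIX already shows how
loosely this is specified; the following is just an illustration, not
Hurd code:

    /* POSIX illustration of the MAP_PRIVATE corner case: whether a
       change made to the file after mmap() is visible through BUF is
       unspecified as long as we have not yet dirtied (copy-on-write'd)
       the page ourselves.  */

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main (void)
    {
      int fd = open ("/tmp/example", O_RDWR);
      if (fd < 0)
        return 1;

      char *buf = mmap (NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE, fd, 0);
      if (buf == MAP_FAILED)
        return 1;

      /* If another process writes to the file now, POSIX leaves it
         unspecified whether buf[0] shows the old or the new data,
         because this page has not been copied yet.  */
      printf ("%c\n", buf[0]);

      munmap (buf, 4096);
      close (fd);
      return 0;
    }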

Then you have the cache issues.  Filesystems will keep their own
reference to a page they have read, as a cache.  If they have to drop
it due to memory pressure, it will be turned into a soft reference.
If the user changes or drops the frame, then the soft reference will
become stale.  Otherwise the filesystem can later reclaim the data
without reading it from disk again, by turning the soft reference back
into a hard reference.  This reference mechanism for caches is an
important optimization and application of shared memory that must be
supported by the interface (because only physmem can keep track of
such frames).
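
From the filesystem's side, the soft-reference dance could go roughly
like this (again, the physmem_frame_* calls are invented, not a real
interface):

    /* Hypothetical filesystem-side sketch of soft/hard cache
       references; names are invented for illustration.  */

    #include <stdint.h>

    typedef uint32_t cap_t;
    typedef int error_t;

    /* Assumed physmem stubs.  */
    error_t physmem_frame_soften (cap_t frame);  /* hard -> soft      */
    error_t physmem_frame_harden (cap_t frame);  /* soft -> hard,     */
                                                 /* fails if stale    */

    struct cache_entry
    {
      cap_t frame;
      int soft;        /* non-zero if we only hold a soft reference   */
    };

    /* Called under memory pressure: keep only a soft reference so
       physmem may reuse the frame if nobody else holds it.  */
    static void
    cache_shrink (struct cache_entry *e)
    {
      if (!e->soft && physmem_frame_soften (e->frame) == 0)
        e->soft = 1;
    }

    /* Called on a cache lookup: try to get the data back without
       going to disk.  Returns 0 if the cached frame is still valid.  */
    static error_t
    cache_reclaim (struct cache_entry *e)
    {
      if (!e->soft)
        return 0;                     /* still a hard reference       */
      if (physmem_frame_harden (e->frame) == 0)
        {
          e->soft = 0;                /* data survived, reuse it      */
          return 0;
        }
      return -1;                      /* stale: must read from disk   */
    }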

Physmem will also provide an interface that can be used to swap out
data.  (You could swap data out yourself to some private swap space,
but then the frame necessarily becomes _unshared_, defeating some of
the above mechanisms and optimizations --- still, private swap space
is useful for sensitive data, which you may want to encrypt etc., but
of course in that case you will not share it anyway, or only share it
via your own special mechanisms.)

These are just some of the details that, surprisingly enough, I manage
to keep in my head while perusing Hurd-L4 related design ideas and code.

> >Still, one
> >way to catch up and challenge us at the same time is to ask questions
> >about the design (like, why do you need to do this thing here, and
> >why is that thing necessary at that place).  The more specific your
> >questions are, the better.
> 
> I've been trying to figure out how these container things work.  For a
> basic operation, I think the idea is that the client process (what do
> you call these things? I basically mean a running program/server/
> module) would ask physmem for a new container of the appropriate size;
> it would then give this container to the server module, who would put
> the data into it, or get the data out; the server would then give the
> container back to the client; the client would then probably have to
> dump the container that was used for the transaction (I'm assuming
> that there will be some memory pressure here).

Well, see above.  Your description is basically correct, although the
server will only get limited access.

Data can be copied into a container by just installing frame
references from other containers, so the copy is logical and the data
will be effectively shared (with copy on write, probably, etc).
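
In sketch form (with an invented stub, not a real interface), such a
"copy" between containers would just be a matter of installing frame
references:

    /* Hypothetical sketch of a logical copy between containers.  */

    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t cap_t;
    typedef int error_t;

    /* Install references to SRC's frames [src_start, src_start + n)
       into DST at dst_start.  No data is moved; the frames are shared
       copy-on-write until one side writes to them.  */
    error_t physmem_container_copy (cap_t src, size_t src_start,
                                    cap_t dst, size_t dst_start,
                                    size_t n);

    /* Example use: give the other side a logical copy of our whole
       buffer.  */
    static error_t
    logical_copy (cap_t src, cap_t dst, size_t nframes)
    {
      return physmem_container_copy (src, 0, dst, 0, nframes);
    }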

> Assuming I've got all that right, there will be quite a few trips
> through the kernel involved.

Did you write "kernel" by accident here?  Let's say there will be a
couple of RPCs among the tasks involved, yes.

> For a basic file system operation
> there's going to be, at very least, three processes involved in
> getting a block to disk - the client, file-system and device driver.

And physmem ;) The device driver will only be consulted if there is a
cache miss, and if there is a cache miss, you are of course already in
the slow path.  The fast path stops at the filesystem cache.

> If these are all going to have to have an extra journey into physmem
> to move the data around we're looking at least 10 context switches
> between processes before we've even done anything interesting.

Remember that L4 is all about fast IPC and fast context switches.
It's questionable how much of that speed we can actually make use of,
given that our RPC overhead will be considerable (this is also
partially because L4, IMO, lacks some primitive support for
cancellation of RPCs - something that could possibly be fixed by some
small extensions to L4).

In a more sophisticated model, containers could be set up and shared
in advance, even several at a time in one go if you want.  This can
cut down the number of operations considerably.

I don't want to make a full list of all operations.  It's true, it's a
lot.  And you didn't even think of all of them, as there are at least
a couple of operations involving three tasks when a capability is
transferred from one task to another.

Besides combining several RPCs into one (which shouldn't be necessary,
and doesn't, after all, give you much of an advantage, at least it
shouldn't), you can for example try to use 4 MB super pages for file
access, to cut down the number of cache misses (at any level).

> If we're expecting any sort of state back (I think this will happen
> quite a lot as well) then most of the above process would have to be
> repeated - that's a whole lot of cache and TLB thrashing going on!

Yes, probably.  Using containers is not exactly cheap.  We are aware
of that, but we do not see a way to cut back.  Every single operation
seems to be necessary.

> If
> this is a multi-CPU system, I hope that the IPC will only happen
> between single CPUs - otherwise you're going to be getting a lot of
> contention between the CPUs caches as well.  The reason for limiting
> the operation to a single CPU is that the operation is fundamentally
> serial in nature, so moving work across to other CPUs is just going
> to give them lots of unnecessary work to do.

Well, SMP raises a hell of a lot of issues, most of them not well
understood in a multi-server system like the Hurd.

> If things actually end
> up blocking, however, like if we start having to wait for a block
> to come back from the disk, then we can start taking over work from
> other processors.  But if the general course of action is to send
> things to the least used processor things are going to get very slow -
> witness all the stuff that's been happening in Linux recently with
> trying to get processes to "stick" on one processor.  I've put quite a
> lot of thought into moving these sorts of decisions back out of the
> kernel, but that's for another discussion!

We still need someone to design and write the scheduler!

> Anyway, hope that's specific enough!  As I said, I've never actually
> written one of these things before, so it's conjecture that any of
> these things will actually be a problem.

We are in the same learning process.  It's not as if we are
voluntarily adding overhead to the system, just to have nice
abstractions.  From our point of view so far, the abstractions seem to
be strictly necessary in a multi-server system with untrusted tasks
working together.

It is exactly this that can make all talk about L4 performance
deceiving: bare-bones L4 IPC is so minimal that its performance
doesn't reveal anything about actual RPC performance in the context of
a fully-featured operating system.

Now, you could blame the Hurd, for example for sticking rigorously to
the POSIX semantics of certain operations.  But we are more or less
directly competing with existing, POSIX-compatible operating systems,
so this is our challenge.  And we have all heard the warning that
those who ignore POSIX are doomed to reinvent it, poorly :)

Here is an example, this time from the capability system: we need to
make most operations cancellable.  This means that if a signal arrives
that should be handled by the thread, the RPC operation must be
properly canceled.  However, this can only be done with server-side
support: the server must be informed about the cancellation.  If we
simply abort the IPC without letting the server know, we are violating
the protocol and leave our client-server connection in an undefined
state.  The server would have no way to find out that we canceled the
operation short of trying to send a reply, at which time the reply
would get lost, which is sometimes fatal for non-recoverable,
non-repeatable operations.
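
To make this concrete, here is a rough sketch of what the server side
could do when it is told about a cancellation.  The names and the
message layout are invented; this is not the real Hurd RPC machinery:

    /* Hypothetical server-side sketch: an explicit cancel request
       lets the server abandon a pending operation and still send the
       reply the protocol expects.  All names are invented.  */

    #include <errno.h>
    #include <stdint.h>

    typedef int error_t;

    struct pending_op
    {
      uint32_t client_thread;  /* thread waiting for our reply        */
      int canceled;            /* set when a cancel request arrives   */
    };

    /* Assumed reply stub.  */
    error_t reply_to_client (uint32_t client_thread, error_t result);

    /* Called when the client sends a "cancel" request for OP.  */
    static void
    handle_cancel (struct pending_op *op)
    {
      op->canceled = 1;
    }

    /* Called by the worker when the operation reaches a point where
       it can be abandoned safely.  The reply is still sent, so the
       client knows the exact state of the connection.  */
    static error_t
    maybe_abort (struct pending_op *op)
    {
      if (op->canceled)
        return reply_to_client (op->client_thread, EINTR);
      return 0;
    }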

But L4 provides little support for such RPC cancellation.  The IPC
mechanism in L4 requires us to use zero timeouts on the server side to
prevent DoS attacks.  This means that the client thread must block on
the receive after the send, and must not be stopped or aborted.  I
don't want to go into details here; you can find lots of mails from me
on this list and on l4ka@ about the race conditions involved.

The point here is that we wouldn't have this problem in the first
place if we did not want RPCs to be cancellable.  But this would mean
that you couldn't interrupt a read() from a pipe with ^C, for example.
IOW: the resulting operating system would suck from a usability point
of view.

The only way to implement cancellation semantics on L4 correctly
involves a task-global lock which must be taken and released before
and after every RPC you are doing.  Now, there can be lots of reasons
why we need such a lock anyway (signal handler stuff, critical section
issues, etc).  But in any case, this is overhead for every RPC
operation that does not show up in any analysis of the raw IPC
performance of L4.  The same goes for setting up trusted buffer
objects like containers (google for "IPC-Assurance.ps" to find a paper
by Shapiro explaining issues related to that).
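
As a sketch (again, the bookkeeping and the l4_rpc_call stub are
invented, this is not our actual code), the lock would wrap every
outgoing RPC roughly like this:

    /* Hypothetical client-side sketch of the task-global RPC lock.  */

    #include <pthread.h>
    #include <stdint.h>

    typedef uint32_t cap_t;
    typedef int error_t;

    /* Assumed blocking RPC primitive (send + wait for the reply).  */
    error_t l4_rpc_call (cap_t server, void *msg);

    static pthread_mutex_t rpc_lock = PTHREAD_MUTEX_INITIALIZER;
    static __thread cap_t current_rpc_server;  /* 0 = not in an RPC */

    static error_t
    cancellable_rpc (cap_t server, void *msg)
    {
      error_t err;

      /* Record, under the lock, that this thread is now inside an RPC
         to SERVER, so the signal machinery knows it has to ask that
         server to cancel instead of aborting the IPC behind its back.  */
      pthread_mutex_lock (&rpc_lock);
      current_rpc_server = server;
      pthread_mutex_unlock (&rpc_lock);

      err = l4_rpc_call (server, msg);

      /* Clear the bookkeeping; from here on a signal is handled
         directly instead of being turned into a cancellation.  */
      pthread_mutex_lock (&rpc_lock);
      current_rpc_server = 0;
      pthread_mutex_unlock (&rpc_lock);

      return err;
    }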

> >>Why have you gone for a micro-kernel based design? :)
> >
> >Because we believe in the benefits of it,
> >and nobody else seems to be doing it.
> 
> Sorry, that comment was made in jest (hence the smiley)!  I totally
> agree with you that a micro-kernel based design is the way to go.

I am more careful than you.  We have not yet proven it.  We all want
it to be the way to go, but there are serious obstacles ahead, and you
have described some in your mail, which is fine work by you.  Keep it going!

Yet, look at the Hurd running on Mach.  It is slow, but it is also
tolerable in many situations.  And the implementation of the Hurd on
Mach is, with respect to performance, often careless, sometimes
downright harmful, and definitely not very optimized.  So, even if
using a container involves a ton of context switches, we should not
let this discourage us from even trying it.  Actual performance is
measured in the field.  So let's work on the implementation, and then
look at the result.

In particular, we are not aiming for a high-performance system right
away.  We can perfectly well live with a small performance hit in a
first implementation.  The performance of systems like Linux is also
the result of years of costly efforts to benchmark and optimize
various aspects of the system.  It's not something you can do first;
it needs to come after the overall design and substantial amounts of
implementation.

Thanks,
Marcus




