Re: self-paging

From: Bas Wijnen
Subject: Re: self-paging
Date: Wed, 21 Dec 2005 18:36:56 +0100
User-agent: Mutt/1.5.11

On Wed, Dec 07, 2005 at 12:06:38PM -0500, Jonathan S. Shapiro wrote:
> > No, I haven't.  If two processes A and B both hold a page in memory, then
> > A can throw it out of its own address space.  It cannot force the page to
> > be really swapped out.  Assume A has 10 pages, 3 of which are swapped out
> > (so only P[0]...P[6] are actually in memory).  P[5] and P[6] are shared
> > pages, P[8] is not shared, and there are no free pages at all (I think
> > this is a rare thing, but it does happen every now and then of course).
> > Now A exchanges P[6] and P[8].  Then its quota immediately decreases by
> > one, so only P[0]...P[5] are still in memory.  B doesn't notice a thing.
> Perhaps this is what you intended to describe all along, but it's a
> significant change from what I understood. What you now appear to be saying
> is that a frame pool is really a sponsorship mechanism. A given frame can be
> dominated by different frame pools in different address spaces, and if so it
> may be multiply sponsored. It remains pinned in memory as long as it is
> sponsored by at least one frame pool.

Yes, this is what I was trying to say.  Sorry for not making myself clear.

> Conceptually, this is a workable design. We are going to run into two very
> complex implementation issues:
> 1. Underutilization.
> We will soon conclude that every RT process must sponsor libc (actually,
> libc may be important enough to pin as a system matter, but there will turn
> out to be some library that gets multiply sponsored by a bunch of apps, but
> isn't important enough to sponsor globally). This requires a surprisingly
> large number of sponsorship tickets.
> The problem with this is that the overall total number of system sponsorship
> tickets must never exceed the number of pinnable frames.

This is only a problem when there are a large number of real-time tasks on the
system (where a real-time task is defined as a task which can get a non-zero
quota on hard resources).  I wasn't planning to allow that.  It's perfectly
fine with me if there's a limit of, say, 30% of all memory that can be pinned
this way, and the system will simply refuse to make the guarantee beyond that
(which doesn't mean the program will not work, of course, but it may mean
that, for example, the CD-burning program will refuse to start).
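Such an admission check is straightforward to state precisely.  Here is a minimal sketch, assuming a 30% cap; the names (PinAdmission, request_guarantee) are invented for illustration, not part of any existing design:

```python
PIN_FRACTION = 0.30  # assumed cap: at most 30% of memory backs guarantees

class PinAdmission:
    def __init__(self, total_frames):
        self.pin_limit = int(total_frames * PIN_FRACTION)
        self.pinned = 0  # frames currently promised to real-time tasks

    def request_guarantee(self, frames):
        """Grant a hard quota, or refuse so the guarantee is never a lie."""
        if self.pinned + frames > self.pin_limit:
            return False  # e.g. the CD-burning program refuses to start
        self.pinned += frames
        return True

    def release_guarantee(self, frames):
        self.pinned -= frames
```

The point of refusing up front is that the sum of outstanding guarantees can never exceed the number of pinnable frames, which is exactly the constraint Shapiro raised.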

To reduce the problem further, we can allow a process to register a read-only
page as "permanent", meaning it will never be swapped out (obviously only
real-time tasks can do that, since others can get a quota of 0, meaning they
are completely swapped out).  If they do that, the system knows it can safely
give that part of the guaranteed quota away as long as that page is shared
(and when it is no longer shared, the quota of the task which stops sharing
drops, so there is no problem with guarantees).  I'm not sure if this is very
clear.  I hope the list of operations below helps with that.
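The accounting idea can be sketched as follows (a toy model, with invented names; sponsor_count stands in for whatever sponsor bookkeeping the kernel keeps): a guaranteed frame backing a "permanent" page may be lent out only while some other task also sponsors that page.

```python
def lendable_frames(guaranteed, permanent_pages, sponsor_count):
    """How many guaranteed frames the system may safely give away.

    permanent_pages: pages this task has marked "permanent"
    sponsor_count: map page -> number of tasks currently sponsoring it
    A frame is lendable only while someone else also sponsors the page,
    so the guarantee still holds when sharing ends (the quota of the
    task that stops sharing drops at that moment).
    """
    lent = sum(1 for p in permanent_pages if sponsor_count.get(p, 0) > 1)
    return min(lent, guaranteed)
```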

> 2. Reverse fan-out
> This is a kernel implementation issue. We will now need a mechanism to
> back-trace from a physical frame to all of its current sponsors. This is
> very hard to handle under a constant storage constraint.

This is the kind of problem I was talking about, which I described as
"solvable".  I have no idea how, but I thought it should be.  Do you think it
may be impossible to solve (given the constraints)?

> 3. Checkpointing
> Pinning a page isn't enough. We also have to ensure that it remains promptly
> accessible. If the page is clean this is no problem. If the page is dirty
> and subject to checkpoint, it becomes subject to in-memory copy on write
> behavior and we need to ensure that we reserve enough frames for this to
> happen. One saving grace here is that the redundant sponsorship will tend to
> lead to lots of available frames that can be used for copy on write
> purposes.
> In most RT situations requiring large amounts of memory, however, the bulk
> of dirty pages will be exempted from checkpoint. There is no point
> checkpointing a frame buffer or audio buffer, for example.

I don't know enough about the checkpointing mechanism to fully understand this
problem, but here are some ideas about it:
The checkpoint is not an unexpected event.  So it should be possible to
reserve the needed frames for this purpose before starting the checkpoint
procedure.  The maximum number of pinnable pages may need to have some upper
limit to make this possible.  I wanted a maximum anyway, and obviously there
is an upper limit of "all memory", so I don't think it should be a problem if
that limit is a bit lower.
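A minimal sketch of that reservation check, under the stated assumptions (the function names are invented, and the exemption set corresponds to buffers like frame or audio buffers that are not worth checkpointing):

```python
def cow_reserve(dirty_pages, exempt_pages):
    """Frames to reserve before a checkpoint: every dirty page subject
    to checkpoint may need one copy-on-write frame.  Exempt pages
    (frame buffers, audio buffers) need none."""
    return len(dirty_pages - exempt_pages)

def can_start_checkpoint(free_frames, dirty_pages, exempt_pages):
    """Since the checkpoint is not an unexpected event, claim the COW
    frames up front; the pin limit is chosen so this can succeed."""
    return free_frames >= cow_reserve(dirty_pages, exempt_pages)
```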

> > > The reason your design has this problem is that you have the quotas
> > > attached in the wrong place. What you need to resolve this is a
> > > subdivisible quota mechanism that allows you to attach different quotas
> > > to different pools. A shared pool of frames needs to have exactly one
> > > quota, and this quota should not (usually) be used to cover any other
> > > pool.
> > 
> > If you have a pool of pages which is shared, and residency is a property
> > of the pool, then who will decide when to swap out a page?
> The fault handler for the pool. The key idea here is that if you and I are
> sharing a resource, and we both have a residency requirement, we can often
> manage the residency issues jointly. Perhaps more important, we want to
> separate the sponsoring container from the process so that a multithreaded
> application can manage sponsorship.

Where the sharers cooperate, this makes sense.  In fact, I think that where
cooperation (and communication) is possible, things will work out however
they are organised.  My thoughts go mostly to sharing between tasks which are
not allowed to communicate, for example two random processes using the same shared
library.  If there is a manager of the shared pages, it would be the file
system.  But that's about as bad as the kernel: the file system knows nothing
about the needs of the sharing processes, and thus cannot make a good decision
about when to swap out the pages.  So what we want is to give the decision to the
tasks themselves.  But they must not be allowed to really swap out the page,
because that would be a denial of service to the other sharing tasks, and a
means of communication.  My proposal gives the decision to the tasks
themselves, but doesn't have these problems.

> > I don't think any of these can both make the correct decision and be
> > trusted enough to get the power.  In my system, a page is swapped out when
> > no process needs it enough that it will use its quota for it.  That sounds
> > like a good criterion to me.  If you want to implement this with the pool
> > approach, then there needs to be an extra accounting process which seems
> > to add a lot of unneeded complexity.
> I think you are forgetting that *any* process that can pin resource is
> partially trusted.

I don't see why.  If I (as a user) want to see what a certain program does,
and it needs to pin some pages, then I can decide to give it a quota for that.
That doesn't make the program trustworthy for the system though.  It just
allows it to pin the resource it received from me, and only until I revoke it
again (most probably by killing the program).

> This tends to make the problem a little easier. In truth,
> I'm not so convinced about shared management. What I *am* convinced about is
> that I want both a finer granularity of control than "one process, one pool"
> and also I want to be able to share a pool among threads. This means that
> the pool abstraction needs to be first class independent of the process
> abstraction.

What you call a "pool" seems to be what we so far call a "container".  Do you
know if this is correct?  If it is, I fully agree with you that containers and
processes are orthogonal, one process can hold several containers, and one
container can be held by several processes.

> > > It's a granularity issue. You want the granularity of residency to be
> > > independent of the granularity of either process or address space.
> > 
> > I want each process (the collective threads of an address space) to decide
> > for itself which pages it wants in memory.
> We have a term collision. Remember that EROS doesn't have threads. The way
> you set up conventional multithreading is to have multiple processes that
> share an endpoint and an address space.

Ok, so when you say process you mean what I call a thread.  I usually don't
mind adapting to your word choice, but in this case that would remove my
"process" from the vocabulary, which is a collection of threads (possibly just
one) which share an address space (and have some communication).

> But come to think of it, you also want sharing. You want to be able to have
> two codecs in your video player. Both rely on libc, but you don't want
> different parts of the same application sponsoring libc redundantly.

In fact, I don't mind at all if they do.  The (soft) quota of all processes will
be adjusted anyway, so the memory will eventually be used by someone.  Given
that we have solved the problem of sharing between processes, I think we have
also solved sharing within a process.

> As long as we are talking about behavior within an application, this is all
> very doable without any management problems.

It is, but there's no need for it.  The application can just rely on the
system to do sharing well.

> >   If processes on the system will share pages, then after filling all the
> >   quota there are still some pages available.  So we increase the quota
> >   (using the usual priority scheme, so if A shares a page with B, it is
> >   very well possible that neither A nor B gets the extra quota that
> >   results from it).
> You cannot do this. What will happen is that the processes will expand to
> *use* this quota, and at some point somebody will discover that the
> guarantee was a lie. Think about it.

I did think about it, and it does work. ;-)  Here's a list of operations which
a process should be able to perform.  The operations are on some kernel
object, which keeps a list of the pages the process holds.  This list is of
some fixed size (but different for different processes), which is the maximum
number of pages the process can map into memory.  Most probably the
implementation will exploit its sparseness (usually only the bottom of the
list is populated with "real" pages), but that's an implementation detail.
There is a "quota", which is the current number of pages which are resident
in memory.  This is a variable number.  There should probably be a
notification when it changes.

A - Get the total number of pages of the list.
B - Get the current quota.
C - Swap two pages.  This instantly decreases the quota by 1 if exactly one of
    the swapped pages was resident, and the quota was higher than the
    guaranteed quota (for real-time tasks).
D - Move a page to a different position, shifting the rest of the list.  Of
    course this can be implemented with swapping only, but I think this would
    be a useful direct operation.  As before, the quota immediately decreases
    by 1 if the page moves from swap into memory.
E - Mark a page.  Current markings I was thinking of are "cache", meaning it
    can be discarded instead of swapped out, and "permanent", described above.
F - Map a page into memory.  This can only be done for pages within the quota.
    I'm not sure if this operation is needed at all.  I know it would be
    needed in L4, because it isn't possible to map a page into an address
    space without the receiver performing an IPC (receive).

Because C and D immediately decrease the quota by 1 as part of the contract,
there is no lie.  Processes know: I have 100 pages, and my current quota is
60.  If I want to look at a page from swap, my quota will be 59, at least for
a moment.  Usually, the process will get the page back again the next time the
quotas are recalculated (every few seconds), but in case it was a shared page,
it will simply be lost (because exchanging a shared page for a non-shared one
really decreases the number of available pages by one).
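The contract can be made concrete with a toy model of the page-list object, covering operations A through D (E and F are omitted; all names are invented, and slots [0, quota) stand for the resident part of the list):

```python
class PageList:
    """Toy model of the per-process kernel object described above."""

    def __init__(self, size, quota, guaranteed=0):
        self.pages = ["P[%d]" % i for i in range(size)]  # fixed-size list
        self.quota = quota              # current resident count (variable)
        self.guaranteed = guaranteed    # hard quota for real-time tasks

    def total(self):                    # A - total number of slots
        return len(self.pages)

    def current_quota(self):            # B - the current quota
        return self.quota

    def swap(self, i, j):               # C - exchange two slots
        # Crossing the residency boundary costs one quota immediately,
        # so the guarantee is never a lie.
        crosses = (i < self.quota) != (j < self.quota)
        self.pages[i], self.pages[j] = self.pages[j], self.pages[i]
        if crosses and self.quota > self.guaranteed:
            self.quota -= 1

    def move(self, src, dst):           # D - move a slot, shifting the rest
        page = self.pages.pop(src)
        self.pages.insert(dst, page)
        if src >= self.quota > dst and self.quota > self.guaranteed:
            self.quota -= 1             # page moved from swap into memory
```

Running the example from the start of this mail: with 10 pages and a quota of 7, exchanging resident P[6] for swapped-out P[8] drops the quota to 6 at once, while swapping two resident pages costs nothing.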

> Yes, if the two processes are acting as two threads of a single application.
> This is the thread/process term collision again.

Ah, you say "application" where I say "process".  I use application for bigger
things, which may consist of several processes (for example a shell script can
be an application, although it starts many processes which don't share an
address space).

> > > > You said before that the list management would need to be in the
> > > > kernel if it would be implemented.  I now agree with that. :-)
> > > 
> > > Good. Now tackle the *rest* of the challenge problem: how to manage this
> > > list sensibly in constant per-frame storage in the face of sharing.
> > 
> > What exactly do you mean with "constant per-frame storage"?  That a shared
> > page has only one mapping to disk?  I was always assuming that.
> Neither. I meant that you need to find a way to (a) record all sponsors of a
> frame, but (b) accomplish this MxN relationship (M sponsors, N frames) in
> O(N) storage.

What's the problem with O(MxN) storage?  I am assuming that every user pays
for the storage, no matter whether it is shared.  While this may seem a bit
wasteful, I think it involves very small amounts of storage, which makes this
a non-issue.

Is this the "reverse fan-out" problem you described above?
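To make the per-user-pays accounting concrete, here is a sketch of multiple sponsorship where every sponsor pays for its own entry, i.e. O(MxN) bookkeeping in total (names are assumptions; a real kernel would use constant-size entries on an intrusive list rather than Python sets):

```python
class Frame:
    """A physical frame pinned while at least one sponsor remains."""

    def __init__(self):
        self.sponsors = set()   # one entry per sponsoring frame pool

    def sponsor(self, who):
        self.sponsors.add(who)

    def unsponsor(self, who):
        """A sponsor can evict the page from its own quota, but cannot
        force it out of memory while another task still sponsors it,
        so there is no denial of service to the other sharers."""
        self.sponsors.discard(who)
        return self.pinned()

    def pinned(self):
        return bool(self.sponsors)
```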

