discuss-gnustep

Re: objective-c: how slow ?


From: Marko Mikulicic
Subject: Re: objective-c: how slow ?
Date: Sat, 08 Sep 2001 03:48:06 -0400
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.3) Gecko/20010801

Malmberg wrote:
[snip]

An alternative could be to place the full IC
code inline, with the first instruction being a jump to the patching routine, which
will restore the first instruction of the IC and update the values known only at
runtime.


There are threading issues with that. A full jump is 5 bytes, which is
more than can be patched atomically (or are the 80 bit FP stores
atomic?).

FST (64-bit) should be atomic (at least on a single processor; I don't know whether Intel processors on SMP lock the bus for a full quadword write, and there are cache coherency issues).

You could first point the jump at a routine that makes the calling thread yield (the routine must be generated temporarily, because the return address would be unknown).
 If another thread is already executing the stub code when it is stopped, then
atomically modify the last call instruction of the stub to call a generated routine that sets a variable and then calls the old function, and spin on that variable. That's tricky and not worth the effort (especially considering the page fault, see later), but this noncooperative locking can do the job when updates are infrequent.



No you can't, since at some point you have to put the return address on
the stack. Manually pushing the return address costs (for whatever
reason) ~15 cycles, so that's not an option. Calling a local stub will
be no more efficient than calling the outline stub.

I don't understand. You inline a call, and this call will push the ip. The
call will return and you go straight off.


Anyway, you would be writing in the text segment, which would most
probably result in a copy-on-write page fault.


That will be a problem with any IC. A second level of indirection would
help a lot, but would probably cost ~0.5 cycles.

Yes, but you have control over where the ICs are. If the patcher is in the sarray,
you never know who will be patched (potential callers can be anywhere in the code).


 Your solution is easier to implement because the compiler itself need not
be changed, but it can cause serious performance problems that are hard to
predict because they depend on the execution profile of the program.


I don't think they'd be hard to predict. After calling a runtime
function, every call to those method implementations causes the caller
to be patched (memory[return_address-0x4]=selector->stub_adr).

Yes, but you don't know who the callers are. IC optimization is caller based, not target based. Some callers can benefit from it; others can only lose. Imagine you have two arrays, one with 90% objects of class A and the other with only a couple of them. Somewhere in the code a pass through the second array is done, calling some method of A, which patches the code on the assumption that it wants to be optimized for A, but that is wrong. The coder expected the first array to contain mostly A objects, so he instructed the runtime to generate an IC for calls in
 some places and not throughout the system.
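
To make the two-array scenario concrete, here is a minimal sketch in C. It models an inline cache as a per-call-site predicted class plus hit and miss counters; all names (inline_cache_t, ic_send, walk) are made up for illustration and do not come from any runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: two classes, and an inline cache owned by ONE call site. */
typedef enum { CLS_A, CLS_B } cls_t;

typedef struct {
    cls_t    predicted;   /* class this call site was patched for */
    unsigned hits;        /* receiver matched the prediction */
    unsigned misses;      /* receiver did not match: slow path taken */
} inline_cache_t;

/* One message send through the call site's cache. */
void ic_send(inline_cache_t *ic, cls_t receiver)
{
    if (receiver == ic->predicted)
        ic->hits++;
    else
        ic->misses++;
}

/* A loop over an array is one call site, so it uses one cache. */
void walk(inline_cache_t *ic, const cls_t *objs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ic_send(ic, objs[i]);
}
```

With both call sites patched for A, the loop over an array holding nine A objects out of ten hits nine times, while the loop over an array holding two A objects out of ten misses eight times. The same target method is involved in both cases; only the caller differs.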

There'll
be a 4k page copy the first time for each page, but if you can't live
with that, you simply can't use selfmodifying code (or you could load
the entire program and its libraries non-shared, which would be
predictable but cost memory and load speed).

You proposed to have stubs in separate regions and JMP to them. This is perfect,
because a lot of stubs fit in 4KB. It's also thread safe, because you
can atomically change the address of a stub in the caller.
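
A rough C11 sketch of that idea follows. It is not real self-modifying code: an atomic function pointer stands in for the address operand of the caller's call instruction, and every name (call_site_t, stub_for_A, patch_call_site, and so on) is hypothetical. It only shows that repointing a call site is a single word-sized store.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical class ids and a simplified stub signature. */
enum { CLASS_A = 1, CLASS_B = 2 };
typedef int (*stub_t)(int cls, int arg);

/* Stands in for the address operand of one call instruction in the caller. */
typedef struct { _Atomic(stub_t) target; } call_site_t;

int method_of_A(int arg) { return arg + 1; }
int method_of_B(int arg) { return arg * 10; }

/* Slow path: stands in for the full runtime lookup. */
int slow_dispatch(int cls, int arg)
{
    return cls == CLASS_A ? method_of_A(arg) : method_of_B(arg);
}

/* Stub kept in a separate region: checks the class, falls back on a miss. */
int stub_for_A(int cls, int arg)
{
    return cls == CLASS_A ? method_of_A(arg) : slow_dispatch(cls, arg);
}

/* Patching is one atomic store; other threads see the old or the new stub. */
void patch_call_site(call_site_t *site, stub_t stub)
{
    atomic_store(&site->target, stub);
}

/* The call itself: load the current stub address and jump through it. */
int call_through(call_site_t *site, int cls, int arg)
{
    stub_t stub = atomic_load(&site->target);
    return stub(cls, arg);
}
```

A thread calling through the site while another thread patches it runs either the old stub or the new one, and both give the correct result, so no lock is needed around the patch.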


[snip]

Doesn't forwarding and DO count as delegation?

Not really. Delegation uses the delegate to lookup the method
but forwards the original receiver.
(One example is better than a thousand words. see attachment)
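
The attachment is not reproduced here. As a stand-in, the distinction can be sketched in plain C, with hypothetical names throughout (object_t, send_forwarding, send_delegating): both sends find the method in the delegate, but only delegation passes the original receiver as self.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct object object_t;
typedef const char *(*describe_imp_t)(object_t *self);

/* Hypothetical object: one optional method and one optional delegate. */
struct object {
    const char    *name;
    describe_imp_t describe;   /* NULL if the object lacks the method */
    object_t      *delegate;   /* NULL if there is no delegate */
};

/* The method body reports whatever object it was given as self. */
const char *describe_impl(object_t *self) { return self->name; }

/* Forwarding: the delegate becomes the receiver of the call. */
const char *send_forwarding(object_t *receiver)
{
    if (receiver->describe)
        return receiver->describe(receiver);
    if (receiver->delegate && receiver->delegate->describe)
        return receiver->delegate->describe(receiver->delegate);
    return NULL;
}

/* Delegation: the method comes from the delegate, self stays the receiver. */
const char *send_delegating(object_t *receiver)
{
    if (receiver->describe)
        return receiver->describe(receiver);
    if (receiver->delegate && receiver->delegate->describe)
        return receiver->delegate->describe(receiver);
    return NULL;
}
```

For an object named "window" with no describe method and a delegate named "helper", send_forwarding yields "helper" and send_delegating yields "window".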


Forwarding doesn't do that (currently), but it would be trivial to add a
flag that tells NSInvocation to set self of the called method to the
object doing the forwarding. If that isn't fast enough, it would be
possible to add a objc_add_delegate_for_class(someClass,theDelegate)
function that added all methods in theDelegate to someClass if it didn't
already have a method for that selector. You could even do it for an
individual object.

You are right. ObjC can do that. Self implements inheritance (SI and MI) through
"delegates" (this term has a different meaning in the Self and ObjC worlds).
 Unimplemented method calls are forwarded to the delegate that responds to them.
If more than one delegate responds, an exception is raised. This is the reason Self lookup is so slow and why PICs are needed. Self was an experimental system; they wanted to see what could be done if some language features, turned down as expensive or mind-blowing in classical languages, were instead made efficient.

It's a bit dangerous, though, since the receiver
probably expects self to be of the correct type.

The framework is built accordingly. However, ivars should not be accessed
directly.


[snip]

I admit I don't know in detail the lookup mechanism in the objc runtime


In short, each selector gets an index, and each class has an array of
implementations for selectors (implemented as a two-level sparse array,
but that makes no difference). The lookup uses the index to get the
implementation from the array (and if there isn't one, it goes on to do
forwarding etc.).
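
As a rough illustration of that description, here is a minimal two-level sparse array in C. This is a sketch, not the GNU runtime's actual sarray code: the names (dtable_t, dtable_set, dtable_lookup), the bucket sizes, and the simplified method signature are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified method implementation type (hypothetical). */
typedef int (*imp_t)(int arg);

enum { BUCKET_BITS = 5, BUCKET_SIZE = 1 << BUCKET_BITS, NUM_BUCKETS = 8 };

typedef struct { imp_t slots[BUCKET_SIZE]; } bucket_t;

/* Shared all-NULL bucket, so unused index ranges need no storage of their own. */
static bucket_t empty_bucket;

/* One dispatch table per class. */
typedef struct { bucket_t *buckets[NUM_BUCKETS]; } dtable_t;

void dtable_init(dtable_t *t)
{
    for (size_t i = 0; i < NUM_BUCKETS; i++)
        t->buckets[i] = &empty_bucket;
}

/* Install imp under a selector index; returns 0 on success, -1 on failure. */
int dtable_set(dtable_t *t, unsigned sel_index, imp_t imp)
{
    unsigned hi = sel_index >> BUCKET_BITS;
    unsigned lo = sel_index & (BUCKET_SIZE - 1);
    if (hi >= NUM_BUCKETS)
        return -1;
    if (t->buckets[hi] == &empty_bucket) {
        /* First write into this range: give it a private bucket. */
        bucket_t *b = malloc(sizeof *b);
        if (b == NULL)
            return -1;
        for (size_t i = 0; i < BUCKET_SIZE; i++)
            b->slots[i] = NULL;
        t->buckets[hi] = b;
    }
    t->buckets[hi]->slots[lo] = imp;
    return 0;
}

/* Lookup is two indexed loads; NULL means "go on to forwarding etc.". */
imp_t dtable_lookup(const dtable_t *t, unsigned sel_index)
{
    unsigned hi = sel_index >> BUCKET_BITS;
    if (hi >= NUM_BUCKETS)
        return NULL;
    return t->buckets[hi]->slots[sel_index & (BUCKET_SIZE - 1)];
}

/* Sample method used to exercise the table. */
int twice(int arg) { return 2 * arg; }
```

After dtable_init, every lookup returns NULL. Once dtable_set(&t, 70, twice) succeeds, dtable_lookup(&t, 70) returns twice, while neighbouring indexes in the same bucket still return NULL.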

It is as I thought.
What I did not expect is that selector maps are per class, requiring
an inheritance graph walk for every miss.
This also slows down delegates (forwardInvocation:), which must be called after a full cache miss (all the way up to the root class). Is this true?

The speed of this sarray implementation leaves me breathless.

However, does anyone have benchmarks on PPC, Alpha, SPARC, or other architectures?
It seems to me that the balance between lookup and overall method call
on Intel is kept by the relatively slow function call machinery (SPARC register windows speed up parameter passing, and the Alpha's return address register benefits leaf functions).

(I will test on alpha and sparc as soon as I get home from work (a couple of 
weeks))

Marko



