discuss-gnustep

Re: objective-c: how slow ?


From: Marko Mikulicic
Subject: Re: objective-c: how slow ?
Date: Sat, 01 Sep 2001 16:56:50 -0400
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.3) Gecko/20010801

Nicola Pero wrote:
> Here is a rough thought about how we could implement an inline cache
> optimization for the Objective-C runtime, without using self-modifying
> code (or better, by using the C equivalent of self-modifying code):
>
> you would modify the compiler so that instead of compiling
>
>  [receiver message];
>
> into
>
>  {
>    IMP __imp = objc_msg_lookup(receiver, @selector(message));
>    __imp (receiver, @selector(message));
>  }
>
> it would compile it into
>
>  {
>    static Class __class = Nil;
>    static IMP __imp = NULL;
>
>    if (receiver->isa != __class)
>      {
>         __class = receiver->isa;
>         __imp = objc_msg_lookup(receiver, @selector(message));
>      }
>
>    __imp (receiver, @selector(message));
>  }
>
> it's a rough sketch, but I suppose something like this might actually
> work! :-)
(a nil receiver must be handled explicitly)
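With the nil check added, the call site might look like the following. This is a self-contained C sketch: the `Class`/`id`/`IMP` layouts, `lookup_imp`, and the counters are toy stand-ins for the real GNU runtime, kept only so the cache logic can be run and observed.

```c
#include <stddef.h>

/* Toy stand-ins for the runtime types; names are illustrative,
   not the real GNU runtime API. */
typedef struct objc_class { const char *name; } *Class;
typedef struct objc_object { Class isa; } *id;
typedef void (*IMP)(id);

static int lookups; /* full dispatcher invocations */
static int calls;   /* actual method invocations */

static void do_message(id self) { (void)self; calls++; }

/* Stand-in for objc_msg_lookup(): the slow, full dispatch. */
static IMP lookup_imp(id receiver)
{
  (void)receiver;
  lookups++;
  return do_message;
}

/* One static cache per call site, as in the proposal,
   with the nil receiver handled up front. */
static void send_message(id receiver)
{
  static Class cached_class = NULL;
  static IMP   cached_imp   = NULL;

  if (receiver == NULL)              /* messaging nil is a no-op */
    return;
  if (receiver->isa != cached_class) /* miss: refill the cache */
    {
      cached_class = receiver->isa;
      cached_imp   = lookup_imp(receiver);
    }
  cached_imp(receiver);              /* hit: direct call */
}
```

Calling `send_message` repeatedly on the same receiver pays for one full lookup only; every later call goes straight through the cached IMP.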


Actually, the Portable Object Compiler (POC) does something similar
(it is a preprocessor).

But this scheme will thrash when frequent cache misses occur
(interleaved classes in an array, for example). Polymorphic inline caches (PICs)
would perform far better, although a single call is a bit more expensive:

{
  /* Three-entry polymorphic inline cache kept in LRU order:
     entry 1 is the most recently used, entry 3 the least. */
  struct PIC_call_t {
    Class class1, class2, class3;
    IMP   imp1,   imp2,   imp3;
  };
  #define __SWAP(a, b) \
    do { typeof(a) __t = (a); (a) = (b); (b) = __t; } while (0)

  static struct PIC_call_t __picdata;

  /* (As above, a nil receiver must be handled explicitly first.) */
  if (receiver->isa == __picdata.class1)
    {
      /* Hit on the hottest entry: no reordering needed. */
      __picdata.imp1(receiver, @selector(message));
    }
  else if (receiver->isa == __picdata.class2)
    {
      /* Hit on entry 2: promote it to the front of the LRU queue. */
      __SWAP(__picdata.imp1, __picdata.imp2);
      __SWAP(__picdata.class1, __picdata.class2);
      __picdata.imp1(receiver, @selector(message));
    }
  else if (receiver->isa == __picdata.class3)
    {
      /* Hit on entry 3: promote it one step. */
      __SWAP(__picdata.imp2, __picdata.imp3);
      __SWAP(__picdata.class2, __picdata.class3);
      __picdata.imp2(receiver, @selector(message));
    }
  else
    {
      /* Miss: evict the least recently used entry (slot 3). */
      __picdata.imp3 = objc_msg_lookup(receiver, @selector(message));
      __picdata.class3 = receiver->isa;
      __picdata.imp3(receiver, @selector(message));
    }
}

Of course, hand-crafted assembler would result in smaller code
(for example, pushing arguments at the beginning, exploiting
branch delay slots where the architecture permits them, and avoiding stalls
and icache misses by prefetching the target IMP while maintaining the LRU
queue, ...).

There are other ways of maintaining an LRU queue; this too is only a sketch.
It is difficult to migrate a technique from a dynamically compiled language
to a statically compiled one: much of the information present in the Self
runtime system is not cheap to obtain in objc (for example, the profiling
information used by the inlining compiler).

>
> but only if you don't use multithreading.  In multithreading, it is not
> actually going to work, unless we add locks, but we can't.  Any idea how
> to make it work in multithreading ?

You could, for example, maintain profiling information
(just the count of method calls, on a per-call-site basis).
 If a thread is preempted just before writing the incremented value,
then one method call goes unaccounted for, but who cares: we are talking
about millions of method calls, and if nobody calls the method later it
would not be a good PIC candidate anyway.
 The PIC update could be done by a separate thread (the GC, for example,
while traversing stack frames for pointers to locals or parameters)
while freezing all other threads (this could be done incrementally
to minimize delays).
 But this assumes mechanisms that are part of the Smalltalk/Self runtimes
and not of objc, and adding them to objc might well leave it with the same
performance it has now.
 However, we can be conservative in determining which call sites should be
promoted to a PIC, and PIC updates can be deferred.
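The racy-but-harmless per-site counter can be sketched as follows; the `call_site` layout, field names, and threshold are hypothetical, not part of any existing runtime:

```c
/* Per-call-site profile record (hypothetical layout). */
struct call_site {
  unsigned long count;   /* plain, deliberately unlocked counter */
  void *promoted_pic;    /* filled in later by a maintenance pass */
};

/* Hot path: bump the counter without any lock.  A thread preempted
 * mid-increment may lose one count; with millions of calls that is
 * noise, and a site nobody revisits was a poor PIC candidate anyway. */
static void note_call(struct call_site *site)
{
  site->count++;        /* racy read-modify-write, tolerated by design */
}

/* Deferred pass (run, say, by the GC thread while other threads are
 * frozen): decide conservatively which sites deserve a PIC.  Being
 * conservative is safe -- a missed site just stays on the slow path. */
static int should_promote(const struct call_site *site,
                          unsigned long threshold)
{
  return site->count >= threshold;
}
```

The point is that correctness never depends on the counter being exact, so the hot path needs no lock at all; only the deferred promotion pass needs the world stopped.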

>
> I suppose we might add it the compiler and it could be turned on with a
> certain command line flag.  That flag wouldn't work with threads, so if
> you compile with the flag, you mustn't use threads.  Some sort of
> `thread-unsafe' optimization.

It could be useful.
You could also use this in a threaded environment, with coarse hand-crafted
locking and by linking with a category compiled under this flag, but this
would become dangerous and tricky in some cases.

>
> Btw, notice that a hand-made optimization done directly using IMPs in
> Objective-C is faster than an inline cache hit, so in (the extremely rare)
> places in which you might need a terrific messaging speed (which,
> reasonably speaking, *must* be in a tight loop (and so optimizable by
> IMPs) because unless you send some tons of millions of messages, who cares
> if it takes 30% more or less time to send a message), you would want to
> optimize directly using IMPs anyway.

I have this specific situation:

I have modified gstep-objc to better support custom attribute types
(money, custom calendar dates). The current code hardcodes classes to database
types, but the framework (EOF) lets the user choose a specific class for a given
attribute (with a compatible protocol, of course). My modifications require more
OO, and when a huge data load is involved, the overhead of transforming adapter
records into application objects becomes visible.
 The messaging is not done in a tight loop, but most of the methods are still
called on common targets.
I think this is an example of a situation where optimizing by hand is not
possible, because the framework interaction is not trivial.
 Even if the targets are temporary and spatially dislocated, today's icaches
are big enough to hold more than one level of the call graph, and you must
also watch the performance implications on architectures with different
argument-passing schemes (register windows: SPARC, Itanium, PA-RISC, and more
to come).
 x86 is an unhappy base for design: it can perform amazingly, yet it freezes
the development of new techniques, because the architecture imposes rules
which produce code shaped by those rules, which in turn justifies new design
assumptions. For example: we should not add more registers, because the
instruction encoding would grow larger, leaving less die space for the dcache
backing the stack, which performs quite well in place of registers when the
appropriate runtime hardware code transformations are applied (as in the PPro
core, which internally is much like a RISC machine). This does indeed perform
well, but only for old code, and new code is pushed to look like the old. The
situation is more relaxed than in the past thanks to compiler development, but
some assumptions still reign: function calls are slow, keep your call graph
as shallow as you can, ...

 OO code can do amazing things; it's a shame that its price is so high.
We often make a tradeoff between performance and flexibility; the problem
is that when we hardcode this tradeoff we lose future developments.
If the balance is struck at runtime instead, the system stabilises very
quickly and then performs as if coded by hand, adapting automatically when
its environment demands it.

I admit that in many cases manual optimization does very well, surely better
than PICs.
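For contrast, the hand-made IMP optimization Nicola mentions amounts to hoisting the lookup out of the loop. A self-contained C sketch, with `lookup_imp` and the toy one-method class standing in for `objc_msg_lookup` and a real class:

```c
/* Toy single-method "class"; names are illustrative only. */
typedef struct object { struct class_s *isa; } *id;
typedef long (*IMP)(id);
struct class_s { IMP method; };

static int lookups;   /* how many times the full dispatcher ran */

/* Stand-in for objc_msg_lookup(): a full dispatch. */
static IMP lookup_imp(id receiver)
{
  lookups++;
  return receiver->isa->method;
}

static long bump(id self) { (void)self; return 1; }

/* Tight loop with the lookup hoisted: one dispatch, n direct calls.
 * Only valid while the receiver's class cannot change mid-loop --
 * exactly the assumption that makes this a manual, not automatic,
 * optimization. */
static long sum_hoisted(id receiver, int n)
{
  IMP imp = lookup_imp(receiver);   /* hoisted once, before the loop */
  long total = 0;
  for (int i = 0; i < n; i++)
    total += imp(receiver);         /* direct call, no cache check */
  return total;
}
```

An inline-cache hit still pays the class comparison on every call; the hoisted IMP pays nothing per call, which is why it wins inside a tight loop but cannot help call sites scattered across a framework.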

Sorry for this huge post, I'll learn to type gzipped :-)


Marko



