Re: [Bug-apl] Prime Performance and others

I think this is a big issue. It would be great if Juergen could get involved with your project!

Blake McBride

On Mon, Aug 31, 2015 at 3:19 AM, Elias Mårtenson <address@hidden> wrote:

I finally ran the full test case again, and I will concentrate on a piece of data:

The constructor IntCell::IntCell(long) alone is responsible for 19.65% CPU time. What's most revealing is that it's being called 6210421276 times. For those who don't like counting numbers, that's over 6 billion times.

We also have the pair IntCell::get_cell_type() and IntCell::get_int_value() being called the exact same number of times.

My takeaway from that is that we have an excessive number of array duplications, creating arrays that are just traversed a single time, and then being dropped.

Anyone interested in the details concerning my original experimentation with this can search back in the mailing list archives, but here's what happens:

Let's take a simple _expression_:

foo ← 1+2×⍳N

This performs the following operations internally:
Allocate array of size N and fill it with the sequence 1, 2, 3, …
Create a new array of size N and copy the values from the previous point
Multiply all values by 2
Create a new array of size N and copy the values from the previous point
Add 1 to all values
Create a new array of size N and copy…
Store this array in variable foo
My original work attempted to mark arrays as "mutable". For example, the value "⍳N" would be mutable, since they are safe to modify in-place, while the result of "foo" would not be, since they should not be changed in-place. Thus, 1+⍳N should be able to create an array and update it, while 1+foo should copy the array before the addition. The implementation was similar in concept to your vanilla copy-on-write algorithm.

My solution worked for simple cases, but had some very serious bugs that caused arrays to be changed that shouldn't be changed, which is why it was never integrated. However, for the cases where it worked, the performance improvement was massive.

I have no current plans to resume this, but it would be interesting to discuss it if someone else wants to play around with it.

Regards,
Elias

On 31 August 2015 at 00:36, Mike Duvos <address@hidden> wrote:
Hi Elias,

It appears I screwed up my previous benchmark of SIEVE. I apparently forgot I had the XTerm on my desktop logged into an instance of Ubuntu Server LTS on Amazon Web Services, and ended up comparing APL2 on my desktop with GNU APL on a much faster processor.

I've built SVN 669 on my Dell Dimension 2400, an older slower 32 bit machine,
and now get the following results on that box for SIEVE 100000. 4.219 seconds for APL2, and 812.4 seconds for GNU APL.

So GNU APL is about 192.557 times as slow. There's a bit of variability in the time for GNU APL. If I put my ear to the box, I can hear the disk being flogged on, and Task Manager shows the memory bouncing around between 8 and 10 meg rapidly, so it's not necessarily true that the processor time is equal to the wall clock time.

I'm not sure how much time you'll get back by reducing redundant array copying. Any commercial APL will special-case "reshape", "and", and :"index of" on packed boolean vectors, and this is always going to beat GNU APL's object-oriented approach of boxing everything and making no assumptions about the types of values.

Looking back over the bug list, I see only two unresolved issues. Building shared libraries under Cygwin/Windows, and specification sometimes giving the wrong answer when you combine mixed structural primitives with bracket selection.

The shared library thing is important, because with AP210 not being completely implemented, Windows users currently have no way to get ASCII files in and out of their workspace.

Regards,

Mike

On Sun, Aug 30, 2015 at 4:20 AM, Elias Mårtenson <address@hidden> wrote:
All right, I have re-run the Sieve benchmark on GNU APL as of svn id 669.

I do see differences, but they are not great. I also ran it on a reduced case in order to have it finish quicker, so the breakdown could be incorrect. I will re-run this on the full (SIEVE 100000) test case and report back.

I can still see Clone taking ridiculous amounts of time though.

Regards,
Elias

On 29 August 2015 at 11:42, Elias Mårtenson <address@hidden> wrote:
I never analysed the Sieve benchmark. I'll take a look at it tonight.

Regards,
Elias

On 29 Aug 2015 06:51, "Mike Duvos" <address@hidden> wrote:
Hi Fred,

I just built SVN 664 on the Dell Dimension I did the original factorial benchmark on. The time went from 52 seconds to 23. So that is indeed a dramatic improvement.

I wonder why my Prime Sieve didn't speed up much.

Regards,

Mike

On Fri, Aug 28, 2015 at 11:28 AM, Fred Weigel <address@hidden> wrote:
Juergen

I just built 664 -- the performance of the factorial 300 that Mike Duvos
posted has indeed improved enormously - now running 8 seconds instead of
20 on my reference platform.

In line with expectations from Elias' profiling results.

Very well done!

FredW

From:	Blake McBride
Subject:	Re: [Bug-apl] Prime Performance and others
Date:	Mon, 31 Aug 2015 09:03:37 -0500