Re: Parallel APL Questions

From: Dr. Jürgen Sauermann
Subject: Re: Parallel APL Questions
Date: Sat, 8 Feb 2020 20:10:39 +0100
User-agent: Mozilla/5.0 (X11; Linux i686; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

Hi Andrew,

thank you very much for your interest. See below...

On 2/8/20 6:17 PM, Andrew wrote:
Hello Jürgen

Thank you for the quick and most comprehensive reply.  Much appreciated.

I was not aware of ⎕FIO ¯1, but I shall try using it.  I assume that it will not work on the emulated S/390 (and possibly not on the ARM processor that I will also try at some point).

I did wonder whether the expression I used would result in a mix of parallelization.  However, I am sure that you are making a good point regarding memory accesses.  (I have done some work on using GPUs to accelerate number theoretic workloads; in that, memory latency is very definitely a big factor.)  It seems conceivable to me that attempting to parallelize a small part of an expression’s execution could actually result in less favourable cache utilization, the negative effect of which could outweigh any parallelization benefit, although I am not sufficiently familiar with Intel cache behaviour to be able to say that with much confidence.

⎕AI takes the entire session time and subtracts the times during which the interpreter is blocked
waiting for user input. This is just good enough for accounting purposes, but nothing more.

⎕TS is the current time, but its precision is limited to milliseconds (and it rounds to milliseconds).
It is OK for measuring timing intervals that are several seconds long, but not for sub-second
measurements. ⎕FIO ¯1 is CPU cycles, which is very precise as long as you don't fiddle with the
CPU frequency. ⎕FIO ¯2 is a rough estimate of the CPU frequency. It sleeps for 100 ms or so and sees
by how much ⎕FIO ¯1 has increased during the sleep.

Note that ⎕FIO ¯1 is not only the most precise one, but also the one with the lowest overhead.
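
The relationship between the two counters can be sketched in Python (used here only for illustration; the function name and the sample readings are hypothetical and not part of GNU APL). An interval measured in CPU cycles is converted to seconds by dividing the cycle difference by the CPU frequency, which is only meaningful if the frequency stays constant during the measurement.

```python
def cycles_to_seconds(start_cycles, end_cycles, cpu_hz):
    """Convert a cycle-counter difference into seconds.

    start_cycles, end_cycles: counter readings taken before and after the
    code being measured (the kind of value the mail says ⎕FIO ¯1 returns).
    cpu_hz: CPU frequency in Hz (the kind of estimate ⎕FIO ¯2 gives).
    """
    if cpu_hz <= 0:
        raise ValueError("cpu_hz must be positive")
    return (end_cycles - start_cycles) / cpu_hz


# Made-up readings: 4,600,000 cycles at an assumed 2.3 GHz clock.
elapsed = cycles_to_seconds(1_000_000, 5_600_000, 2.3e9)
print(f"{elapsed * 1e3:.3f} ms")
```

Because the frequency is itself only a rough estimate, cycle counts are probably best compared directly when both measurements come from the same machine.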

If you will forgive a few follow-on questions (I have now looked at your README files rather more carefully):

1.  Is there anything better than ⎕AI that you would recommend on non-Intel processors?  Your e-mail suggests that ⎕TS may be more accurate?

Yes. Use ⎕TS and make sure that your execution time is at least several seconds. I remember that we
used ⌹?10 10⍴10 in the 1970s to compare APL interpreters. It took around 10 seconds on a MicroAPL
APL 68000, but only a fraction of a second on an IBM 370 mainframe. So for MicroAPL ⎕TS was OK while
for the 370 it was not.
2.  README-8 refers to the parallel_threasholds file.  Is this file used in normal operation (i.e. if all I have done is specify CORE_COUNT_WANTED=syl) on ./configure?

The thresholds are always used and optimal values vary a lot between systems, even between
different CPUs of the same vendor.
3.  It contains a line: perfo_3(OPER2_OUTER,   _AB,  "A ∘.× B",198                  ).  Is this threshold of 198 used for all outer products, or just multiplication?  If only for multiplication, what threshold would be used for the modulus (stile) function?

The threshold is used for all primitive functions (outer products of defined functions are not parallelized).
The same applies to inner products. The number refers to the result size.
4.  You mention that some ./configure options should be switched off to minimize performance impacts.  Are any of these options not switched off by default?  (I looked at README-2 and could not see anything obvious, but thought it worth checking with you).  The only option I used was CORE_COUNT_WANTED.

Actually I don't quite remember. README-2 was written long ago and I keep forgetting things.
I suppose ASSERT_LEVEL, VALUE_HISTORY and dynamic logging have an impact. There is a
make target called parallel1 which does ./configure with all options set to get the best
performance with parallel execution. Just say:

make parallel1

(after having run ./configure at least once so that all Makefiles have been created).
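
As a rough illustration of what "the number refers to the result size" means for an outer product, here is a Python sketch. The function names are made up, and the comparison used (result size at or above the threshold) is an assumption, since the mail does not spell out the exact test that GNU APL applies.

```python
OUTER_PRODUCT_THRESHOLD = 198   # value from the perfo_3(OPER2_OUTER, ...) line quoted above


def outer_product_result_size(len_a, len_b):
    """Number of elements in A ∘.f B for vectors A and B."""
    return len_a * len_b


def exceeds_threshold(len_a, len_b, threshold=OUTER_PRODUCT_THRESHOLD):
    # Assumption: "result size >= threshold" stands in for the real test;
    # the mail does not say whether the comparison is strict.
    return outer_product_result_size(len_a, len_b) >= threshold


print(exceeds_threshold(10, 10))      # 100 result elements
print(exceeds_threshold(20, 20))      # 400 result elements
```

For two vectors of 9999 elements each, the result has 99,980,001 elements, far above a threshold of 198.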

Many thanks once again for your assistance.

You're welcome.



On 7 Feb 2020, at 19:25, Dr. Jürgen Sauermann <address@hidden> wrote:

Hi Andrew,

let me try to answer some of your questions inline below...

On 2/7/20 6:35 PM, Andrew wrote:
Good evening

This is my first post to this mailing list.  It is mainly some questions, not a bug report, so I hope it is appropriate to post it here.  Apologies if not.  (And apologies also for a rather long and rambling e-mail!)

No problem, you found the right list.
I recently learned of Gnu APL and, having had some experience of APL on IBM mainframes in the 1980s, I was curious to know how it would work on a couple of my computers, and to use it to compare performance of two virtualised and emulated environments.

Firstly, I installed it on Ubuntu 18.04.3 running under VMWare Fusion on a 2.3GHz 8-core Intel i9.  This is the latest SVN version, built using CORE_COUNT_WANTED=syl on ./configure (not make parallels, which gave me a problem with autoconf).  I then used ⎕syl[26;2] to set the number of cores.

Using ⎕ai to obtain the compute time, I tried using 1 and 4 cores for brute force prime number counting, using this expression: r←⍴(1=+⌿0=r∘.∣r)/r←1↓⍳n
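
For readers less familiar with APL, the benchmarked expression can be sketched in Python. This is an illustrative translation that assumes the default ⎕IO of 1; the function name is made up, and the loop only mirrors the logic, not GNU APL's implementation or its performance. It builds the divisibility table of the numbers 2..n and counts those with exactly one divisor in that range, namely themselves.

```python
def count_primes_brute_force(n):
    """Count primes <= n the way r←⍴(1=+⌿0=r∘.∣r)/r←1↓⍳n does."""
    r = list(range(2, n + 1))            # r←1↓⍳n  gives 2, 3, ..., n
    count = 0
    for candidate in r:                  # one column of the r∘.∣r table
        # 0=r∘.∣r marks every d in r that divides candidate; +⌿ sums the marks.
        divisors = sum(1 for d in r if candidate % d == 0)
        if divisors == 1:                # 1=... keeps numbers divisible only by themselves
            count += 1
    return count                         # ⍴(...)/r is the number of survivors


print(count_primes_brute_force(100))     # 25
```

The table has (n-1)² entries, so n=10000 means roughly 10^8 residue computations in the outer product alone.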

⎕AI is rather imprecise, even worse than ⎕TS. For performance measurements on Intel
CPUs you should use ⎕FIO ¯1 (return CPU cycle counter) and maybe ⎕FIO ¯2 (return CPU frequency).
⎕FIO ¯1 is the most precise timing source that you can get in GNU APL.
Although I could see, on the system monitor, that 4 cores were being used, the execution time with n=10000 actually took longer for the 4 core case, typically 15-20% more time than the 1 core case.

The expression above that you benchmarked is a mix of parallelized and non-parallelized APL
primitives. Each of them is subject to varying execution times, so it is difficult to tell whether the increased
execution time is caused by the parallel execution or by execution times that vary anyway.
However, I then tried it in a very different environment: Ubuntu 18.04.3 again, but running in an emulated IBM S/390 mainframe (using the Hercules S/370 emulator running in Ubuntu in VMWare on a 3.5 GHz 6-core Xeon).  For n=5000, this gave the opposite result: the 4 core case was approx. 45% quicker.

In my experience using all cores of a CPU is not optimal because external events from the OS (interrupts
etc.) slow down one of the cores used for APL, so that the CPU(s) hit by external events increase the
execution time of each primitive. If you leave one core unused (and if you are lucky), then the scheduler
of the OS will see which cores are busy (executing APL) and will direct those events to the unused core.

I also rather doubt that a virtual or emulated environment is able to tell anything about parallelized APL.
By the way, there is a benchmarking workspace Scalar3.apl shipped with GNU APL that makes benchmarking of parallel GNU APL easier. Intel I9 is a good platform for running that workspace, but
avoid any virtualization and ./configure it properly.

Directly comparing these two environments (one “simply” virtualized, the other emulated and virtualized) is not meaningful.  It is to be expected that the emulated one will be very substantially slower.  The more interesting point is, perhaps, that on the i9, using more cores actually slows it down whereas, in the emulated environment, which is effectively a *much* slower processor, using multiple cores does yield a modest speed-up.

The speedups that can be achieved are generally disappointing. I have also compared Intel I7 with Intel I9.
It seems that at the same CPU frequency and with the same core count, the I9 is substantially faster
than the I7, but at the same time the I7 benefits more from parallelization than the I9. Most likely the
CPU optimizations in the I9 (compared to the I7) aim at the same kind of parallelism, so that improvements
of one aspect (CPU architecture) are made at the expense of the other aspect (APL parallelization).

I am not sure which components of the expression (if any) would be parallelized by Gnu APL.  So my questions are:

1.  Is it plausible that, on a reasonably modern CPU (the i9), using multiple cores would slow down execution of this expression?
Could very well be. The expression has a rather small amount of parallelization since the majority of
its primitives are not parallelized.
2.  Which of the operators in the expression above would Gnu APL actually parallelize?
Currently all scalar functions and inner and outer products of them. One can prove that these are the ones
that, in theory and given the GNU APL implementation, must have a linear speedup (linear in the
number of cores). That is, on an I9 a scalar function on 4 cores must be 4 times faster than on one
core. In real life it is only 1.5 or so times faster. This points to a hardware bottleneck between the cores
and the memory. The scalar functions are so lightweight that the memory accesses (fetching the operands
and storing the results) dominate the entire execution time.
3.  Are there any configuration changes that I could make to adjust the way in which parallelization is done?

If you mean ./configure options by configurations then no. However some ./configure options have
performance impacts both for parallel and non-parallel execution. These should be switched off.
See README-2-configure for details.
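
The gap between the theoretical factor of 4 and the observed factor of about 1.5 mentioned in the answer to question 2 can be put into numbers with Amdahl's law. That law is not mentioned in this thread and is used here only as a simple model. The Python sketch below (function names are made up) treats a fraction p of the run time as scaling with the core count and the rest, for example time spent waiting for memory, as not scaling.

```python
def amdahl_speedup(p, cores):
    """Speedup when a fraction p of the work scales linearly with cores."""
    return 1.0 / ((1.0 - p) + p / cores)


def scaling_fraction(speedup, cores):
    """Invert Amdahl's law: the fraction p that explains an observed speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


# Figures from the mail: 4 cores, roughly 1.5 times faster.
p = scaling_fraction(1.5, 4)
print(f"scaling fraction: {p:.2f}")
print(f"check: {amdahl_speedup(p, 4):.2f}x on 4 cores")
```

Under this model, less than half of the execution time behaves as if it scaled with the cores, which would be consistent with memory accesses dominating the run time.
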
One other comment:

Before I realised that the svn version is more recent, I used the apl-1.8.tar.gz version of the code that is available on the Gnu mirror.  This seems to have a minor error in Parallel.hh: two occurrences of & in the definition of PRINT_LOCKED, which cause a compilation error.  They appear to have been removed in the svn version.

Yes. In the early days of GNU APL I updated the apl-1.X.tar.gz files after every bug fix. I was then told
by the GNU project that this would mess up their mirrors so I stopped doing that. Therefore problems in
1.8 will only be fixed in 1.9, typically 1-2 years later.
Any comments or answers would be appreciated.  Thank you for taking the time to read my e-mail.

You're welcome.
