Hi Andrew,
let me try to answer some of your questions inline
below...
On 2/7/20 6:35 PM, Andrew
wrote:
Good evening
This is my first post to this mailing
list. It is mainly some questions, not a bug
report, so I hope it is appropriate to post it here.
Apologies if not. (And apologies also for a rather
long and rambling e-mail!)
No problem, you found the right list.
I recently learned of Gnu APL and, having
had some experience of APL on IBM mainframes in the
1980s, I was curious to know how it would work on a
couple of my computers, and to use it to compare
performance of two virtualised and emulated
environments.
Firstly, I installed it on Ubuntu 18.04.3
running under VMWare Fusion on a 2.3GHz 8-core Intel
i9. This is the latest SVN version, built
using CORE_COUNT_WANTED=syl on ./configure (not make
parallel, which gave me a problem with autoconf). I
then used ⎕syl[26;2] to set the number of cores.
Using ⎕ai to obtain the compute time, I
tried using 1 and 4 cores for brute force prime number
counting, using this
expression: r←⍴(1=+⌿0=r∘.∣r)/r←1↓⍳n
⎕AI is rather imprecise, even worse than ⎕TS. For
performance measurements on Intel
CPUs you should use ⎕FIO ¯1 (return CPU cycle counter) and
maybe ⎕FIO ¯2 (return CPU frequency).
⎕FIO ¯1 is the most precise timing source that you can get
in GNU APL.
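For example, a cycle-counter timing of the expression from the
original post might look like this (just a sketch; it assumes
⎕FIO ¯2 returns the CPU frequency in Hz, so the last line
converts elapsed cycles to seconds):

    n←10000
    T0←⎕FIO ¯1                    ⍝ cycle counter before
    r←⍴(1=+⌿0=r∘.∣r)/r←1↓⍳n       ⍝ expression under test
    T1←⎕FIO ¯1                    ⍝ cycle counter after
    (T1-T0)÷⎕FIO ¯2               ⍝ cycles ÷ frequency ≈ seconds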
Although I could see, on the system
monitor, that 4 cores were being used, the execution
time with n=10000 actually took longer for the 4 core
case, typically 15-20% more time than the 1 core case.
The expression above that you benchmarked is a mix of
parallelized and non-parallelized APL
primitives. Each of them is subject to varying execution
times, so it is difficult to tell whether the increased
execution time is caused by the parallel execution or by
the variation in execution times that occurs anyway.
However, I then tried it in a very
different environment: Ubuntu 18.04.3 again, but
running in an emulated IBM S/390 mainframe (using the
Hercules S/370 emulator running in Ubuntu in VMWare on
a 3.5 GHz 6-core Xeon). For n=5000, this gave the
opposite result: the 4 core case was approx. 45%
quicker.
In my experience, using all cores of a CPU is not optimal
because external events from the OS (interrupts
etc.) slow down one of the cores used for APL, so that the
CPU(s) hit by external events increase the
execution time of each primitive. If you leave one core
unused (and if you are lucky), then the scheduler
of the OS will see which cores are busy (executing APL)
and will direct those events to the unused core.
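For example, on the 8-core i9 from the original post one might
reserve one core for the OS (assuming ⎕SYL[26;2] sets the core
count, as in the original post):

    ⎕SYL[26;2]←7    ⍝ use 7 of 8 cores, leaving one for OS interrupts etc.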
I also rather doubt that a virtual or emulated environment
can tell you anything reliable about parallelized APL.
By the way, there is a benchmarking workspace,
Scalar3.apl,
shipped with GNU APL that makes benchmarking of parallel
GNU APL easier. An Intel i9 is a good platform for running
that workspace, but
avoid any virtualization and
./configure
it properly.
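Assuming the workspace is installed where your )LOAD path can
find it (this depends on your installation), it can be loaded
with something like:

    )LOAD Scalar3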
Directly comparing these two environments
(one “simply” virtualized, the other emulated and
virtualized) is not meaningful. It is to be expected
that the emulated one will be very substantially
slower. The more interesting point is, perhaps, that
on the i9, using more cores actually slows it down
whereas, in the emulated environment, which is
effectively a *much* slower processor, using multiple
cores does yield a modest speed-up.
The speedups that can be achieved are generally
disappointing. I have also compared an Intel i7 with an Intel
i9.
It seems that at the same CPU frequency and with the same
core count, the i9 is substantially faster
than the i7, but at the same time the i7 benefits more from
parallelization than the i9. Most likely the
CPU optimizations in the i9 (compared to the i7) aim at the
same kind of parallelism, so that improvements
of one aspect (CPU architecture) are made at the expense
of the other aspect (APL parallelization).
I am not sure which components of the
expression (if any) would be parallelized by Gnu APL.
So my questions are:
1. Is it plausible that, on a reasonably
modern CPU (the i9), using multiple cores would slow
down execution of this expression?
Could very well be. The expression has a rather small
amount of parallelization since the majority of
its primitives are not parallelized.
2. Which of the operators in the
expression above would Gnu APL actually parallelize?
Currently all scalar functions, and inner and outer
products of them. One can prove that, in theory and given
the GNU APL implementation, these must have a linear
speedup (linear in the
number of cores). That is, on an i9 a scalar function on 4
cores should be 4 times faster than on one
core. In real life it is only 1.5 or so times faster. This
points to a hardware bottleneck between the cores
and the memory. The scalar functions are so lightweight
that the memory accesses (fetching the operands
and storing the results) dominate the entire execution
time.
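A minimal sketch of such a measurement (a hypothetical
example: it times the scalar function + on large operands with
⎕FIO ¯1, switching the core count via ⎕SYL[26;2] as in the
original post):

    A←B←1000000⍴1.5          ⍝ large operands, so per-element cost dominates
    ⎕SYL[26;2]←1 ⋄ T0←⎕FIO ¯1 ⋄ Z←A+B ⋄ T1←⎕FIO ¯1   ⍝ 1 core
    ⎕SYL[26;2]←4 ⋄ T2←⎕FIO ¯1 ⋄ Z←A+B ⋄ T3←⎕FIO ¯1   ⍝ 4 cores
    (T1-T0)÷T3-T2            ⍝ speedup; 4 in theory, ≈1.5 in practice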
3. Are there any configuration changes
that I could make to adjust the way in which
parallelization is done?
If by configuration you mean ./configure options, then no.
However, some ./configure options hurt
performance, for both parallel and non-parallel
execution; those options should be switched off.
See README-2-configure for details.
One other comment:
Before I realised that the svn version is
more recent, I used the apl-1.8.tar.gz version
of the code that is available on the Gnu mirror. This
seems to have a minor error in Parallel.hh: two
occurrences of & in the definition of
PRINT_LOCKED, which cause a compilation error. They
appear to have been removed in the svn version.
Yes. In the early days of GNU APL I updated the
apl-1.X.tar.gz files after every bug fix. I
was then told
by the GNU project that this would mess up their mirrors,
so I stopped doing that. Therefore, problems in
1.8 will only be fixed in 1.9, typically 1-2 years later.
Any comments or answers would be
appreciated. Thank you for taking the time to read my
e-mail.
You're welcome,
Jürgen
Andrew