The discussions around paralellisation let me to start
thinking about ways to improve this in APL, and I was
wondering if it would be possible to make a purely immutable
APL implementation. If so, it would, at least in theory,
bepossble to defer a lot of computations until a result is
actually requested. This would be a lot easier to parallelise.
Not quite sure what purely immutable implementation means. Is it
lazy
evaluation where the computation of the result is deferred until the
result is needed?
Yes, that is exactly what it means. Executing a catenation of two arrays will not create a new, larger arrays. Instead, it preserves the references to the old arrays and creates a special CatenatedArray that acts as a facade in front of them. Only if you actually read a value from this array will that value (and only that value) be computed. In the catenation case, that means simply picking the correct value from the proper underlying array.
I very much appreciate your focus on practical relevance. Over time
I
have seen some proposals onthe web that were promising theoretical
improvements but were easily debunked when looking at the details.
What we should aim at is the total execution time of a real world
APL
application.
Absolutely. My goal here is not to find algorithms that I can optimise, but rather to determine how real-world algorithms work with an immutable and lazy implementation.
+/⍳ is an example that was mentioned in the context of lazy
evaluation
and it is a very good example for what we should NOT aim at. The
good
news is that the time-complexity of +/⍳ ran be reduced from
O(N) down to O(1). Perfekt. Shall we optimize it? Certainly not. The
reason is that +/⍳ is simply a programming mistake. If a programmer
chooses to write +/⍳N instead of, for example, (-N-N×N)÷2, then
improving on that is only fixing obvious mistakes rather than
improving
APL. Already 9-year old Freddy Gauß knew how to compute +/⍳ so we
should require that an APL programmer knows it as well.
+/⍳ is only constant is space in my version. It's still O(n) in time. I'm not optimising for this case specifically, but the constant space usage comes from the fact that ⍳N never realises an actual array of N elements, but rather simply instantiates an IotaArray containing "virtual" cells.
By coincidence, I was myself thinking about optimizing +/⍳ some
years
ago (it is a rather low hanging fruit with an impressive speedup)
but I
then dropped it due to its lack of practical relevance.
And I agree with you fully. It's utterly pointless. I have considered similar optimisations coming out of the fact that my experiment uses a lazy evaulation. For example (and simplified), A+B creates a PlusArray(A, B) which will perform the individual additions when a cell is looked up. It's conceivable to create an optimisation that replaces PlusArray(A, PlusArray(B, C)) with: PlusArray(A, B+C), if B and C are scalars.
It is things like this that I would like to explore, because it's very likely that such optimisations will not be worth it unless the arrays are very large.
Also, x[N]←V is invalid code since arrays cannot be
modified.
I suppose that modifying variables is a rather common programming
style in APL. Not allowing it for the sake of simpler
parallelization can
easily fire back (increasing the total execution time or space).
True. This limitation could be eliminated with some clever refcounting so that it's possible to know if an underlying array is used in a different place. Either that, or one can collapse the operation stack prior to assignment. This is out of scope for my experiments however.
The reason I removed the mutability of arrays is because of this:
A←B+C
B[n]←X
(do something with A)
Since the lazy implementation hasn't evaluated the addition by the time the assignment happens, the content of A will be affected by the result of the assignment to B[n]. Completely eliminating modifications of arrays conveniently removes any possibility for this to happen.
The true challenge of parallelization is not to find faster
implementations
for specific constructs like +/⍳ but to understand the trade-offs
between
the benefits of parallel execution on the one hand and the
(primarily
run-time) cost for them on the other.
The reason I'm experimenting with this for the purpose of parallelisation is in order to more efficiently handle something like this:
foo bar A+B
In this case, this _expression_ will yield resulting array object that looks something like this:
FnApply {
name: foo
rightArg: FnApply {
name: bar
rightArg: PlusArray {
leftArg: A
rightArg: B }}}
Each time a value is read from this structure, the operation foo(bar(plus(A,B))) is performed. In other words, I'm not doing three different parallelisations, but rather a single one, where each thread can perform all the operations in one go. This drastically reduces the amount of synchronisation needed, which I am hypothesising is a contributor to poor multi-core utilisation.
Also, here is another reason to avoid mutability: If user-defined functions can be evaluated in parallel, allowing modification of data would be quite dangerous.
Of course, I have no idea if any of this is actually true. And even if it is, will it make enough of a difference for real-world code? That's why I need real examples rather than my simple test code.
In the +/⍳ example this is easy (although useless) since +/⍳ can be
detected
statically (at ⎕FX time). However lazy evaluation in the general
case has
far higher run-time cost than +/⍳ so it remains to be seen if those
run-
time cost can ever be be amortized in a real-world application. Many
of
the examples brought up by proponents of lazy evaluation have the
following design pattern:
It does indeed have higher run-time cost. My code is written in Kotlin and can both run on the JVM and compile to native code. However, the native code generation in Kotlin is terrible and not usable for a real-world test. The JVM version runs pretty well, but since it and GNU APL are such different things it's very hard to draw any conclusions at all.
For the sake of completeness: I've basically only timed +/⍳N for very large N's. The execution time for the two versions were roughly the same (GNU APL was a few percent faster?) but the test case itself is heavily skewed against GNU APL, since my prototype code doesn't do any allocations at all, while GNU APL needs to fill and update a lot of memory, which will thrash the cache.
Regards,
Elias