[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
array operations
From: |
Daniel Llorens |
Subject: |
array operations |
Date: |
Sat, 9 Mar 2013 00:42:43 +0100 |
On Mar 4, 2013, at 03:27, Noah Lavine wrote:
> (I'm snipping the rest of your message because it needs more thought than I
> can give it right now.)
That message contained a number of errors, the consequence of talking before
walking, I suppose.
I have added a simple reduction operator to the library, enough to define an
inner product operation:
(define* (dot + * A B #:key type)
"dot + * A B
Inner product between the last axis of A and the first of B."
(let ((type (or type (array-type* A))))
(ply/type type
(w/rank (Jv (cut folda/type type (lambda (c a b) (+ c (* a b))) 0
<> <>) '_ '_)
1 '_)
A B)))
There are three loops in here:
---the outer loop (ply) over the axes of A, save the last (whence the arg 1 to
w/rank)
---the middle loop (folda) over the last of A and the first of B (the axes
being reduced)
---the inner loop (ply) over the axes of B, save the first —--this is because
folda applies its op (lambda (c a b) ...) as an array op.
If A & B are both rank 2, this is like 'ikj' in §1.1.11 of Golub & Van Loan.
For example:
(define a (i. 300 100))
(define b (i. 100 200))
(define af (array-copy 'f64 a))
(define bf (array-copy 'f64 b))
(define as (array-copy 's64 a))
(define bs (array-copy 's64 b))
> ,time (dot + * a b)
$19 = #
;; 4.836000s real time, 4.870000s run time. 0.350000s spent in GC.
> ,time (dot + * af bf)
$20 = #
;; 7.166000s real time, 7.210000s run time. 0.690000s spent in GC.
> ,time (dot + * as bs)
$21 = #
;; 10.218000s real time, 10.280000s run time. 0.240000s spent in GC.
So it's horribly slow, and the differences between types are very large. Still,
this shows that it's spending a significant fraction of the time in
arithmetic, or at least not in loop overhead (I think?)
The profiler output for the type #t version is
> ,profile (dot + * a b)
% cumulative self
time seconds seconds name
71.43 5.07 5.07 #<procedure 11d767ab0 at util/reduce.scm:87:47 (c a
b)>
7.14 1.01 0.51 map
7.14 0.51 0.51 =
7.14 0.51 0.51 %after-gc-thunk
7.14 0.51 0.51 #<procedure 11c52ff90 at util/ploy.scm:165:16 i>
0.00 7.10 0.00 #<procedure 1163083a0 at ice-9/top-repl.scm:66:5 ()>
0.00 7.10 0.00 array-index-map!
0.00 7.10 0.00 call-with-prompt
0.00 7.10 0.00 folda/type
0.00 7.10 0.00 apply-smob/1
0.00 7.10 0.00 catch
0.00 7.10 0.00 start-repl
0.00 7.10 0.00 run-repl
0.00 7.10 0.00 ply/type
0.00 7.10 0.00 statprof
0.00 7.10 0.00 #<procedure 1163085a0 at ice-9/top-repl.scm:31:6
(thunk)>
0.00 7.10 0.00 #<procedure 1189d1d20 at util/ploy.scm:121:5 i>
0.00 7.10 0.00 #<procedure 1189cf210 at statprof.scm:655:4 ()>
0.00 5.07 0.00 array-map!
0.00 1.52 0.00 nested-op-frames
0.00 0.51 0.00 nested-match-frame
0.00 0.51 0.00 array-map/frame!
0.00 0.51 0.00 length=?
0.00 0.51 0.00 make-shared-array
---
Sample count: 14
Total time: 7.1 seconds (0.22 seconds in GC)
The top entry is (lambda (c a b) ...) in the definition of dot above. The map
comes from the definition of folda/type (the middle loop).
(define (folda/type type op_ z . a)
(if (null? a)
z
(let ((op (if (verb? op_) op_ (make-verb op_ #f #f))))
(let ((end (tally (car a))))
(let loop ((i 0) (c z))
(if (< i end)
; @todo factor out frame matching from repeated application of ply
(loop (+ 1 i) (apply ply/type type op c (map (cut from_ <> i) a)))
c))))))
^
here -----------------------------------------------
There's no map in the inner loop since the op has rank 0, so it resolves to
array-map!.
------------
As a reference point,
> ,time (blas-dgemm a b 0. 'no 'no )
$28 = #
;; 0.009000s real time, 0.000000s run time. 0.000000s spent in GC.
> ,time (blas-dgemm af bf 0. 'no 'no )
$29 = #
;; 0.002000s real time, 0.000000s run time. 0.000000s spent in GC.
is a few 1000 times faster. It takes more than 3 times longer to convert from
scm to f64 than to multiply... lots of room for improvement.
------------
1) The variable rank/arity code is by necessity full of apply / map / rest
args. However, in practice, it will be used most of the time with rank 1 / 2 /
maybe 3. It's not acceptable to have this apply-map business going on in an
inner loop, but I don't want to write rank-specific or arity-specific variants
by hand. I could try to use syntax-rules' ... for the fixed rank/arity
functions and select with case-lambda, but that still requires that I write two
versions of nearly every function in the library. Is there a way to avoid this?
It occurred to me that the partial evaluator could help, but I'm not sure how.
Or maybe somebody has a meta-macro that takes the version using ... and
converts it to the variable rank/arity version or viceversa.
2) Iterating over an array is a pain. I had this issue before when mapping over
subarrays, so I wrote an array-map/frame!. If I write an array-for-each/frame
(let ((c z))
(apply array-for-each/frame f (lambda a (set! c (apply ply/type type op c
a))) a)
c)
at least this would remove the map-from in the scalar case. It's something...
still, carrying c outside it's a bit ugly.
3) is it normal that the arithmetic is so slow? the code above generates tons
of temporaries and has a ton of overhead and it still manages to spend most of
its time in the + * op.
Regards,
Daniel