Fusion causes work duplication
```haskell
work_dup :: Acc (Vector Int) -> Acc (Vector Int)
work_dup xs = A.generate (index1 (A.size xs * 4)) g
  where
    g ix = ys A.! index1 (unindex1 ix `div` 4)
    ys   = A.map f xs
    f x  = A.iterate x (\i -> i * i) x  -- an operation with high cost
```
After fusion, the above becomes this:
```haskell
\a0 -> generate
         (Z :. 4 * shapeSize (shape a0))
         (\x0 -> let x1 = a0 ! (Z :. div (indexHead x0, 4))
                 in  #0 (while (\x2 -> #1 x2 <* x1)
                               (\x2 -> ( 1 + #1 x2
                                       , let x3 = #0 x2
                                         in  x3 * x3 ))
                               (0, x1)))
```
By fusing the map into the generate here, f is executed 4 times as often as it would be without fusion. In general, I'm not sure we should be fusing producers into generates, as there is no guarantee that the array being indexed is not read at the same index multiple times.
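To make the cost concrete, here is a plain-Haskell model of the situation (not Accelerate code; the counter and names are illustrative) that counts how many times the expensive function runs in the unfused and fused versions:

```haskell
import Data.IORef

main :: IO ()
main = do
  let n = 3                                -- input size; the output has 4*n elements
  calls <- newIORef (0 :: Int)
  let f x = do modifyIORef' calls (+1)     -- stand-in for the expensive 'f'
               pure (x * x)

  -- Unfused: materialise ys = map f xs once, then index it 4*n times.
  writeIORef calls 0
  ys <- mapM f [0 .. n - 1]
  let _out1 = [ ys !! (i `div` 4) | i <- [0 .. 4*n - 1] ]
  unfused <- readIORef calls

  -- Fused: 'f' is re-run at every output index, even though each
  -- input element is read four times.
  writeIORef calls 0
  _out2 <- mapM (\i -> f (i `div` 4)) [0 .. 4*n - 1]
  fused <- readIORef calls

  putStrLn ("unfused calls: " ++ show unfused)   -- n   = 3
  putStrLn ("fused calls: "   ++ show fused)     -- 4*n = 12
```

With an input of size n and an expansion factor of 4, the fused form performs 4n evaluations of f instead of n, which is exactly the factor observed in the n-body example below.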
I'll look into it after ICFP, but...
- In a sense the work is duplicated, yes, but the additional work is done by different threads, so no.
- The same situation occurs with replicate, and that case is arguably more transparent.
- In general, there is no cost analysis of when it is preferable to (a) stop fusion; or (b) duplicate simple functions.
The reason this problem cropped up is that this is what happens with the CUDA-sample-based n-body example. Each body has its force calculations executed 4 times. For large numbers of bodies, this decreases performance by a factor of around 2.5. While it's easy enough to fix in this instance, by separating out the computations so that they don't fuse, I think we should at least give the user an easier way to force an array to be manifest.
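As a sketch of what such a facility could look like: later versions of Accelerate expose `A.compute`, which forces its argument to be manifest and thereby acts as a fusion barrier. That function postdates this discussion, so treat the snippet below as illustrative of the idea rather than the fix that was applied here:

```haskell
-- Sketch only (assumes the later A.compute :: Arrays a => Acc a -> Acc a):
-- pinning 'ys' as a manifest array means 'f' runs once per input
-- element rather than once per output element.
work_shared :: Acc (Vector Int) -> Acc (Vector Int)
work_shared xs = A.generate (index1 (A.size xs * 4)) g
  where
    g ix = ys A.! index1 (unindex1 ix `div` 4)
    ys   = A.compute (A.map f xs)       -- fusion barrier: ys is materialised
    f x  = A.iterate x (\i -> i * i) x
```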
Backpermute and lifted indexing (which is structurally the same thing) are tricky to handle in fusion as they can lead to arbitrary duplication of work. (In the extreme case, indexing could always retrieve the same array element, which is then computed as often as the size of the result vector.)
In Repa, we basically gave up trying to solve these situations automatically and left it to the programmer to be explicit about forcing arrays to be manifest. However, I think in Accelerate we can do better, because it is a restricted language. Apart from the new loops, Accelerate computations are strongly normalising (they always complete in finite time). Hence, we can perform much more sophisticated static analysis than would be possible for full Haskell.
I suspect that when we push nested parallelism further, such an analysis will become increasingly important. This is because in flat computations, it is reasonable to require a programmer to explicitly force arrays where necessary. However, in nested computations, a lot of additional code is generated by the compiler, and it becomes increasingly difficult (a) for a programmer to understand where to force arrays and (b) to provide a convenient notation to perform the right amount of forcing.
BTW, have we got any example, where the sort of fusion that tripped up @robeverest (i.e., fusing an indexed array production into a generate) is helpful? Maybe we just don't want to do this by default...
@mchakravarty Currently zipWith{3,4...} are written in terms of generate. I assume that's from a time when we didn't have fusion. Obviously, they can be rewritten in terms of map and zipWith.
The issue I see is with concatenation (++): it can only be written in terms of generate, and it really should be fusible.
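For reference, concatenation more or less has to be phrased as a single generate over the combined extent. A plain-Haskell, list-based model of that indexing scheme (names are illustrative, not Accelerate's implementation):

```haskell
-- A list-based model of a 'generate': build n elements from an index function.
generateModel :: Int -> (Int -> a) -> [a]
generateModel n g = map g [0 .. n - 1]

-- (++) expressed as one generate over the combined extent: indices below
-- m read from the first array, the rest read (shifted) from the second.
appendModel :: [a] -> [a] -> [a]
appendModel xs ys =
  let m = length xs
      n = length ys
  in generateModel (m + n) (\i -> if i < m then xs !! i else ys !! (i - m))

main :: IO ()
main = print (appendModel [1, 2, 3] [4, 5 :: Int])   -- [1,2,3,4,5]
```

Note that each input element is read exactly once here, so fusing producers into this particular generate cannot duplicate work; the trouble is that fusion has no general way to see that.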
@robeverest actually, the zipWithN family was originally written in terms of map and zipWith, and I changed it to the current setup. From (fuzzy) memory, it's easier on my poor simplifier because it doesn't have to look through lots of tuple/prj terms, and zipWith ultimately gets turned into generate anyway. I have a feeling that it should also make it easier to recover cases like unzip (because the term is simpler and consistent), but I never got around to experimenting with that.
Interestingly, the program I submitted with #148 runs ~30% worse when using pipe (five clusters), so in that case the duplication was beneficial. But that might not hold when running on a CPU, or with a higher number of clusters, etc...