
Performance issues

vladfi1 opened this issue on Feb 9, 2016 • 14 comments

I'm trying to use accelerate to get better-than-hmatrix performance on a program similar to this one:

module Main where

import Data.IORef
import Data.Array.Accelerate
import Data.Array.Accelerate.CUDA
--import Data.Array.Accelerate.Interpreter

import Prelude hiding (zipWith, replicate, (++), sum, map)

type Matrix e = Array DIM2 e

dot :: (Elt a, IsNum a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dot xs ys = fold1 (+) (zipWith (*) xs ys)

norm xs = map sqrt (dot xs xs)

mult
  :: (Elt a, IsNum a) =>
     Acc (Matrix a) -> Acc (Vector a) -> Acc (Vector a)

mult m v = fold1 (+) (zipWith (*) m vs)
  where
    n = indexHead . indexTail $ shape m
    vs = replicate (lift $ Z :. n :. All) v

main = do
  let n = 1000 :: Int

  let m = fill (constant $ Z :. n :. n) (constant (1 :: Float))
  vRef <- newIORef $ fill (constant $ Z :. n) (constant 1)

  let iterate = do
        v <- readIORef vRef
        let mv = mult m v
            vv = norm v
            v' = map (/ the vv) mv
        --writeIORef vRef (compute v')
        writeIORef vRef v'

  traverse (const iterate) [1..1000]

  v <- readIORef vRef
  print $ run (norm v)

Here's the output of profiling:

        Tue Feb  9 14:27 2016 Time and Allocation Profiling Report  (Final)

           TestAccelerate +RTS -p -h -s -RTS

        total time  =        4.03 secs   (4032 ticks @ 1000 us, 1 processor)
        total alloc = 4,114,779,024 bytes  (excludes profiling overheads)

COST CENTRE         MODULE                                     %time %alloc

compile.code        Data.Array.Accelerate.CUDA.Compile          15.1   33.1
ppr                 Data.Array.Accelerate.CUDA.CodeGen.Base     10.1   11.1
hashWithSalt        Data.Array.Accelerate.CUDA.CodeGen.Monad     2.8    5.3
lookup/go           Data.HashTable.ST.Basic                      1.6    0.1
compile.key         Data.Array.Accelerate.CUDA.Compile           1.5    3.7
rebuildPreOpenExp   Data.Array.Accelerate.Trafo.Substitution     1.4    1.3
==                  Data.Array.Accelerate.Trafo.Sharing          1.4    1.8
codegenOpenExp.cvtE Data.Array.Accelerate.CUDA.CodeGen           1.2    1.2
withLifetime        Data.Array.Accelerate.Lifetime               1.1    0.1
insert              Data.HashTable.ST.Basic                      1.1    0.3
readDelLoad         Data.HashTable.ST.Basic                      1.1    0.6
blockSize           Data.Array.Accelerate.CUDA.Analysis.Launch   0.8    1.7
fmap                Data.Array.Accelerate.Trafo.Substitution     0.8    1.2
readArray           Data.HashTable.Internal.IntArray             0.7    1.0

And here's the output of -s:

   6,372,076,320 bytes allocated in the heap
   3,751,952,272 bytes copied during GC
     164,069,960 bytes maximum residency (49 sample(s))
       2,371,928 bytes maximum slop
             328 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     12011 colls,     0 par    3.477s   3.440s     0.0003s    0.0012s
  Gen  1        49 colls,     0 par    5.114s   5.071s     0.1035s    0.2412s

  TASKS: 6 (2 bound, 4 peak workers (4 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time    4.539s  (  6.391s elapsed)
  GC      time    6.669s  (  6.604s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    1.922s  (  1.907s elapsed)
  EXIT    time    0.003s  (  0.003s elapsed)
  Total   time   13.140s  ( 12.999s elapsed)

  Alloc rate    1,403,793,613 bytes per MUT second

  Productivity  34.6% of total user, 35.0% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

As you can see, a huge amount of time is spent in GC. That clearly should not be the case for a program that's just number crunching.

An obvious potential issue is that I'm building a very large kernel with each iteration and only running it at the end. The solution would be to express the iteration as an Acc -> Acc function and then use awhile to run it a fixed number of times. Unfortunately, it is very difficult for me to turn my real program into a pure Acc -> Acc function, since it does a large graph traversal that relies on IORefs. I've also tried inserting a compute (id >-> id) into each iteration, with no effect.
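
For reference, the awhile version of the loop would look roughly like this. It is only a sketch: step stands in for the mult/norm/map body above, an iteration counter is threaded through the loop as an extra Scalar Int, and Accelerate is imported qualified here to avoid the Prelude clashes (A.<* is the Exp comparison in the accelerate version this thread targets):

import           Data.Array.Accelerate (Acc, Scalar, Vector)
import qualified Data.Array.Accelerate as A

-- Run one Acc-level step k times entirely on the device by threading a
-- counter through awhile alongside the vector.
iterateN :: (Acc (Vector Float) -> Acc (Vector Float))  -- one iteration step
         -> Int                                         -- number of iterations
         -> Acc (Vector Float)                          -- initial vector
         -> Acc (Vector Float)
iterateN step k v0 = final
  where
    (_, final)  = split (A.awhile continue body start)

    -- view the loop state as a pair of device arrays
    split :: Acc (Scalar Int, Vector Float) -> (Acc (Scalar Int), Acc (Vector Float))
    split       = A.unlift

    start       = A.lift (A.unit 0, v0)
    continue st = let (i, _) = split st in A.unit (A.the i A.<* A.constant k)
    body st     = let (i, v) = split st in A.lift (A.unit (A.the i + 1), step v)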

vladfi1 avatar Feb 09 '16 19:02 vladfi1

Yeah, I'm pretty sure this is happening because you're generating a huge kernel. You can do "print $ norm v" and it should print the AST for you to see.

Why can't you do run1 for each iteration? Take the current state, pass it into run1, take the result out, modify it and call run1 again. "run1" will cache the simple kernel it generated for one iteration and should run fast for subsequent calls to it.

ghost avatar Feb 10 '16 05:02 ghost

@cpdurham: As I said, my real program is doing a complicated graph traversal involving IORefs at each node and thus is very difficult to express as a pure function. Also, won't calling run1 cause needless data transfer?

vladfi1 avatar Feb 10 '16 19:02 vladfi1

@vladfi1 Gotcha, but unless there is an optimization happening under the hood (you can check by printing the AST), the compiled CUDA code is literally copied 1000 times, with a different constant coming from each of those IORefs. Most of your time is probably spent just compiling this monster. (Just checked: the AST is 27,304 lines.)

You are also expressing a pure function either way; in this case, the "pure" function is the hard-coded AST that is then sent to CUDA. The AST never sees those IORefs. Not to mention that your IORef is actually storing part of that AST.

In my experience, doing a run1 for each iteration is going to be way faster than how you are going about it right now, at least until you come up with a way to do the whole thing on the GPU.

module Main where

import Data.IORef
import Data.Array.Accelerate
import Data.Array.Accelerate.CUDA
--import Data.Array.Accelerate.Interpreter

import Prelude hiding (zipWith, replicate, (++), sum, map)

type Matrix e = Array DIM2 e

dot :: (Elt a, IsNum a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dot xs ys = fold1 (+) (zipWith (*) xs ys)

norm xs = map sqrt (dot xs xs)

mult
  :: (Elt a, IsNum a) =>
     Acc (Matrix a) -> Acc (Vector a) -> Acc (Vector a)

mult m v = fold1 (+) (zipWith (*) m vs)
  where
    n = indexHead . indexTail $ shape m
    vs = replicate (lift $ Z :. n :. All) v

main = do
  let n = 1000 :: Int

  let m = fill (constant $ Z :. n :. n) (constant (1 :: Float))
      vs = run $ fill (constant $ Z :. n) (constant 1)

  -- run1 compiles the single-iteration step once and caches the kernel,
  -- so subsequent calls only pay for the array transfers
  let iterate =
        run1 (\v -> let
                 mv = mult m v
                 vv = norm v
                 v' = map (/ the vv) mv
              in
                v')
  print $ run1 norm $ doN iterate 999 vs

doN :: (a -> a) -> Int -> a -> a
doN f = doN'
  where doN' 0 a = a
        doN' n a = let a' = f a in a' `seq` doN' (n-1) a'

Your original code ran in about 5 seconds on my box; this version runs in about 0.8. I realize that this probably isn't how you can use it for your application, but hopefully it gives some ideas on how you might be able to restructure your code. It isn't ideal because, like you said, it transfers a lot of data back and forth (I'm getting 10k iterations in 2 seconds and 100k in 20), but it should be better than having the entire thing compiled all at once.

Good luck!

ghost avatar Feb 10 '16 20:02 ghost

Well once I have a function in the form taken by run1 (Acc -> Acc) I can iterate it for a fixed number of times on the GPU with awhile.

I think I'm starting to get a better understanding of what's going on here. Basically, run1 f compiles f and bakes in the memory transfer to and from the GPU. To avoid lots of needless recompilation I should store the result of calling run1 somewhere (iterate as you have written it) and reuse it. Presumably this could also be done in a way that doesn't bake in the memory transfer; such a function would probably be called compile :: (Acc a -> Acc b) -> Acc a -> Acc b.

vladfi1 avatar Feb 10 '16 20:02 vladfi1

Actually, probably the best way to think about it is "compile". From what I understand, run1 takes the AST, compiles it down to CUDA, and stores the compiled kernel to call quickly when needed. If you were to use "run" instead, any time something in the AST changes the whole thing needs to be recompiled. Thus, "run1" is quick as long as the only thing changing is the argument passed as run1's second parameter. I'm not too familiar with the inner workings of accelerate and accelerate-cuda, but if I had to guess I'd say the AST is compiled and the resulting kernel is then called through the foreign function interface with the array that is passed in.

"Well once I have a function in the form taken by run1 (Acc -> Acc) I can iterate it for a fixed number of times on the GPU with awhile."

The reason why I assumed you needed the "Acc -> Acc" function is so you can do some non-GPU updates to a state and then pass it back down into the GPU. But yeah, if you figure out a way to do the whole thing on the GPU, then you can just iterate inside the GPU instead and skip the back and forth between GPU and CPU. I had a similar project where I was processing camera data: run1 f would take my state and an image and return an updated state. Once I understood the problem better, I just transferred chunks of images all at once to be processed.

ghost avatar Feb 10 '16 21:02 ghost

To clarify, by Acc I mean a type that is stored on the GPU. run :: Acc a -> a removes the Acc wrapper and copies to the CPU. Maybe this isn't a great way to think about this since in reality the Acc things are syntax trees...

As it turns out, my real use case doesn't always perform lots of iterations of the same kernel/IO action. Sometimes I just do one really big graph traversal, and then I'm done with the graph and move on to a new one. This seems like a pretty hard case to optimize. However, the individual operations performed at each node are always selected from a relatively small pool (dot, mult, norm, etc.), so I suspect that if I could get accelerate to cache each of these individual operations, the cost of combining them during the big graph traversal shouldn't be too high?
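
Concretely, I imagine something along these lines (just a sketch, reusing mult, norm, and the Matrix alias from the program above; the names multC and normC are made up). Each primitive gets compiled once by run1 and the compiled function is reused from the host-side traversal, so every call pays only the host/device transfer cost rather than recompilation:

-- matrix-vector multiply, taking both arrays as a single tupled run1 argument
-- so that one cached kernel serves every matrix and vector
multC :: (Matrix Float, Vector Float) -> Vector Float
multC = run1 step
  where
    step :: Acc (Matrix Float, Vector Float) -> Acc (Vector Float)
    step mv = let (m, v) = unlift mv :: (Acc (Matrix Float), Acc (Vector Float))
              in  mult m v

normC :: Vector Float -> Scalar Float
normC = run1 norm

The traversal would then call multC (m, v), normC v, and so on, on plain host-side arrays. Keeping multC and normC as top-level, eta-reduced bindings matters here, so the function returned by run1 (and its compiled kernel) is shared across calls.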

Also thanks @cpdurham for the help!

vladfi1 avatar Feb 10 '16 21:02 vladfi1

I realize now that the compile :: (Acc a -> Acc b) -> Acc a -> Acc b I gave is not possible, since Acc is backend-agnostic. Is there any way to help the compiler figure out which functions it's seen before to save on compilation time? I suppose I could go all the way and generate my own Acc ASTs that maximize sharing (as opposed to the bloated repetitive ones that are currently generated), but that still suffers from the same issue that the kernels themselves might be difficult to cache across calls to run.

vladfi1 avatar Feb 11 '16 01:02 vladfi1

After taking a look at some of the Accelerate code, I see that there are actually two different versions of Acc: a "HOAS" version defined in Data.Array.Accelerate.Smart, and a De Bruijn version defined in Data.Array.Accelerate.AST. The Acc that all the user-facing accelerate functions work on is the first, but when you print out an Acc, a conversion to the De Bruijn representation is done first. This conversion, defined in Data.Array.Accelerate.Trafo, also attempts to "recover sharing". My problem right now is figuring out how to generate the HOAS ASTs with sharing baked in; unfortunately, only the De Bruijn ASTs have an explicit let construct.

Not that it's really relevant, but I was also confused by the use of nested tuples for the Acc environments instead of type-level lists.

vladfi1 avatar Feb 11 '16 20:02 vladfi1

One problem might be that, because of the way kernel caching works at the moment, it still needs to generate the code for a kernel before it can check whether it already exists in the cache; if it does, we avoid the compilation time, and since nvcc might take several seconds per kernel to compile, this is a big win. In your program, where many, many operations need to be checked against a relatively small cache, being able to do the lookup earlier and avoid code generation as well should help you (and everybody else!). We've discussed doing this before, but I haven't had time to look into it yet... (and unfortunately I don't think I will be able to until around mid-March, due to impending ~~doom~~ deadlines).

The great thing about run1 (as @cpdurham explained, thanks!!) is that it avoids all this nonsense; the function it returns is ready to execute immediately, with no compiling kernels or cache lookups required. That is great if you want to call that thing more than once, but probably won't help you if your program doesn't fall into this model. (Setting up a run1 and using it only once is equivalent to just calling run, though, so no need to worry about extra overhead.)

tmcdonell avatar Feb 12 '16 16:02 tmcdonell

We should also get an accelerate-blas package up, so that the individual mult, dot, etc., operations are efficient.

tmcdonell avatar Feb 12 '16 17:02 tmcdonell

Right now I have two semi-independent goals:

  1. Generate ASTs where the matrix ops are explicitly let-shared. I'm not sure if this is even possible; maybe only non-functions are allowed to be let-bound?
  2. Successfully cache reused operations. This would involve calling CUDA backend functions myself. Essentially what I'd like is a function like run1, but which doesn't assume that I want to transfer to and from the GPU. One guess for how to do this is to re-use the compiled kernels as foreign functions, which can be embedded into larger ASTs.

@tmcdonell Any tips would be appreciated!

vladfi1 avatar Feb 12 '16 19:02 vladfi1

I've convinced myself that goal #1 is impossible, since the De Bruijn Alet constructor only lets you bind array terms (acc values), not functions, and the HOAS functions (Data.Array.Accelerate.Trafo.Sharing) are constrained to be first-order.

Goal #2 seems like it might be doable, though. I just need to figure out how to get the CUDA backend to spit out something accelerate understands to be a Foreign function. This should allow for a pretty nice mixing of compiled and uncompiled kernels.

There are still a bunch of details I don't quite understand yet, like what's up with all the "Delayed" types, but hopefully those won't be an issue.

vladfi1 avatar Feb 12 '16 22:02 vladfi1

Bump? Does compiling kernels into Foreign functions sound like a workable solution?

vladfi1 avatar Feb 24 '16 01:02 vladfi1

If you turn on debugging mode, you can get it to spit out the generated code with -ddump-cc. You could compile those yourself and turn them into foreign function calls. It's not designed for that, though, so it might be difficult to relate each kernel to the corresponding part of your program. If you can do it, it might be worthwhile, given the profile you pasted earlier.

Delayed comes from combining multiple operations into a single kernel. dot, for example, will compile to a single kernel rather than two, even though that is how it was written.

tmcdonell avatar Feb 24 '16 22:02 tmcdonell