Bad performance on `Array a` and `MutArray a` while using as `Ptr`
Since I observed write performance issue in my program, I found that it was caused by Array Word8 accidentally.
Here's the benchmark that could reproduce it. https://gist.github.com/TheKK/251fc1fe24600165ce1b2db922f1ac2d
I tried implementing asUnsafePtr without MonadIO m constraint as a reference. Below are the results from my laptop:
± cabal run -O2 bench:cdc -- +RTS -s -T -RTS --regress cycles:iters --regress mutatorCpuSeconds:iters --regress gcCpuSeconds:iters --regress allocated:iters
Up to date
benchmarking write/Array
time 260.8 μs (255.9 μs .. 266.8 μs)
0.998 R² (0.997 R² .. 0.999 R²)
mean 257.1 μs (255.1 μs .. 259.5 μs)
std dev 7.116 μs (5.600 μs .. 10.51 μs)
cycles: 0.998 R² (0.997 R² .. 0.999 R²)
iters 572419.159 (562076.184 .. 585264.385)
y -1883493.983 (-3370725.280 .. -642113.082)
mutatorCpuSeconds: 0.998 R² (0.997 R² .. 0.999 R²)
iters 2.571e-4 (2.521e-4 .. 2.635e-4)
y -7.855e-4 (-1.474e-3 .. -1.975e-4)
gcCpuSeconds: 0.991 R² (0.984 R² .. 0.995 R²)
iters 9.898e-6 (9.618e-6 .. 1.014e-5)
y -3.252e-5 (-7.490e-5 .. 1.191e-5)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 693884.116 (693882.854 .. 693885.515)
y 2282.946 (1758.990 .. 2816.548)
variance introduced by outliers: 21% (moderately inflated)
benchmarking write/ByteString
time 146.7 μs (142.4 μs .. 152.9 μs)
0.992 R² (0.988 R² .. 0.998 R²)
mean 146.1 μs (143.6 μs .. 149.5 μs)
std dev 9.761 μs (7.391 μs .. 12.28 μs)
cycles: 0.992 R² (0.988 R² .. 0.998 R²)
iters 321951.335 (312722.537 .. 334991.437)
y -1406715.340 (-3578381.007 .. 136008.990)
mutatorCpuSeconds: 0.993 R² (0.989 R² .. 0.998 R²)
iters 1.457e-4 (1.417e-4 .. 1.517e-4)
y -5.871e-4 (-1.524e-3 .. 7.918e-5)
gcCpuSeconds: 0.910 R² (0.880 R² .. 0.945 R²)
iters 4.400e-6 (3.879e-6 .. 5.019e-6)
y -6.004e-5 (-1.586e-4 .. 2.950e-5)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 373790.274 (373789.508 .. 373791.086)
y 2574.756 (2152.690 .. 2998.360)
variance introduced by outliers: 65% (severely inflated)
benchmarking asPtr/Array
time 118.8 ns (116.9 ns .. 120.8 ns)
0.999 R² (0.998 R² .. 0.999 R²)
mean 118.0 ns (116.6 ns .. 119.2 ns)
std dev 4.218 ns (3.329 ns .. 5.294 ns)
cycles: 0.999 R² (0.998 R² .. 0.999 R²)
iters 260.828 (256.445 .. 265.217)
y -225886.094 (-556091.679 .. 75566.247)
mutatorCpuSeconds: 0.999 R² (0.997 R² .. 0.999 R²)
iters 1.184e-7 (1.164e-7 .. 1.203e-7)
y -8.632e-5 (-2.252e-4 .. 5.290e-5)
gcCpuSeconds: 0.991 R² (0.986 R² .. 0.997 R²)
iters 6.501e-10 (6.253e-10 .. 6.798e-10)
y -1.155e-6 (-3.425e-6 .. 8.469e-7)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 312.000 (312.000 .. 312.001)
y 2983.914 (2796.086 .. 3182.479)
variance introduced by outliers: 55% (severely inflated)
benchmarking asPtr/Array'
time 10.91 ns (10.83 ns .. 10.98 ns)
1.000 R² (0.999 R² .. 1.000 R²)
mean 10.85 ns (10.72 ns .. 10.94 ns)
std dev 346.9 ps (245.6 ps .. 444.4 ps)
cycles: 1.000 R² (0.999 R² .. 1.000 R²)
iters 23.951 (23.792 .. 24.105)
y -94395.670 (-226486.084 .. 24880.470)
mutatorCpuSeconds: 1.000 R² (0.999 R² .. 1.000 R²)
iters 1.090e-8 (1.082e-8 .. 1.097e-8)
y -3.215e-5 (-8.781e-5 .. 2.575e-5)
gcCpuSeconds: 0.993 R² (0.989 R² .. 0.997 R²)
iters 3.418e-11 (3.331e-11 .. 3.534e-11)
y -5.555e-7 (-1.316e-6 .. 1.362e-7)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 16.000 (16.000 .. 16.000)
y 2989.826 (2810.229 .. 3178.715)
variance introduced by outliers: 53% (severely inflated)
benchmarking asPtr/ByteString
time 10.16 ns (10.07 ns .. 10.26 ns)
0.999 R² (0.998 R² .. 0.999 R²)
mean 10.10 ns (10.01 ns .. 10.21 ns)
std dev 330.1 ps (251.7 ps .. 417.1 ps)
cycles: 0.999 R² (0.998 R² .. 0.999 R²)
iters 22.290 (22.099 .. 22.558)
y 65680.770 (-130834.465 .. 281810.287)
mutatorCpuSeconds: 0.999 R² (0.998 R² .. 0.999 R²)
iters 1.016e-8 (1.007e-8 .. 1.028e-8)
y 3.956e-5 (-5.135e-5 .. 1.400e-4)
gcCpuSeconds: NaN R² (NaN R² .. NaN R²)
iters 0.000 (0.000 .. 0.000)
y 0.000 (0.000 .. 0.000)
allocated: 0.000 R² (0.000 R² .. 0.017 R²)
iters 3.631e-7 (-4.365e-5 .. 4.944e-5)
y 2991.437 (2807.870 .. 3177.496)
variance introduced by outliers: 55% (severely inflated)
Marking arr_asPtrUnsafe and ma_asPtrUnsafe as NOINLINE makes benchmark, "asPtr/Array", drop to 15 ns. Still pretty fast comparing to "asPtr/Array" which is 118 ns.
By looking at the CORE, it's obvious that Array.asPtrUnsafe was called by worker wrapper which means one level of indirection.
Main.$wf1
= \ (ww_sa1F
:: ghc-prim:GHC.Prim.MutableByteArray# ghc-prim:GHC.Prim.RealWorld)
(ww1_sa1G :: ghc-prim:GHC.Prim.Int#) ->
Streamly.Internal.Data.Array.Mut.Type.$wasPtrUnsafe <== THIS
@IO
@Word8
@()
Control.Monad.IO.Class.$fMonadIOIO <== THIS
ww_sa1F
ww1_sa1G
(Main.main25
`cast` (<Ptr Word8>_R
%<'Many>_N ->_R Sym (ghc-prim:GHC.Types.N:IO[0] <()>_R)
:: (Ptr Word8
-> ghc-prim:GHC.Prim.State# ghc-prim:GHC.Prim.RealWorld
-> (# ghc-prim:GHC.Prim.State# ghc-prim:GHC.Prim.RealWorld, () #))
~R# (Ptr Word8 -> IO ())))
So this should be a real issue that affect all operations of Array relates to its Ptr interface.
Thanks @TheKK for looking into this deeply.
I am wondering if this gets resolved if we write asPtrUnsafe in MutArray/Type.hs as follows:
asPtrUnsafe :: MonadIO m => MutArray a -> (Ptr a -> IO b) -> m b
asPtrUnsafe arr f = liftIO $ do
contents <- Unboxed.pin $ arrContents arr
let !ptr = Ptr (byteArrayContents#
(unsafeCoerce# (getMutableByteArray# contents)))
r <- f (ptr `plusPtr` arrStart arr)
touch contents
return r
Basically, just change the function f type to use IO and move liftIO at the top level of this function. If that works, it is a minimal change. Or maybe we can just return IO and remove the MonadIO constraint entirely. Because there not much point using liftIO in this API now, if the user wants they can use it themsleves.
We can also add a benchmark for this.
The solution you suggested sound good to me. I think we could keep MonadIO constraint there so the interface of asPtrUnsafe is intact even though it's pre-release API.
I would appreciate if you can find time to create a PR about this.
Sure!
We can add new APIs unsafeAsPtrIO to use IO monad directly, as we already have unsafeAsPtr.