
Introduce basicUnsafeIndexM#

Open · Bodigrim opened this issue 1 year ago · 6 comments

#485 did only half of the job: while GHC now knows that the index is used strictly, it still would not necessarily unbox it, because basicUnsafeIndexM must receive an Int, not an Int#. Only after inlining does an opportunity to erase the boxing arise.

This patch introduces basicUnsafeIndexM# to help GHC further. If it looks good, I'll go for basicUnsafeRead# / basicUnsafeWrite# in another PR.
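A minimal sketch of the shape of the change, on a cut-down stand-in class with only the indexing methods (the Box-returning signature is assumed from the earlier refactoring and may differ in detail from the actual patch; the point is the mutually recursive defaults):

{-# LANGUAGE MagicHash, MultiParamTypeClasses #-}
import GHC.Exts (Int#, Int (I#))

-- Stand-in for vector's Box wrapper returned by basicUnsafeIndexM.
data Box a = Box a

-- Cut-down stand-in for the Vector class in Data.Vector.Generic.Base.
class GVector v a where
  -- Existing method, taking a boxed Int.
  basicUnsafeIndexM  :: v a -> Int  -> Box a
  basicUnsafeIndexM v (I# i#) = basicUnsafeIndexM# v i#

  -- New method taking an unboxed Int#, so callers need not allocate
  -- (or rely on GHC erasing) an Int box just to index.
  basicUnsafeIndexM# :: v a -> Int# -> Box a
  basicUnsafeIndexM# v i# = basicUnsafeIndexM v (I# i#)

  -- The defaults are mutually recursive: instances that define only
  -- basicUnsafeIndexM keep compiling unchanged.
  {-# MINIMAL basicUnsafeIndexM | basicUnsafeIndexM# #-}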

Bodigrim avatar Mar 23 '24 00:03 Bodigrim

This is, however, a very breaking change. Due to the design of unboxed vectors (which I consider a mistake), this change will break all unboxed vectors: they require defining data instances and MVector/Vector instances for each type.

Shimuuar avatar Mar 23 '24 12:03 Shimuuar

@Shimuuar what exactly would break? I tried a couple of packages with manual instances for unboxed vectors; they seem to compile fine.

Bodigrim avatar Mar 23 '24 17:03 Bodigrim

Sorry, I missed the mutually recursive defaults of basicUnsafeIndexM and basicUnsafeIndexM#. I withdraw my objection.

Shimuuar avatar Mar 23 '24 17:03 Shimuuar

I tried to measure the impact of this optimization. To do so I added a simple benchmark which computes the variance of a vector of doubles (branches index-profiling and index-profiling-B) in two otherwise identical variants: one with an INLINE pragma and another with NOINLINE. It reads from the vector, so it should exercise the changes to basicUnsafeIndexM. I ran it with GHC 8.10, 9.2, 9.4, 9.6, and 9.8.

import qualified Data.Vector.Generic as VG

-- Non-inlined variant, used to measure the generic (unspecialized) code path.
varianceNoInline :: (VG.Vector v Double) => v Double -> Double
{-# NOINLINE varianceNoInline #-}
varianceNoInline xs
  = VG.sum (VG.map (\x -> (x - s)^(2::Int)) xs) / n
  where
    n = fromIntegral $ VG.length xs
    s = VG.sum xs / n
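For reference, the INLINE counterpart is identical apart from the pragma (reconstructed from the description above; the actual code lives in the index-profiling branches):

-- Inlined variant, which GHC specializes at the call site.
varianceInline :: (VG.Vector v Double) => v Double -> Double
{-# INLINE varianceInline #-}
varianceInline xs
  = VG.sum (VG.map (\x -> (x - s)^(2::Int)) xs) / n
  where
    n = fromIntegral $ VG.length xs
    s = VG.sum xs / n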

The variant specialized by GHC runs in constant space. The changes to indexing make no difference, and performance is stable across all GHC versions at 6 cycles/element.

The non-specialized situation is much more interesting. Adding strictness (#485) makes no difference at all in this benchmark. The programs seem to be identical; at least the number of instructions is exactly the same, so below I compare the baseline (index[base]) and this PR (index[unboxed]).

Allocations

Below are allocations per array element:

    ghc             tag  B/element
 8.10.7     index[base]         96
  9.2.8     index[base]         96
  9.4.8     index[base]         96
  9.6.3     index[base]         96
  9.8.1     index[base]         96
 8.10.7  index[unboxed]         64
  9.2.8  index[unboxed]         64
  9.4.8  index[unboxed]         64
  9.6.3  index[unboxed]        128
  9.8.1  index[unboxed]        128

This PR is a clear improvement for GHC <= 9.4, but somehow it becomes a pessimization for GHC >= 9.6. I haven't looked into the Core yet, so I have no idea why.

Performance

Runtime performance follows the same pattern:

    ghc             tag  INS/element  CYC/element
 8.10.7     index[base]  155.5        44.6
 8.10.7  index[unboxed]  128.6        32.7
  9.2.8     index[base]  153.9        44.9
  9.2.8  index[unboxed]  127.2        33.6
  9.4.8     index[base]  153.9        44.1
  9.4.8  index[unboxed]  127.3        34.1
  9.6.3     index[base]  189.9        65.0
  9.6.3  index[unboxed]  372.6       125.6
  9.8.1     index[base]  189.9        60.8
  9.8.1  index[unboxed]  372.6       120.3

A ~25% win for GHC <= 9.4 and a ~100% loss for GHC >= 9.6. The latter performs worse even without this optimization. It looks like we have some regression in the GHC optimizer.

But I haven't looked into the Core yet, and the answer lies there.

Shimuuar avatar Apr 01 '24 17:04 Shimuuar

First of all, some estimates. We need to perform indexing twice per element, which at the very least means allocating 2 Box Double (2 × 32 B = 64 B) per element with this optimization and 2 Box Double & 2 Int (2 × 32 B + 2 × 16 B = 96 B) without. That matches the measured numbers for GHC <= 9.4, so the generated code is rather tight. Something happened between 9.4 and 9.6.

The cleaned-up Core for the non-specialized function can be seen in the gist. The Core is very similar across GHC versions and with/without this optimization. Notable differences:

GHC 9.4 → 9.8

GHC changed how the worker-wrapper transformation is done. GHC 9.4 passes the basicLength and basicUnsafeIndexM class methods into the worker function, while 9.8 passes the full dictionary:

9.4
varianceNoInline :: forall (v :: * -> *). Vector v Double => v Double -> Double
varianceNoInline
  = \ (@(v_ :: * -> *))
      ($dVector_s2fQ :: Vector v_ Double)
      (xs_s2g2 :: v_ Double) ->
      case $dVector_s2fQ of
      { C:Vector ww_s2fS ww1_s2fT ww2_s2fU ww3_s2fV ww4_s2fW ww5_s2fX
                 ww6_s2fY ww7_s2fZ ww8_s2g0 ->
      case $wvarianceNoInline @v_ ww3_s2fV ww6_s2fY xs_s2g2 of ww9_s2g6
      { __DEFAULT -> D# ww9_s2g6 }}
9.8
-- RHS size: {terms: 10, types: 9, coercions: 0, joins: 0/0}
varianceNoInline
  :: forall (v :: * -> *). Vector v Double => v Double -> Double
varianceNoInline
  = \ (@(v_ :: * -> *))
      ($dictVec :: Vector v_ Double)
      (vec0 :: v_ Double) ->
      case $wvarianceNoInline @v_ $dictVec vec0 of ww_sfOR
      { __DEFAULT -> D# ww_sfOR }
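For reproduction, Core like the two excerpts above can be dumped with GHC's standard flags; a sketch of a typical invocation (the exact flags and module name used for the gist are not stated in the thread):

ghc -O2 -ddump-simpl -dsuppress-all -dsuppress-uniques -ddump-to-file Bench.hs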

Apparently the lookup of functions in the dictionary causes the performance degradation (44 CYC/elt → 65 CYC/elt).

But the allocation picture is a mystery:

  • case basicUnsafeIndexM @v_ @Double $dict vec (I# sc_sfOy) of ... allocates a Box Double (32 B) & an Int (16 B).
  • case basicUnsafeIndexM# @v_ @Double $dictVec vec sc_sfPc of ... allocates a Box Double (32 B) and another 32 B somewhere else. The rest of the Core is essentially the same.

P.S. The Box trick seems to be terribly wasteful when GHC can't specialize: it doubles allocations for small values.
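For context, a sketch of the Box shape in question (assumed to be a plain single-constructor wrapper, as in the earlier sketch), annotated with the per-read cost it implies when GHC cannot specialize:

-- Lazy, lifted wrapper used as the return type of basicUnsafeIndexM.
data Box a = Box a

-- In unspecialized Core each element read looks roughly like
--   case basicUnsafeIndexM $dict v i of Box x -> ... x ...
-- allocating the Box constructor (header + pointer = 16 B on 64-bit)
-- plus the D# box it points to (another 16 B): the 32 B per read
-- observed above, i.e. twice what the bare Double costs.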

Shimuuar avatar Apr 03 '24 10:04 Shimuuar

I found that using coerce in the unboxed instances which just wrap primitive vectors gives a nice performance boost when specialization doesn't happen: 3-10% for the benchmark above. It also removes two branch instructions per index. Overall it's a cheap optimization.

There is no change in the case when specialization happens.
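A sketch of the pattern on a self-contained cut-down class (not vector's real instance machinery, which also needs MVector and data-family instances); the point is only how coerce replaces the unwrap/rewrap plumbing:

{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, ScopedTypeVariables #-}
import Data.Coerce (coerce)

data Box a = Box a

-- Cut-down stand-in for the indexing part of vector's Vector class.
class GVector v a where
  basicUnsafeIndexM :: v a -> Int -> Box a

-- Stand-ins for a primitive vector and an unboxed vector that is
-- just a newtype around it.
newtype PrimVec a    = PrimVec [a]
newtype WrappedVec a = WrappedVec (PrimVec a)

instance GVector PrimVec a where
  basicUnsafeIndexM (PrimVec xs) i = Box (xs !! i)

instance GVector WrappedVec a where
  -- Previous style: unwrap the newtype and delegate,
  --   basicUnsafeIndexM (WrappedVec v) i = basicUnsafeIndexM v i
  -- coerce style: the whole method is a zero-cost cast to the
  -- underlying instance, so unspecialized code skips the plumbing
  -- (and, per the measurement above, a couple of branches per index).
  basicUnsafeIndexM = coerce (basicUnsafeIndexM :: PrimVec a -> Int -> Box a)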

Shimuuar avatar Apr 08 '24 17:04 Shimuuar