Very slow approx_size for DataFrames

Open DrChainsaw opened this issue 4 years ago • 0 comments

When benchmarking a parallel application that uses Dagger, MemPool.approx_size appears to be the bottleneck because it falls back to Base.summarysize for DataFrames.

Here is a quick MWE:

julia> using BenchmarkTools, DataFrames, MemPool

julia> df = DataFrame(a=1:1000_000, b=randn(1000_000), c=repeat([:aa], 1000_000));

julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial: 
  memory estimate:  61.03 MiB
  allocs estimate:  1999540
  --------------
  minimum time:     110.895 ms (4.59% GC)
  median time:      119.604 ms (2.47% GC)
  mean time:        122.978 ms (2.83% GC)
  maximum time:     146.009 ms (1.46% GC)
  --------------
  samples:          41
  evals/sample:     1

Here is a sketch of an alternative implementation which is much faster:

julia> function MemPool.approx_size(df::DataFrame)
       dsize = mapreduce(MemPool.approx_size, +, eachcol(df))
       namesize = mapreduce(MemPool.approx_size, +, names(df))
       return dsize + namesize
       end

julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial: 
  memory estimate:  704 bytes
  allocs estimate:  13
  --------------
  minimum time:     535.700 μs (0.00% GC)
  median time:      636.800 μs (0.00% GC)
  mean time:        664.967 μs (0.00% GC)
  maximum time:     1.525 ms (0.00% GC)
  --------------
  samples:          7499
  evals/sample:     1

The above implementation is not 100% correct, but I hope it shows that there is some potential for improvement.

I don't know whether there is an interface that could be used to avoid a direct dependency on DataFrames, e.g. Tables.jl.
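For illustration, here is a rough, untested-in-MemPool sketch of what a Tables.jl-based variant might look like. The helper name `table_approx_size` is hypothetical and not part of MemPool; it only assumes the documented Tables.jl accessors (`Tables.columns`, `Tables.columnnames`, `Tables.getcolumn`) and that `MemPool.approx_size` works on the individual column vectors, as in the sketch above. Like that sketch, it is an approximation: it ignores any per-table overhead and would double count columns that share memory.

```julia
using Tables, MemPool

# Hypothetical helper (not part of MemPool): approximate the size of any
# Tables.jl-compatible table by summing the sizes of its columns and names.
function table_approx_size(tbl)
    cols = Tables.columns(tbl)            # column-accessible view of the table
    total = 0
    for name in Tables.columnnames(cols)  # column names are Symbols
        col = Tables.getcolumn(cols, name)
        total += MemPool.approx_size(col) # per-column estimate
        total += sizeof(String(name))     # bytes used by the column name
    end
    return total
end
```

Something like `table_approx_size(df)` could then be called from a generic method guarded by `Tables.istable`, so MemPool would only need to depend on Tables.jl rather than DataFrames. Whether that dependency is acceptable is of course up to the maintainers.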

Mar 04 '21 21:03 DrChainsaw