Very slow approx_size for DataFrames
When benchmarking a parallel application that uses Dagger, it seems like MemPool.approx_size is the bottleneck, because for a DataFrame it falls back to Base.summarysize.
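For context, the generic fallback is essentially the following (a paraphrase, not necessarily the exact definition in MemPool's source):

approx_size(d) = Base.summarysize(d)  # walks the entire object graph of d

Since Base.summarysize visits every reachable object, a million-row table means on the order of millions of visits and allocations per call.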
Here is a quick MWE:
julia> using BenchmarkTools, DataFrames, MemPool
julia> df = DataFrame(a=1:1000_000, b=randn(1000_000), c=repeat([:aa], 1000_000));
julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial:
memory estimate: 61.03 MiB
allocs estimate: 1999540
--------------
minimum time: 110.895 ms (4.59% GC)
median time: 119.604 ms (2.47% GC)
mean time: 122.978 ms (2.83% GC)
maximum time: 146.009 ms (1.46% GC)
--------------
samples: 41
evals/sample: 1
Here is a sketch of an alternative implementation which is much faster:
julia> function MemPool.approx_size(df::DataFrame)
           dsize = mapreduce(MemPool.approx_size, +, eachcol(df))
           namesize = mapreduce(MemPool.approx_size, +, names(df))
           return dsize + namesize
       end
julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial:
memory estimate: 704 bytes
allocs estimate: 13
--------------
minimum time: 535.700 μs (0.00% GC)
median time: 636.800 μs (0.00% GC)
mean time: 664.967 μs (0.00% GC)
maximum time: 1.525 ms (0.00% GC)
--------------
samples: 7499
evals/sample: 1
The above implementation is not 100% correct, but I hope it shows that there is some potential for improvement.
I don't know whether there is an interface, e.g. Tables.jl, that could be used to avoid taking a direct dependency on DataFrames.
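As a rough illustration of that idea, here is a minimal sketch of a Tables.jl-based estimate. It is only a sketch: the helper name approx_size_table is made up, and since Tables.jl provides no abstract type to dispatch on, MemPool could not simply add a method like the DataFrame one above; the table check would have to happen at runtime or in a package extension.

using Tables, MemPool

# Hypothetical helper (not part of MemPool): estimate the size of any
# Tables.jl-compatible table column by column, falling back to
# Base.summarysize for non-table values.
function approx_size_table(t)
    Tables.istable(t) || return Base.summarysize(t)
    cols = Tables.columns(t)
    sz = 0
    for name in Tables.columnnames(cols)
        sz += MemPool.approx_size(Tables.getcolumn(cols, name))  # per-column estimate
        sz += MemPool.approx_size(String(name))                  # column name
    end
    return sz
end

For the DataFrame above this takes essentially the same per-column path as the sketch earlier, but it would also cover other Tables.jl sources without MemPool depending on any of them.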