CSV.jl
Performance regression since v0.8.0
Hi @quinnj ,
I think a performance regression similar to the one in #752 has happened again. With v0.8.0, the version that contains the fix from that issue, I get:
(blu) pkg> st
Status `/tmp/blu/Project.toml`
[6e4b80f9] BenchmarkTools v1.3.2
[336ed68f] CSV v0.8.0 `~/repos/CSV.jl`
[9a3f8284] Random
julia> @benchmark read($rows)
BenchmarkTools.Trial: 72 samples with 1 evaluation.
Range (min … max): 55.617 ms … 82.286 ms ┊ GC (min … max): 0.00% … 1.79%
Time (median): 70.487 ms ┊ GC (median): 2.00%
Time (mean ± σ): 69.935 ms ± 5.109 ms ┊ GC (mean ± σ): 1.19% ± 1.05%
▁▁ ▁ ▁ █ ▁▁▁▁
▄▁▁▁▁▁▁▄▁▁▁▁▄▁▁▄▇▄▄▄██▇▇█▄█▇▇▇▄▄▇▄█▇████▇▇▁▇▁▇▁▄▇▄▁▁▁▄▁▁▁▁▄ ▁
55.6 ms Histogram: frequency by time 82.1 ms <
Memory estimate: 24.41 MiB, allocs estimate: 800000.
With the latest version, 0.10.9, I get:
(blu) pkg> st
Status `/tmp/blu/Project.toml`
[6e4b80f9] BenchmarkTools v1.3.2
[336ed68f] CSV v0.10.9
[9a3f8284] Random
julia> @benchmark read($rows)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
Range (min … max): 519.511 ms … 575.751 ms ┊ GC (min … max): 2.39% … 1.39%
Time (median): 537.934 ms ┊ GC (median): 1.90%
Time (mean ± σ): 537.647 ms ± 16.968 ms ┊ GC (mean ± σ): 1.91% ± 0.42%
██ █ █ █ █ ██ █ █
██▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁█▁██▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
520 ms Histogram: frequency by time 576 ms <
Memory estimate: 109.86 MiB, allocs estimate: 2400000.
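To put the two results side by side, here is a small sketch that only does arithmetic on the numbers printed above. The row and column counts are taken from the benchmark script further down (100,000 rows, 8 fields per row, each field read once); the variable names are just for illustration.

```julia
# Figures copied from the two @benchmark outputs above.
old = (median_ms = 70.487,  mem_mib = 24.41,  allocs = 800_000)    # CSV v0.8.0
new = (median_ms = 537.934, mem_mib = 109.86, allocs = 2_400_000)  # CSV v0.10.9

# Shape of test.csv as written by the benchmark script.
nrows, ncols = 100_000, 8

slowdown    = new.median_ms / old.median_ms   # ratio of median times
mem_ratio   = new.mem_mib / old.mem_mib       # ratio of memory estimates
alloc_ratio = new.allocs / old.allocs         # ratio of allocation counts

# Allocations per field access (every row reads all 8 fields once).
allocs_per_field_old = old.allocs / (nrows * ncols)
allocs_per_field_new = new.allocs / (nrows * ncols)

println("slowdown (median):     ", round(slowdown, digits = 2), "x")
println("memory ratio:          ", round(mem_ratio, digits = 2), "x")
println("allocation ratio:      ", alloc_ratio, "x")
println("allocs per field, old: ", allocs_per_field_old)
println("allocs per field, new: ", allocs_per_field_new)
```

So the median is roughly 7.6 times slower, and each field access goes from one allocation to three.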
Profiling shows that most of the time in `getcolumn` (258 of 351 samples) is spent in calls to `coltype`:
351 100% | getcolumn /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:319
7 0.38% 79.55% 351 18.99% | getcolumn(::CSV.Row2{Parsers.PosLen, PosLenString}, ::Int64) /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:320
258 73.50% | coltype(::CSV.Column) /home/domi/.julia/packages/CSV/b8ebJ/src/utils.jl:23
45 12.82% | jl_apply_generic /home/domi/software/julia/julia/src/gf.c:2425
16 4.56% | getcolumn(::CSV.Row2{Parsers.PosLen, PosLenString}, ::Type{Union{Missing, PosLenString}}, ::Int64, ::Symbol) /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:-1
3 0.85% | coltype(::CSV.Column) /home/domi/.julia/packages/CSV/b8ebJ/src/utils.jl
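The profile shows `getcolumn(row, ::Int64)` calling `coltype`, then `jl_apply_generic`, then the typed `getcolumn` method, which looks like a type lookup at run time followed by dynamic dispatch. The sketch below illustrates that general pattern only. `LooseColumn`, `TightColumn`, and `getvalue` are hypothetical stand-ins and not CSV.jl's actual types or functions, and this is not a claim that it is the cause of the regression.

```julia
# Hypothetical stand-ins for illustration; these are NOT CSV.jl's real types.
struct LooseColumn
    type::Type           # element type stored as a run-time value
    data::Vector{Any}
end

struct TightColumn{T}
    data::Vector{T}      # element type is part of the column's own type
end

coltype(c::LooseColumn) = c.type
coltype(::TightColumn{T}) where {T} = T

# Typed accessor, selected by dispatching on the element type.
getvalue(::Type{T}, c, i::Int) where {T} = c.data[i]::T

# Generic entry point: look up the type, then call the typed accessor.
getvalue(c, i::Int) = getvalue(coltype(c), c, i)

loose = LooseColumn(String, Any["a", "b"])
tight = TightColumn(["a", "b"])

# Ask the compiler what return type it can infer for each entry point.
inferred_loose = only(Base.return_types(getvalue, (LooseColumn, Int)))
inferred_tight = only(Base.return_types(getvalue, (TightColumn{String}, Int)))

println("loose column infers: ", inferred_loose)
println("tight column infers: ", inferred_tight)
```

When the element type is only known at run time, the compiler cannot pick the typed method ahead of time, so the call goes through generic dispatch and the result is typically boxed. When the type is a parameter of the column, the lookup is resolved at compile time.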
`git bisect` identifies commit bfd415d0af4c7c1842ecc5b54f1e4a18b125a264 as the first bad commit:
git bisect start
# status: waiting for both good and bad commits
# bad: [cfb4ffb5d9847df4f8d5efa5129b27026a83a4a8] `typemap`: switch to IdDict (#1069)
git bisect bad cfb4ffb5d9847df4f8d5efa5129b27026a83a4a8
# status: waiting for good commit(s), bad commit known
# good: [c94256adbe6c2ae017be90ae91976f3b5bb74aa4] bump version
git bisect good c94256adbe6c2ae017be90ae91976f3b5bb74aa4
# bad: [549b1ab03155c8c96485406881addcc65ef914e8] Take dependency on InlineStrings package (#923)
git bisect bad 549b1ab03155c8c96485406881addcc65ef914e8
# good: [f405361298ac09692024c0afdf546df880899223] Bump version
git bisect good f405361298ac09692024c0afdf546df880899223
# bad: [ea9eca2de56470a8e585ae4ee92495a88a632fbb] bump version
git bisect bad ea9eca2de56470a8e585ae4ee92495a88a632fbb
# bad: [17bffa1a5e2bc570e913d1f0bac98e65e3aeb1e4] Overhaul CSV.jl docs (#869)
git bisect bad 17bffa1a5e2bc570e913d1f0bac98e65e3aeb1e4
# bad: [0eaacb4c4d5787a5186dac8edf523ee9052db27f] Keyword argument cleanup in preparation for 1.0 release (#846)
git bisect bad 0eaacb4c4d5787a5186dac8edf523ee9052db27f
# bad: [ffda8d35793eea4f254f51de213527e5ed55359a] Fix nightly
git bisect bad ffda8d35793eea4f254f51de213527e5ed55359a
# bad: [bfd415d0af4c7c1842ecc5b54f1e4a18b125a264] CSV parsing internals refactoring (#837)
git bisect bad bfd415d0af4c7c1842ecc5b54f1e4a18b125a264
# good: [fc209672d0a894954c5dc1e0835e0426b9d2925c] make "Edit on Github" points to main branch (#835)
git bisect good fc209672d0a894954c5dc1e0835e0426b9d2925c
# first bad commit: [bfd415d0af4c7c1842ecc5b54f1e4a18b125a264] CSV parsing internals refactoring (#837)
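The bisect log above was produced step by step. For anyone who wants to repeat it, a check script for `git bisect run` could look roughly like the sketch below. The file name `bisect_check.jl`, the helper names, and the 0.2 s threshold (chosen between the good median of about 0.07 s and the bad median of about 0.54 s) are assumptions, and the workload is a placeholder so that the sketch runs on its own.

```julia
# bisect_check.jl (hypothetical name): the exit status tells `git bisect run`
# whether the checked-out commit is good (0) or bad (1).

# Return the smallest wall-clock time in seconds over `n` calls of `f`.
function best_time(f; n::Int = 3)
    f()                                    # warm-up call so compilation is not timed
    minimum(@elapsed(f()) for _ in 1:n)
end

# Map a timing to the exit code git bisect expects: 0 = good, 1 = bad.
bisect_status(seconds; threshold) = seconds <= threshold ? 0 : 1

# Placeholder workload so the sketch is self-contained. In a real check this
# would be the `read(rows)` loop from the benchmark script.
workload() = sum(hash(i) for i in 1:10_000)

# 0.2 s is an assumed cut-off between the good and bad medians reported above.
status = bisect_status(best_time(workload); threshold = 0.2)
println("bisect status: ", status)
# In a real script, finish with `exit(status)`.
```

With the placeholder replaced and `exit(status)` added, it could be invoked as `git bisect run julia --project=/tmp/blu bisect_check.jl`, assuming the project uses the CSV.jl checkout as a development dependency, as in the first `pkg> st` output.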
The benchmarks and the bisect were run with the following script:
using CSV
using BenchmarkTools
using Random
using Pkg
Pkg.resolve()
Pkg.status()
Random.seed!(0)
open("test.csv", "w") do f
    for _ in 1:100_000
        write(f, join([randstring('a':'z') for _ in 1:8], ","))
        write(f, "\n")
    end
end
function read(rows)
    bla = 0
    for r in rows
        bla += hash(r.a)
        bla += hash(r.b)
        bla += hash(r.c)
        bla += hash(r.d)
        bla += hash(r.e)
        bla += hash(r.f)
        bla += hash(r.g)
        bla += hash(r.h)
    end
    bla
end
rows = CSV.Rows("test.csv", reusebuffer=true, header=Symbol.('a':'h'))
bench = @benchmarkable read($rows)
tune!(bench)
results = run(bench)
show(stdout, MIME"text/plain"(), results)