CSV.jl

Performance regression since v0.8.0

Open schlurp opened this issue 3 years ago • 1 comments

Hi @quinnj ,

I think a performance regression similar to the one in #752 has happened again. Testing with v0.8.0, the version that contains the fix for that issue, I get:

```
(blu) pkg> st
      Status `/tmp/blu/Project.toml`
   [6e4b80f9] BenchmarkTools v1.3.2
   [336ed68f] CSV v0.8.0 `~/repos/CSV.jl`
   [9a3f8284] Random

julia> @benchmark read($rows)
BenchmarkTools.Trial: 72 samples with 1 evaluation.
 Range (min … max):  55.617 ms … 82.286 ms  ┊ GC (min … max): 0.00% … 1.79%
 Time  (median):     70.487 ms              ┊ GC (median):    2.00%
 Time  (mean ± σ):   69.935 ms ±  5.109 ms  ┊ GC (mean ± σ):  1.19% ± 1.05%

                      ▁▁  ▁ ▁       █ ▁▁▁▁                     
  ▄▁▁▁▁▁▁▄▁▁▁▁▄▁▁▄▇▄▄▄██▇▇█▄█▇▇▇▄▄▇▄█▇████▇▇▁▇▁▇▁▄▇▄▁▁▁▄▁▁▁▁▄ ▁
  55.6 ms         Histogram: frequency by time        82.1 ms <

 Memory estimate: 24.41 MiB, allocs estimate: 800000.
```

With the latest version, v0.10.9, the median is roughly 7.6 times slower and there are three times as many allocations:

```
(blu) pkg> st
      Status `/tmp/blu/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [336ed68f] CSV v0.10.9
  [9a3f8284] Random

julia> @benchmark read($rows)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  519.511 ms … 575.751 ms  ┊ GC (min … max): 2.39% … 1.39%
 Time  (median):     537.934 ms               ┊ GC (median):    1.90%
 Time  (mean ± σ):   537.647 ms ±  16.968 ms  ┊ GC (mean ± σ):  1.91% ± 0.42%

  ██  █   █         █  █ ██         █                         █  
  ██▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁█▁██▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  520 ms           Histogram: frequency by time          576 ms <

 Memory estimate: 109.86 MiB, allocs estimate: 2400000.
```

Profiling shows that most of the time in `getcolumn` is spent in calls to `coltype` (258 of 351 samples):

```
                                               351   100% |   getcolumn /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:319
         7  0.38% 79.55%        351 18.99%                | getcolumn(::CSV.Row2{Parsers.PosLen, PosLenString}, ::Int64) /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:320
                                               258 73.50% |   coltype(::CSV.Column) /home/domi/.julia/packages/CSV/b8ebJ/src/utils.jl:23
                                                45 12.82% |   jl_apply_generic /home/domi/software/julia/julia/src/gf.c:2425
                                                16  4.56% |   getcolumn(::CSV.Row2{Parsers.PosLen, PosLenString}, ::Type{Union{Missing, PosLenString}}, ::Int64, ::Symbol) /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:-1
                                                 3  0.85% |   coltype(::CSV.Column) /home/domi/.julia/packages/CSV/b8ebJ/src/utils.jl
```
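
For anyone who wants to collect a comparable profile, here is a minimal sketch using Julia's `Profile` standard library. It is not necessarily how the listing above was produced. `profile_flat` is a hypothetical helper name, and the workload shown is only a stand-in; with the benchmark script from this issue it would be `() -> read(rows)`.

```julia
using Profile

# Hypothetical helper (not part of CSV.jl): profile a zero-argument workload
# and print a flat report sorted by sample count.
function profile_flat(workload; mincount = 10)
    workload()          # warm-up run so compilation is not profiled
    Profile.clear()     # discard samples left over from earlier runs
    @profile workload()
    Profile.print(format = :flat, sortedby = :count, mincount = mincount)
end

# Stand-in workload so the sketch runs without CSV.jl installed.
# With the issue's script, pass `() -> read(rows)` instead.
profile_flat(() -> sum(hash(string(i)) for i in 1:1_000_000))
```

A flat report sorted by count makes a hot spot such as `coltype` easy to spot. Passing `format = :tree` instead shows which callers the samples come from.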

schlurp • Feb 28 '23 15:02

`git bisect` identifies commit bfd415d0af4c7c1842ecc5b54f1e4a18b125a264 (CSV parsing internals refactoring, #837) as the first bad commit:

```
git bisect start
# status: waiting for both good and bad commits
# bad: [cfb4ffb5d9847df4f8d5efa5129b27026a83a4a8] `typemap`: switch to IdDict (#1069)
git bisect bad cfb4ffb5d9847df4f8d5efa5129b27026a83a4a8
# status: waiting for good commit(s), bad commit known
# good: [c94256adbe6c2ae017be90ae91976f3b5bb74aa4] bump version
git bisect good c94256adbe6c2ae017be90ae91976f3b5bb74aa4
# bad: [549b1ab03155c8c96485406881addcc65ef914e8] Take dependency on InlineStrings package (#923)
git bisect bad 549b1ab03155c8c96485406881addcc65ef914e8
# good: [f405361298ac09692024c0afdf546df880899223] Bump version
git bisect good f405361298ac09692024c0afdf546df880899223
# bad: [ea9eca2de56470a8e585ae4ee92495a88a632fbb] bump version
git bisect bad ea9eca2de56470a8e585ae4ee92495a88a632fbb
# bad: [17bffa1a5e2bc570e913d1f0bac98e65e3aeb1e4] Overhaul CSV.jl docs (#869)
git bisect bad 17bffa1a5e2bc570e913d1f0bac98e65e3aeb1e4
# bad: [0eaacb4c4d5787a5186dac8edf523ee9052db27f] Keyword argument cleanup in preparation for 1.0 release (#846)
git bisect bad 0eaacb4c4d5787a5186dac8edf523ee9052db27f
# bad: [ffda8d35793eea4f254f51de213527e5ed55359a] Fix nightly
git bisect bad ffda8d35793eea4f254f51de213527e5ed55359a
# bad: [bfd415d0af4c7c1842ecc5b54f1e4a18b125a264] CSV parsing internals refactoring (#837)
git bisect bad bfd415d0af4c7c1842ecc5b54f1e4a18b125a264
# good: [fc209672d0a894954c5dc1e0835e0426b9d2925c] make "Edit on Github" points to main branch (#835)
git bisect good fc209672d0a894954c5dc1e0835e0426b9d2925c
# first bad commit: [bfd415d0af4c7c1842ecc5b54f1e4a18b125a264] CSV parsing internals refactoring (#837)
```

I tested each commit with the following script:

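```julia
using CSV
using BenchmarkTools
using Random
using Pkg

# Record which package versions are active for this run.
Pkg.resolve()
Pkg.status()

# Generate a reproducible test file: 100_000 rows of 8 random lowercase strings.
Random.seed!(0)
open("test.csv", "w") do f
    for _ in 1:100_000
        write(f, join([randstring('a':'z') for _ in 1:8], ","))
        write(f, "\n")
    end
end

# Touch every column of every row so that each field access is exercised.
function read(rows)
    bla = 0
    for r in rows
        bla += hash(r.a)
        bla += hash(r.b)
        bla += hash(r.c)
        bla += hash(r.d)
        bla += hash(r.e)
        bla += hash(r.f)
        bla += hash(r.g)
        bla += hash(r.h)
    end
    bla
end

# The file has no header row, so the column names a..h are supplied explicitly.
rows = CSV.Rows("test.csv", reusebuffer=true, header=Symbol.('a':'h'))
bench = @benchmarkable read($rows)
tune!(bench)
results = run(bench)
show(stdout, MIME"text/plain"(), results)
```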

schlurp • Feb 28 '23 16:02