CSVFiles.jl

File reading time

Open bkamins opened this issue 6 years ago • 3 comments

In https://github.com/bkamins/Julia-DataFrames-Tutorial/blob/master/04_loadsave.ipynb I had to disable the CSVFiles.jl file-reading tests, as it failed to load a small file (which otherwise reads in a few seconds) in any reasonable time.

The file being read has 500 columns and 500'000 rows, so it is relatively small.

@davidanthoff - do you think this issue is solvable?

bkamins, Aug 30 '19 18:08

I can replicate it. Has this worked better in the past and regressed? Or did this never work?

davidanthoff, Sep 09 '19 19:09

I cannot tell, as I increased the size of the tests only recently; the old ones were just too small to show anything meaningful.

bkamins, Sep 09 '19 19:09

I've stumbled upon this issue, so some comments for reference.

MWE

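using DataFrames, CSVFiles

# 10^5 x 500 table of Bool values
# (on DataFrames 1.0 and later, pass :auto as a second argument to DataFrame)
bigdf = DataFrame(rand(Bool, 10^5, 500))
# make the first three columns Int, Float64 and String
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")

# write the table to a CSV file next to this script
csvfileswrite1 = bigdf |> save(joinpath(@__DIR__, "bigdf2.csv"))

load(joinpath(@__DIR__, "bigdf2.csv")) |> DataFrame  # Here it fails to load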

I varied the number of rows and columns and got the following results:

# 10^2 x 50 = 973.310 μs (30338 allocations: 3.26 MiB)
# 10^2 x 500 = 58.562 ms (307888 allocations: 238.42 MiB)
# 10^3 x 500 = 605.530 ms (2551391 allocations: 2.21 GiB)
# 10^4 x 500 = 21.693 s (24961891 allocations: 22.03 GiB)

So the time is more or less linear in the number of rows (the nonlinear increase from 10^3 to 10^4 may be related to the fact that I ran out of memory and the OS started swapping).
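As a rough check of the scaling, the growth factors between consecutive measurements can be computed directly from the figures reported above. The snippet below is only a sketch: the `measurements` table and the `scaling_ratios` helper are names made up for this illustration, and the GiB values are converted to MiB assuming 1 GiB = 1024 MiB.

```julia
# Timings and allocated memory copied from the measurements above.
# Each entry: (rows, columns, seconds, MiB allocated).
measurements = [
    (10^2,  50, 973.310e-6, 3.26),
    (10^2, 500, 58.562e-3,  238.42),
    (10^3, 500, 605.530e-3, 2.21 * 1024),
    (10^4, 500, 21.693,     22.03 * 1024),
]

# Hypothetical helper: growth factors between each pair of consecutive rows
# of the table (number of cells, run time, allocated memory).
function scaling_ratios(ms)
    return [(size_factor   = (b[1] * b[2]) / (a[1] * a[2]),
             time_factor   = b[3] / a[3],
             memory_factor = b[4] / a[4])
            for (a, b) in zip(ms[1:end-1], ms[2:end])]
end

for r in scaling_ratios(measurements)
    println("size x", round(r.size_factor, digits = 1),
            "  time x", round(r.time_factor, digits = 1),
            "  memory x", round(r.memory_factor, digits = 1))
end
```

With these numbers, going from 50 to 500 columns at 10^2 rows costs roughly 60 times more time and roughly 73 times more allocated memory, while each tenfold increase in rows at 500 columns costs roughly 10 times more memory. That hints that the cost grows faster than linearly in the number of columns, although a single pair of data points is not conclusive.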

Profiling shows the following

bigdf = DataFrame(rand(Bool, 10^2, 500))
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")

csvfileswrite1 = bigdf |> save(joinpath(@__DIR__, "bigdf2.csv"))

load(joinpath(@__DIR__, "bigdf2.csv")) |> DataFrame

using Profile  # standard library; provides @profile and Profile.print
Profile.clear()
@profile load(joinpath(@__DIR__, "bigdf2.csv")) |> DataFrame
Profile.print(format = :flat, sortedby = :count)

omitting noise

 126 ./tuple.jl                                                            24 getindex                                               
   133 /home/skoffer/.julia/dev/TextParse/src/record.jl                      38 macro expansion                                        
   134 /home/skoffer/.julia/dev/TextParse/src/record.jl                      50 tryparsesetindex(::TextParse.Record{Tuple{TextParse....
   136 /home/skoffer/.julia/dev/TextParse/src/csv.jl                        337 #_csvread_internal#52(::Bool, ::Char, ::Char, ::Noth...
   136 /home/skoffer/.julia/dev/TextParse/src/csv.jl                        600 parsefill!(::TextParse.VectorBackedUTF8String, ::Tex...
   155 /home/skoffer/.julia/dev/TextParse/src/util.jl                        27 macro expansion                                        
   157 ./io.jl                                                              298 #open#271(::Base.Iterators.Pairs{Union{},Union{},Tup...
   157 ./io.jl                                                              296 open                                                   
   157 /home/skoffer/.julia/dev/TextParse/src/csv.jl                        116 (::TextParse.var"#38#40"{Base.Iterators.Pairs{Symbol...
   157 /home/skoffer/.julia/dev/TextParse/src/csv.jl                        113 #_csvread_f#36                                         
   157 /home/skoffer/.julia/dev/TextParse/src/csv.jl                         80 #csvread#16(::Base.Iterators.Pairs{Symbol,UnionAll,T...
   157 /home/skoffer/.julia/packages/CSVFiles/C68zw/src/CSVFiles.jl         103 _loaddata(::CSVFiles.CSVFile)                          
   157 /home/skoffer/.julia/packages/CSVFiles/C68zw/src/CSVFiles.jl         116 get_columns_copy_using_missing(::CSVFiles.CSVFile)     

It looks like the main problem is actually in TextParse, specifically in the tryparsesetindex function in record.jl.

I used the latest master version of TextParse, commit "8f9ac08ee110467ba43e52d3449c74ab34391f06".

Arkoniak, Feb 08 '20 19:02