
Using SparseVector as a column

[Open] heinrichkraus opened this issue 4 years ago · 6 comments

I have been experimenting with DataFrames and similar data structures for my current project, but I stumbled upon an issue. I am working with data where some columns have only a few entries different from zero, so I have naturally been using SparseArrays.jl. However, there are some issues when using sparse vectors in DataFrames.

One example could be

df = DataFrame(x = rand(n), y = spzeros(n))

where some values in y are set to some nonzero values.

DataFrames lets me create the data structure as above; however, if I attempt to add rows, e.g. append!(df, df), I get an error since there is no resize! for sparse vectors.
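A quick way to see where append! breaks, using only the stdlib SparseArrays (this probes for the missing method rather than mutating a DataFrame; the vector length here is illustrative):

```julia
using SparseArrays

y = spzeros(3)
# append!(df, df) grows each column in place, which needs resize!;
# SparseVector has no such method, hence the MethodError.
applicable(resize!, y, 5)
```

The call returns false, which is the root cause of the error that append! surfaces.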

Now, my question is whether sparse vectors should be allowed at all, or whether they should be treated differently. This could be solved by extending the column df.y with something like

vcat(df.y, spzeros(eltype(df.y), n_newrows))

Of course, this would require zero to be implemented for eltype(df.y).
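A minimal sketch of that extension, using only the stdlib SparseArrays (the vector contents and n_newrows are illustrative):

```julia
using SparseArrays

# Grow a sparse column by n_newrows structural zeros.
y = sparsevec([2], [1.5], 4)                 # length-4 sparse vector, one nonzero
n_newrows = 3
y2 = vcat(y, spzeros(eltype(y), n_newrows))  # still a SparseVector
```

Since vcat of two SparseVectors returns a SparseVector, the single nonzero entry is preserved and only structural zeros are appended.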

heinrichkraus avatar May 22 '21 11:05 heinrichkraus

if it should be allowed to use sparse vectors at all

It is totally fine to use sparse vectors, as long as you only perform operations that sparse vectors support.

since there is no resize! for sparse vectors.

This is the point: you are trying to perform an operation (in-place mutation) on a vector that does not support it. If you use vcat instead of append!, everything will work:

julia> n = 2
2

julia> df = DataFrame(x = rand(n), y = spzeros(n))
2×2 DataFrame
 Row │ x         y       
     │ Float64   Float64 
─────┼───────────────────
   1 │ 0.978642      0.0
   2 │ 0.881843      0.0

julia> df2 = vcat(df, df)
4×2 DataFrame
 Row │ x         y       
     │ Float64   Float64 
─────┼───────────────────
   1 │ 0.978642      0.0
   2 │ 0.881843      0.0
   3 │ 0.978642      0.0
   4 │ 0.881843      0.0

julia> df2.y
4-element Vector{Float64}:
 0.0
 0.0
 0.0
 0.0

However, as you can see the vector is not sparse any more.
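For comparison, Base's vcat applied directly to the sparse columns does preserve sparsity (a quick check using only the stdlib SparseArrays):

```julia
using SparseArrays

y = spzeros(2)
yy = vcat(y, y)
yy isa SparseVector  # true: column-level vcat keeps the sparse representation
```

It is only the table-level vcat, which allocates fresh columns, that densifies the result.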

This is something that in theory could be changed, as we now use Tables.allocatecolumn. @nalimilan - do you think it is something we might want to change? (The change would be that in vcat we could use Base.vcat for columns that are present in all passed data frames.)

bkamins avatar May 22 '21 13:05 bkamins

This is something that in theory could be changed, as we now use Tables.allocatecolumn. @nalimilan - do you think it is something we might want to change? (The change would be that in vcat we could use Base.vcat for columns that are present in all passed data frames.)

Yeah, that would make sense. Actually, even when a column is missing from some of the data frames, it would make sense to take the types of the existing ones into account, but we'd need a vcat promotion mechanism for that.

nalimilan avatar May 22 '21 20:05 nalimilan

This is the point - you are trying to do an operation (in place mutation) of a vector that does not support it. If you use vcat instead of append! all will work: However, as you can see the vector is not sparse any more.

I hadn't thought about using vcat directly. Of course, it is a major problem if the sparse vector is converted to a dense vector. I believe TypedTables.jl handles it that way; however, their data structure has some other major flaws (e.g. no missing values). In a perfect scenario there would be no missing values at all; that way, type stability could be achieved in the columns, which would be important for my use case.

heinrichkraus avatar May 23 '21 13:05 heinrichkraus

Yes - as noted we will improve it.

I did not put a high priority on this fix (we have big refactoring of other parts of JuliaData ecosystem that we focus on now), but if it is really essential for you to have it quickly please let me know and I will try to squeeze this in.

bkamins avatar May 23 '21 19:05 bkamins

No, I did not want to put any pressure on you! It would certainly be a nice-to-have feature, but I can live with sparse vectors converting to dense vectors for the time being, keeping in mind that this is happening.

Thank you for the quick replies!

heinrichkraus avatar May 23 '21 20:05 heinrichkraus

Thank you - I just needed to get feedback on priorities (as you can imagine, we get a lot of requests and our bandwidth is limited). However, as you have probably noticed, we carefully manage all issues, so it will be handled at some point.

bkamins avatar May 23 '21 21:05 bkamins