Query.jl icon indicating copy to clipboard operation
Query.jl copied to clipboard

Filtering missing

Open CiaranOMara opened this issue 6 years ago • 10 comments

I encountered unexpected behaviour when attempting to filter values of type Missing. I found a solution at https://discourse.julialang.org/t/query-jl-filtering-on-missing-data/14898. I suppose this issue is a feature request for documentation that clarifies missing values in Query.jl.

Anyhow, the case is as follows.

julia> using DataFrames, Query

julia> df = DataFrame(a=[1,2,3], b=[1,2,missing])
3×2 DataFrame
│ Row │ a     │ b       │
│     │ Int64 │ Int64⍰  │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 1       │
│ 2   │ 2     │ 2       │
│ 3   │ 3     │ missing │

Attempting to filter for rows without values missing.

julia> df |> @filter(_.b !== missing) |> DataFrame
3×2 DataFrame
│ Row │ a     │ b       │
│     │ Int64 │ Int64⍰  │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 1       │
│ 2   │ 2     │ 2       │
│ 3   │ 3     │ missing │

# Expected behaviour.
julia> df[df.b .!== missing, :]
2×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ Int64⍰ │
├─────┼───────┼────────┤
│ 1   │ 1     │ 1      │
│ 2   │ 2     │ 2      │

Attempting to filter for rows with values missing.

julia> df |> @filter(_.b === missing) |> DataFrame
0×2 DataFrame

# Expected behaviour.
julia> df[df.b .=== missing, :]
1×2 DataFrame
│ Row │ a     │ b       │
│     │ Int64 │ Int64⍰  │
├─────┼───────┼─────────┤
│ 1   │ 3     │ missing │

Using DataValues.jl's isna function solution provides the expected result.

df |> @filter(!Query.isna(_.b)) |> DataFrame
2×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ Int64⍰ │
├─────┼───────┼────────┤
│ 1   │ 1     │ 1      │
│ 2   │ 2     │ 2      │

julia> df |> @filter(Query.isna(_.b)) |> DataFrame
1×2 DataFrame
│ Row │ a     │ b       │
│     │ Int64 │ Int64⍰  │
├─────┼───────┼─────────┤
│ 1   │ 3     │ missing │

CiaranOMara avatar Jun 27 '19 04:06 CiaranOMara

You can also just compare with NA: @filter(_.b==NA) should also work.

I thought we reexported isna from Query? If not, we should.

Doc PRs are always very welcome!

davidanthoff avatar Jun 27 '19 08:06 davidanthoff

Before having a go at adding something to the Doc, I have a confession and a few questions.

It was very natural for me to attempt a filter with the missing comparison as it wasn't immediately obvious to me that I was comparing against or performing operations on the DataValue container.

I'm having trouble discerning when DataValues become necessary in setting up an IterableTable and why plain NamedTuples are not sufficient. Would you please shed some light on this?

Also, what was the distinction such that the function ismissing was not included as a lifted function?

I thought we reexported isna from Query? If not, we should.

Currently the isna and NA are not re-exported from Query.jl.

(query_test) pkg> st
Project query_test v0.1.0
    Status `~/Programming/Julia/query_test/Project.toml`
  [a93c6f00] DataFrames v0.18.3
  [1a8c2f83] Query v0.11.0

julia> using Query, DataFrames
[ Info: Recompiling stale cache file /home/username/.julia/compiled/v1.1/DataFrames/AR9oZ.ji for DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]

julia> df = DataFrame(a=[1,2,3], b=[1,2,missing])
3×2 DataFrame
│ Row │ a     │ b       │
│     │ Int64 │ Int64⍰  │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 1       │
│ 2   │ 2     │ 2       │
│ 3   │ 3     │ missing │

julia> df |> @filter(isna(_.b)) |> DataFrame
ERROR: UndefVarError: isna not defined
Stacktrace:
 [1] (::getfield(Main, Symbol("##8#10")))(::NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}}) at /home/username/.julia/packages/Query/UeHfO/src/query_translation.jl:58
 [2] iterate(::Tables.IteratorWrapper{QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##8#10"))}}) at /home/username/.julia/packages/QueryOperators/PfROc/src/enumerable/enumerable_filter.jl:20
 [3] iterate at ./iterators.jl:139 [inlined]
 [4] iterate at ./iterators.jl:138 [inlined]
 [5] buildcolumns at /home/username/.julia/packages/Tables/2uVGl/src/fallbacks.jl:92 [inlined]
 [6] columns at /home/username/.julia/packages/Tables/2uVGl/src/fallbacks.jl:174 [inlined]
 [7] #DataFrame#367(::Bool, ::Type, ::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##8#10"))}) at /home/username/.julia/packages/DataFrames/CZrca/src/other/tables.jl:34
 [8] DataFrame(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##8#10"))}) at /home/username/.julia/packages/DataFrames/CZrca/src/other/tables.jl:25
 [9] |>(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##8#10"))}, ::Type) at ./operators.jl:813
 [10] top-level scope at none:0

julia> df |> @filter(_.b == NA) |> DataFrame
ERROR: UndefVarError: NA not defined
Stacktrace:
 [1] (::getfield(Main, Symbol("##12#14")))(::NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}}) at /home/username/.julia/packages/Query/UeHfO/src/query_translation.jl:58
 [2] iterate(::Tables.IteratorWrapper{QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##12#14"))}}) at /home/username/.julia/packages/QueryOperators/PfROc/src/enumerable/enumerable_filter.jl:20
 [3] iterate at ./iterators.jl:139 [inlined]
 [4] iterate at ./iterators.jl:138 [inlined]
 [5] buildcolumns at /home/username/.julia/packages/Tables/2uVGl/src/fallbacks.jl:92 [inlined]
 [6] columns at /home/username/.julia/packages/Tables/2uVGl/src/fallbacks.jl:174 [inlined]
 [7] #DataFrame#367(::Bool, ::Type, ::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##12#14"))}) at /home/username/.julia/packages/DataFrames/CZrca/src/other/tables.jl:34
 [8] DataFrame(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##12#14"))}) at /home/username/.julia/packages/DataFrames/CZrca/src/other/tables.jl:25
 [9] |>(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##12#14"))}, ::Type) at ./operators.jl:813
 [10] top-level scope at none:0

CiaranOMara avatar Jun 28 '19 02:06 CiaranOMara

Alright, I just tagged a new version that reexports isna and NA from Query.jl.

The current implementation/design of Missing does not work well with important cases that iterate named tuples: there are severe performance and semantic problems (I describe some of that here). Currently this affects both Query.jl and JuliaDB.jl. DataValue solves those problems, and that is why I'm using it throughout the Queryverse. JuliaDB also brought back support for DataValue after they had migrated to Missing because they couldn't get the performance they need with Missing.

Hopefully at some point there will be a missing implementation/design in base that works with all the use-cases we have in the ecosystem, but until that is the case we'll have to live with two distinct approaches.

davidanthoff avatar Jun 28 '19 09:06 davidanthoff

I tried to filter out missing data with the ismissing function. If you are using Query with DataFrames, then it makes sense that you would compare with missing or ismissing, since the DataFrame prints missing data as missing instead of NA.

jpfairbanks avatar Apr 05 '20 20:04 jpfairbanks

@jpfairbanks there are some very deep technical reasons why Query.jl can't use missing, the design of that doesn't work well with the core design of Query.jl. So the native missing value story in Query.jl is DataValues.jl, and any missing value gets converted into the DataValues.jl equivalent at the start of any query (and converted back to missing if you materialize into a DataFrame).

davidanthoff avatar Apr 07 '20 20:04 davidanthoff

I ran into this exact same issue. I still don't understand why @filter(_.b !== missing) and @filter(ismissing(_.b) don't work. What's going on here?

samuela avatar Aug 07 '20 23:08 samuela

Any missing value is converted into the equivalent DataValue representation at the beginning of a query, and when you materialize things back into a DataFrame, things get converted back to missing. But inside of any Query.jl query, missing values are handled by DataValue. The reason for that is that we can't get decent performance and some important semantics if this was based on missing.

The easiest way to implement filters like the one you mention here is to use the new @dropna command: df |> @dropna(:b) for example will drop all rows where there is a missing value in column b.

davidanthoff avatar Aug 07 '20 23:08 davidanthoff

@davidanthoff How would I go about doing something a little bit more complicated though? For example, let's say I only want to keep rows that have a non-missing value in either column :a or column :b?

samuela avatar Aug 07 '20 23:08 samuela

Then you would use isna inside @filter: df |> @filter(isna(_.a) || isna(_.b)).

davidanthoff avatar Aug 08 '20 00:08 davidanthoff

I expected that skipmissing would filter out missing values like it does on DataFrame columns. For example:

df = DataFrame(a=[1,1,1], b=[1,2,missing])
mean(skipmissing(df.b))

returns 1.5. Whereas

df |> @groupby(_.a) |> @map({key=key(_), m=mean(skipmissing(_.b))}) |> DataFrame

still leaves the missing value in the output column.

jurta avatar Feb 11 '21 16:02 jurta