Filtering missing
I encountered unexpected behaviour when attempting to filter values of type Missing. I found a solution at https://discourse.julialang.org/t/query-jl-filtering-on-missing-data/14898. I suppose this issue is a feature request for documentation that clarifies missing values in Query.jl.
Anyhow, the case is as follows.
julia> using DataFrames, Query
julia> df = DataFrame(a=[1,2,3], b=[1,2,missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64⍰ │
├─────┼───────┼─────────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ missing │
Attempting to filter for rows without values missing.
julia> df |> @filter(_.b !== missing) |> DataFrame
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64⍰ │
├─────┼───────┼─────────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ missing │
# Expected behaviour.
julia> df[df.b .!== missing, :]
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64⍰ │
├─────┼───────┼────────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
Attempting to filter for rows with values missing.
julia> df |> @filter(_.b === missing) |> DataFrame
0×2 DataFrame
# Expected behaviour.
julia> df[df.b .=== missing, :]
1×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64⍰ │
├─────┼───────┼─────────┤
│ 1 │ 3 │ missing │
Using DataValues.jl's isna function solution provides the expected result.
df |> @filter(!Query.isna(_.b)) |> DataFrame
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64⍰ │
├─────┼───────┼────────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
julia> df |> @filter(Query.isna(_.b)) |> DataFrame
1×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64⍰ │
├─────┼───────┼─────────┤
│ 1 │ 3 │ missing │
You can also just compare with NA: @filter(_.b==NA) should also work.
I thought we reexported isna from Query? If not, we should.
Doc PRs are always very welcome!
Before having a go at adding something to the Doc, I have a confession and a few questions.
It was very natural for me to attempt a filter with the missing comparison as it wasn't immediately obvious to me that I was comparing against or performing operations on the DataValue container.
I'm having trouble discerning when DataValues become necessary in setting up an IterableTable and why plain NamedTuples are not sufficient. Would you please shed some light on this?
Also, what was the distinction such that the function ismissing was not included as a lifted function?
I thought we reexported isna from Query? If not, we should.
Currently the isna and NA are not re-exported from Query.jl.
(query_test) pkg> st
Project query_test v0.1.0
Status `~/Programming/Julia/query_test/Project.toml`
[a93c6f00] DataFrames v0.18.3
[1a8c2f83] Query v0.11.0
julia> using Query, DataFrames
[ Info: Recompiling stale cache file /home/username/.julia/compiled/v1.1/DataFrames/AR9oZ.ji for DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
julia> df = DataFrame(a=[1,2,3], b=[1,2,missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64⍰ │
├─────┼───────┼─────────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ missing │
julia> df |> @filter(isna(_.b)) |> DataFrame
ERROR: UndefVarError: isna not defined
Stacktrace:
[1] (::getfield(Main, Symbol("##8#10")))(::NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}}) at /home/username/.julia/packages/Query/UeHfO/src/query_translation.jl:58
[2] iterate(::Tables.IteratorWrapper{QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##8#10"))}}) at /home/username/.julia/packages/QueryOperators/PfROc/src/enumerable/enumerable_filter.jl:20
[3] iterate at ./iterators.jl:139 [inlined]
[4] iterate at ./iterators.jl:138 [inlined]
[5] buildcolumns at /home/username/.julia/packages/Tables/2uVGl/src/fallbacks.jl:92 [inlined]
[6] columns at /home/username/.julia/packages/Tables/2uVGl/src/fallbacks.jl:174 [inlined]
[7] #DataFrame#367(::Bool, ::Type, ::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##8#10"))}) at /home/username/.julia/packages/DataFrames/CZrca/src/other/tables.jl:34
[8] DataFrame(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##8#10"))}) at /home/username/.julia/packages/DataFrames/CZrca/src/other/tables.jl:25
[9] |>(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##8#10"))}, ::Type) at ./operators.jl:813
[10] top-level scope at none:0
julia> df |> @filter(_.b == NA) |> DataFrame
ERROR: UndefVarError: NA not defined
Stacktrace:
[1] (::getfield(Main, Symbol("##12#14")))(::NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}}) at /home/username/.julia/packages/Query/UeHfO/src/query_translation.jl:58
[2] iterate(::Tables.IteratorWrapper{QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##12#14"))}}) at /home/username/.julia/packages/QueryOperators/PfROc/src/enumerable/enumerable_filter.jl:20
[3] iterate at ./iterators.jl:139 [inlined]
[4] iterate at ./iterators.jl:138 [inlined]
[5] buildcolumns at /home/username/.julia/packages/Tables/2uVGl/src/fallbacks.jl:92 [inlined]
[6] columns at /home/username/.julia/packages/Tables/2uVGl/src/fallbacks.jl:174 [inlined]
[7] #DataFrame#367(::Bool, ::Type, ::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##12#14"))}) at /home/username/.julia/packages/DataFrames/CZrca/src/other/tables.jl:34
[8] DataFrame(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##12#14"))}) at /home/username/.julia/packages/DataFrames/CZrca/src/other/tables.jl:25
[9] |>(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,DataValues.DataValue{Int64}}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Union{Missing, Int64},1}}}}}},getfield(Main, Symbol("##12#14"))}, ::Type) at ./operators.jl:813
[10] top-level scope at none:0
Alright, I just tagged a new version that reexports isna and NA from Query.jl.
The current implementation/design of Missing does not work well with important cases that iterate named tuples: there are severe performance and semantic problems (I describe some of that here). Currently this affects both Query.jl and JuliaDB.jl. DataValue solves those problems, and that is why I'm using it throughout the Queryverse. JuliaDB also brought back support for DataValue after they had migrated to Missing because they couldn't get the performance they need with Missing.
Hopefully at some point there will be a missing implementation/design in base that works with all the use-cases we have in the ecosystem, but until that is the case we'll have to live with two distinct approaches.
I tried to filter out missing data with the ismissing function. If you are using Query with DataFrames, then it makes sense that you would compare with missing or ismissing, since the DataFrame prints missing data as missing instead of NA.
@jpfairbanks there are some very deep technical reasons why Query.jl can't use missing, the design of that doesn't work well with the core design of Query.jl. So the native missing value story in Query.jl is DataValues.jl, and any missing value gets converted into the DataValues.jl equivalent at the start of any query (and converted back to missing if you materialize into a DataFrame).
I ran into this exact same issue. I still don't understand why @filter(_.b !== missing) and @filter(ismissing(_.b) don't work. What's going on here?
Any missing value is converted into the equivalent DataValue representation at the beginning of a query, and when you materialize things back into a DataFrame, things get converted back to missing. But inside of any Query.jl query, missing values are handled by DataValue. The reason for that is that we can't get decent performance and some important semantics if this was based on missing.
The easiest way to implement filters like the one you mention here is to use the new @dropna command: df |> @dropna(:b) for example will drop all rows where there is a missing value in column b.
@davidanthoff How would I go about doing something a little bit more complicated though? For example, let's say I only want to keep rows that have a non-missing value in either column :a or column :b?
Then you would use isna inside @filter: df |> @filter(isna(_.a) || isna(_.b)).
I expected that skipmissing would filter out missing values like it does on DataFrame columns. For example:
df = DataFrame(a=[1,1,1], b=[1,2,missing])
mean(skipmissing(df.b))
returns 1.5. Whereas
df |> @groupby(_.a) |> @map({key=key(_), m=mean(skipmissing(_.b))}) |> DataFrame
still leaves the missing value in the output column.