InMemoryDatasets.jl

maximum of column with missing

Open sprmnt21 opened this issue 3 years ago • 3 comments

While following some examples from the tutorial, I got different outputs than expected (as shown in the documentation).

julia> ds = Dataset(g = [2, 1, 1, 2, 2],
                    x1_int = [0, 0, 1, missing, 2],
                    x2_int = [3, 2, 1, 3, -2],
                    x1_float = [1.2, missing, -1.0, 2.3, 10],
                    x2_float = [missing, missing, 3.0, missing, missing],
                    x3_float = [missing, missing, -1.4, 3.0, -100.0])
5×6 Dataset
 Row │ g         x1_int    x2_int    x1_float   x2_float   x3_float
     │ identity  identity  identity  identity   identity   identity
     │ Int64?    Int64?    Int64?    Float64?   Float64?   Float64?
─────┼───────────────────────────────────────────────────────────────
   1 │        2         0         3        1.2  missing    missing
   2 │        1         0         2  missing    missing    missing
   3 │        1         1         1       -1.0        3.0       -1.4
   4 │        2   missing         3        2.3  missing          3.0
   5 │        2         2        -2       10.0  missing       -100.0

julia> groupby!(ds, 1)
5×6 Grouped Dataset with 2 groups
Grouped by: g
 Row │ g         x1_int    x2_int    x1_float   x2_float   x3_float  
     │ identity  identity  identity  identity   identity   identity  
     │ Int64?    Int64?    Int64?    Float64?   Float64?   Float64?  
─────┼───────────────────────────────────────────────────────────────
   1 │        1         0         2  missing    missing    missing   
   2 │        1         1         1       -1.0        3.0       -1.4
   3 │        2         0         3        1.2  missing    missing   
   4 │        2   missing         3        2.3  missing          3.0
   5 │        2         2        -2       10.0  missing       -100.0

julia> modify(ds, r"int" => x -> x .- maximum(x))
5×6 Grouped Dataset with 2 groups
Grouped by: g
 Row │ g         x1_int    x2_int    x1_float   x2_float   x3_float  
     │ identity  identity  identity  identity   identity   identity  
     │ Int64?    Int64?    Int64?    Float64?   Float64?   Float64?  
─────┼───────────────────────────────────────────────────────────────
   1 │        1        -1         0  missing    missing    missing   
   2 │        1         0        -1       -1.0        3.0       -1.4
   3 │        2   missing         0        1.2  missing    missing
   4 │        2   missing         0        2.3  missing          3.0
   5 │        2   missing        -5       10.0  missing       -100.0

julia> combine(ds, :x1_int => x -> maximum(x))
2×2 Dataset
 Row │ g         function_x1_int 
     │ identity  identity
     │ Int64?    Int64?
─────┼───────────────────────────
   1 │        1                1
   2 │        2          missing

The behavior does not appear to be tied to grouping; the same result shows up after ungrouping:

julia> ungroup!(ds)
5×6 Sorted Dataset
 Sorted by: g
 Row │ g         x1_int    x2_int    x1_float   x2_float   x3_float  
     │ identity  identity  identity  identity   identity   identity
     │ Int64?    Int64?    Int64?    Float64?   Float64?   Float64?
─────┼───────────────────────────────────────────────────────────────
   1 │        1         0         2  missing    missing    missing
   2 │        1         1         1       -1.0        3.0       -1.4
   3 │        2         0         3        1.2  missing    missing
   4 │        2   missing         3        2.3  missing          3.0
   5 │        2         2        -2       10.0  missing       -100.0

julia> combine(ds, :x1_int => x -> maximum(x))
1×1 Dataset
 Row │ function_x1_int 
     │ identity
     │ Int64?
─────┼─────────────────
   1 │         missing

My status

(v1.7) pkg> status
      Status `C:\Users\sprmn\.julia\v1.7\Project.toml`
  [8be319e6] Chain v0.4.10
  [35d6a980] ColorSchemes v3.17.1
  [5ae59095] Colors v0.12.8
  [f7bf1975] Impute v0.6.8
  [5c01b14b] InMemoryDatasets v0.6.10
  [8197267c] IntervalSets v0.6.0
  [c8e1da08] IterTools v1.4.0
  [08abe8d2] PrettyTables v1.3.1
  [2913bbd2] StatsBase v0.33.16
  [bd369af6] Tables v1.7.0

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

sprmnt21 • Apr 03 '22 09:04

The maximum function returns missing when any value in the column is missing. Change maximum to IMD.maximum to skip missing values automatically.
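
A minimal sketch of the difference, using only Base Julia so it runs without InMemoryDatasets. The vector `x` is a made-up stand-in for the x1_int values of group 2 in the example above, and `skipmissing` is used here as an approximation of the skipping behavior attributed to IMD.maximum; it is not the package's own implementation.

```julia
# Hypothetical stand-in for the x1_int values where g == 2.
x = [0, missing, 2]

# Base.maximum propagates missing: a single missing value makes the result missing.
m_plain = maximum(x)

# skipmissing drops the missing values before the reduction runs.
m_skip = maximum(skipmissing(x))

# Centering on the skipped maximum stays missing only where the input was missing.
centered = x .- m_skip

println(m_plain)
println(m_skip)
println(centered)
```

With this input, `m_plain` is missing and `m_skip` is 2, which matches the missing results in the outputs above. Note that `maximum(skipmissing(x))` raises an error when every value is missing, so a column such as x2_float would need separate handling in this Base-only form.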

sl-solution • Apr 03 '22 09:04

Then the issue is in the documentation at https://docs.juliahub.com/InMemoryDatasets/cS87e/0.4.0/man/grouping/, which, as I notice only now, covers an old version of IMD.

sprmnt21 • Apr 03 '22 11:04

I see. Previously we were overriding the Base functions; however, this has been fixed since v0.6.10.

sl-solution • Apr 03 '22 11:04