DataAPI.jl icon indicating copy to clipboard operation
DataAPI.jl copied to clipboard

Confusing `levels` fallback

Open aplavin opened this issue 3 years ago • 10 comments

Looking at the concrete implementation of levels in CategoricalArrays, I see that this function conveniently extracts possible levels from both arrays and individual values:

julia> a = CategoricalArray(["abc", "def", "abc"])
3-element CategoricalArray{String,1,UInt32}:
 "abc"
 "def"
 "abc"

julia> x = a[1]
CategoricalValue{String, UInt32} "abc"

julia> levels(a)
2-element Vector{String}:
 "abc"
 "def"

julia> levels(x)
2-element Vector{String}:
 "abc"
 "def"

However, the fallback implemented in DataAPI itself is only correct for collections, not individual values:

# like unique(x), makes sense?
julia> levels(["abc", "def", "abc"])
2-element Vector{String}:
 "abc"
 "def"

# makes no sense:
julia> levels("abc")
3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

# especially confusing given that:
julia> x == "abc"
true

Maybe, the fallback shouldn't exist at all, and something like haslevels(x)::Bool added instead?

aplavin avatar Oct 28 '22 08:10 aplavin

is only correct for collections

The problem is different. The function is correct, but it is defined for all collections, e.g.:

julia> levels("abc")
3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

The problem is, that many collections, like "abc" or 1 are perceived by users as individual values not as collections.

The same problem is with unique.

What we maybe could do is changing that levels is defined for arrays by default (the problem is that it would be breaking).

bkamins avatar Oct 28 '22 08:10 bkamins

What we maybe could do is changing that levels is defined for arrays by default

There are also arrays commonly perceived as individual values - most notably, SVectors and FieldArrays from StaticArrays.

My point is that "levels of a collection" and "levels of a value" are fundamentally different functions, if talking about their generic versions, not specific levels(::CategoricalArray) and levels(::CategoricalValue). So, maybe the fallback shouldn't be defined at all? Is there any actual usage of it?

aplavin avatar Oct 28 '22 09:10 aplavin

Let me summarize the discussion, so that I am clear what problem we want to resolve (I think you raised it in Slack but I do not remember the details, and soon it will be lost on Slack). Current state is the following:

  • for collections levels recovers levels of a collection;
  • for CategoricalValue as a special case levels recovers levels from a source array of this categorical value.

In what cases the current behavior is problematic (I mean in practice; I understand your conceptual concerns so there is no need to comment on this).

bkamins avatar Oct 28 '22 10:10 bkamins

In what cases the current behavior is problematic

First, I saw the levels(::CategoricalValue) function. It returns the set of levels the value is chosen from, exactly what I needed. Then, I wanted to support both categorical and non-categorical values, and needed a way to retrieve levels only if they are defined for the value. So, I tried levels(any other value) such as levels("abc"). I expected nothing as the result, or at worst ("abc",), but definitely not ['a', 'b', 'c'].

So, currently it looks like levels is hardly useable with generic values - one needs to know in advance that the input is a CategoricalValue, otherwise levels(val) don't make much sense.

aplavin avatar Oct 31 '22 08:10 aplavin

one needs to know in advance that the input is a CategoricalValue, otherwise levels(val) don't make much sense.

I understand the current design the opposite way (I do not want to say that it should not be changed, but I want to express my current understanding of the design):

  • levels normally should be only called on a collection; you should not call it on a value that is not a collection
  • as a special case, if you know that you have a CategoricalValue you can call it to get levels of a parent of this value

So, you should not call levels on any value that is not a collection, unless you know beforehand that it is a CategoricalValue.

Therefore, what you essentially ask for is how to check that a non-collection is CategoricalValue without having CategoricalArrays.jl as a dependency. This could be doable (e.g. we could add some DataAPI.jl method to check for it). However, a normal way to do it would be to add dependency on CategoricalArrays.jl and just check if some value is a CategoricalValue.

An alternative would be to add something like parentlevels function, so that parentlevels would be defined for CategoricalValue as the same as levels, and would be nothing by default as you want. @nalimilan - what do you think?

bkamins avatar Oct 31 '22 08:10 bkamins

Yes, I do understand the levels-of-collection PoV, maybe it's more common indeed. I'm talking from the perspective of someone who only needed "parent levels" of a single value, and found that the levels method does exactly what's needed for CategoricalValues (but only for them).

aplavin avatar Oct 31 '22 09:10 aplavin

Agreed. That is why maybe adding parentlevels to be explicit makes sense. Let us wait for @nalimilan to respond what he things, as he designed these features.

bkamins avatar Oct 31 '22 09:10 bkamins

Yes, I think the best solution would be to add an API that allows checking whether a value or array is categorical, whose only implementations (currently) would be CategoricalArray and CategoricalValue.

It's true that the fact that the levels fallback works on any iterable value and can return weird results for strings and numbers is a bit confusing, but introducing parentlevels would be overkill IMO.

nalimilan avatar Nov 02 '22 14:11 nalimilan

And for the mean time having CategoricalArrays.jl as a dependency is a temporary solution.

bkamins avatar Nov 02 '22 15:11 bkamins