Confusing `levels` fallback
Looking at the concrete implementation of levels in CategoricalArrays, I see that this function conveniently extracts possible levels from both arrays and individual values:
julia> a = CategoricalArray(["abc", "def", "abc"])
3-element CategoricalArray{String,1,UInt32}:
"abc"
"def"
"abc"
julia> x = a[1]
CategoricalValue{String, UInt32} "abc"
julia> levels(a)
2-element Vector{String}:
"abc"
"def"
julia> levels(x)
2-element Vector{String}:
"abc"
"def"
However, the fallback implemented in DataAPI itself is only correct for collections, not individual values:
# like unique(x), makes sense?
julia> levels(["abc", "def", "abc"])
2-element Vector{String}:
"abc"
"def"
# makes no sense:
julia> levels("abc")
3-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
# especially confusing given that:
julia> x == "abc"
true
Maybe, the fallback shouldn't exist at all, and something like haslevels(x)::Bool added instead?
is only correct for collections
The problem is different. The function is correct, but it is defined for all collections, e.g.:
julia> levels("abc")
3-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
The problem is, that many collections, like "abc" or 1 are perceived by users as individual values not as collections.
The same problem is with unique.
What we maybe could do is changing that levels is defined for arrays by default (the problem is that it would be breaking).
What we maybe could do is changing that levels is defined for arrays by default
There are also arrays commonly perceived as individual values - most notably, SVectors and FieldArrays from StaticArrays.
My point is that "levels of a collection" and "levels of a value" are fundamentally different functions, if talking about their generic versions, not specific levels(::CategoricalArray) and levels(::CategoricalValue). So, maybe the fallback shouldn't be defined at all? Is there any actual usage of it?
Let me summarize the discussion, so that I am clear what problem we want to resolve (I think you raised it in Slack but I do not remember the details, and soon it will be lost on Slack). Current state is the following:
- for collections
levelsrecovers levels of a collection; - for
CategoricalValueas a special caselevelsrecoverslevelsfrom a source array of this categorical value.
In what cases the current behavior is problematic (I mean in practice; I understand your conceptual concerns so there is no need to comment on this).
In what cases the current behavior is problematic
First, I saw the levels(::CategoricalValue) function. It returns the set of levels the value is chosen from, exactly what I needed.
Then, I wanted to support both categorical and non-categorical values, and needed a way to retrieve levels only if they are defined for the value. So, I tried levels(any other value) such as levels("abc"). I expected nothing as the result, or at worst ("abc",), but definitely not ['a', 'b', 'c'].
So, currently it looks like levels is hardly useable with generic values - one needs to know in advance that the input is a CategoricalValue, otherwise levels(val) don't make much sense.
one needs to know in advance that the input is a
CategoricalValue, otherwiselevels(val)don't make much sense.
I understand the current design the opposite way (I do not want to say that it should not be changed, but I want to express my current understanding of the design):
-
levelsnormally should be only called on a collection; you should not call it on a value that is not a collection - as a special case, if you know that you have a
CategoricalValueyou can call it to getlevelsof a parent of this value
So, you should not call levels on any value that is not a collection, unless you know beforehand that it is a CategoricalValue.
Therefore, what you essentially ask for is how to check that a non-collection is CategoricalValue without having CategoricalArrays.jl as a dependency. This could be doable (e.g. we could add some DataAPI.jl method to check for it). However, a normal way to do it would be to add dependency on CategoricalArrays.jl and just check if some value is a CategoricalValue.
An alternative would be to add something like parentlevels function, so that parentlevels would be defined for CategoricalValue as the same as levels, and would be nothing by default as you want. @nalimilan - what do you think?
Yes, I do understand the levels-of-collection PoV, maybe it's more common indeed. I'm talking from the perspective of someone who only needed "parent levels" of a single value, and found that the levels method does exactly what's needed for CategoricalValues (but only for them).
Agreed. That is why maybe adding parentlevels to be explicit makes sense. Let us wait for @nalimilan to respond what he things, as he designed these features.
Yes, I think the best solution would be to add an API that allows checking whether a value or array is categorical, whose only implementations (currently) would be CategoricalArray and CategoricalValue.
It's true that the fact that the levels fallback works on any iterable value and can return weird results for strings and numbers is a bit confusing, but introducing parentlevels would be overkill IMO.
And for the mean time having CategoricalArrays.jl as a dependency is a temporary solution.