[DataFrame] DataFrame operations "obj" return type
Given a DataFrameColumn containing numeric data, performing operations like Sum returns values of type obj. This causes issues when trying to use the result in other contexts or operations.
Example:
Suppose I want to manually calculate the sum of a numeric column and divide it by 3:
df.["PetalLength"].Sum() / 3
The code fails to compile with the error - The type 'int' does not match the type 'obj'.
The first thought is to cast:
((float32)df.["PetalLength"].Sum())
This approach also fails, with a typecheck error - The type 'obj' does not support a conversion to the type 'float32'.
The two ways that actually work are the following:
Using the casting operator:
(df.["PetalLength"].Sum() :?> float32) / 3.f
or unboxing:
df.["PetalLength"].Sum() |> unbox<float32>
In a sense it's not really a bug, but the ability to provide the return type and avoid casting would make it cleaner and simpler for the user, i.e.:
df.["PetalLength"].Sum<float32>()
This is the 2nd time a request to infer the return type has come up this week, so I'll consider different approaches here. At a high level, it feels like having strongly typed column fields based on the current schema would solve this issue, dotnet/machinelearning#5684, and the F#-specific request from dotnet/machinelearning#5670.
One idea I've been thinking about here is generating fields at runtime using reflection. Something along the lines of DataFrame inferredDataFrame = DataFrame.LoadCsv(...);, where we create properties on the returned DataFrame for each column in the csv file. Then code such as inferredDataFrame.PetalLength could return a PrimitiveDataFrameColumn<float>, and subsequent ops on it such as Sum would return float. Not sure if this is even possible yet, so I'll prototype it next week :)
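As a rough illustration of the runtime-member idea (a sketch only: `DynamicDataFrame` is a hypothetical name, and a `dynamic`-based approach resolves members at runtime rather than generating true compile-time properties):

```csharp
using System.Dynamic;
using Microsoft.Data.Analysis;

// Hypothetical sketch: expose each column of a DataFrame as a dynamic member.
// Member lookups resolve at runtime to the underlying column object.
public class DynamicDataFrame : DynamicObject
{
    private readonly DataFrame _df;
    public DynamicDataFrame(DataFrame df) => _df = df;

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        // e.g. df.PetalLength resolves to the column named "PetalLength"
        result = _df.Columns[binder.Name];
        return true;
    }
}

// Usage:
// dynamic df = new DynamicDataFrame(DataFrame.LoadCsv("iris.csv"));
// var petalLength = df.PetalLength;
```

Note that `dynamic` dispatch alone would not improve IntelliSense, since members are resolved at runtime; genuinely generated properties (e.g. via the Jupyter extension discussed below in the thread) would be needed for that.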
@pgovind Sounds good!
Thirded! Please, oh please, add some typing!
Firstly, its absence makes API discoverability very poor. For example, if I write
var col = new PrimitiveDataFrameColumn<int>("column of ints");
var max = col.Max();
it's entirely unclear what I'm getting back when max is an object. Maybe I'm actually getting back some kind of structured object that represents the cell rather than the value within it?
FWIW, I'm less interested in using DataFrames for Jupyter than for desktop applications so having a decent intellisense experience is pretty important.
Secondly, the lack of typing just results in increasingly ugly code with unnecessary casts or Select operations.
FWIW I wrote a DataFrame-like library for internal use. The operation to fetch a column was:
var columnOfT = frame.GetColumn<T>("name");
Although it's slightly annoying to have to specify the type in these kinds of indexing operations, it does have the advantage that subsequent accesses to the data can use type inference.
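Against the current Microsoft.Data.Analysis API, that pattern could be sketched as an extension method (a hypothetical helper, not part of the library):

```csharp
using Microsoft.Data.Analysis;

// Hypothetical sketch of the typed accessor described above: fetch a column
// by name and downcast it once, so later code gets full type inference.
public static class FrameExtensions
{
    public static PrimitiveDataFrameColumn<T> GetColumn<T>(this DataFrame frame, string name)
        where T : unmanaged
    {
        // Throws InvalidCastException if the column's element type is not T.
        return (PrimitiveDataFrameColumn<T>)frame.Columns[name];
    }
}

// Usage: subsequent accesses are strongly typed.
// var ages = frame.GetColumn<int>("ages");
// int? first = ages[0];   // element type is int?, no casts from object
```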
Whilst I'm wishing for things... it would be highly desirable to be able to specify columns that can't contain NULLS and to have operations that can 'clean' null cells. E.g. I'd really like to do something like...
var frame = DataFrame.LoadCsv("someCsvWithEmptyCells.csv");
var sanitisedAges = frame
    .Columns<int>("ages")
    .ReplaceNull(0);
//The idea here is that instead of an age column that may contain NULLs,
//we've replaced them with a specific value AND generated a column
//whose type and semantics no longer allow the admission of NULL. I.e. the type
//of sanitisedAges is something like NonNullablePrimitiveDataFrameColumn<int>
var ages = sanitisedAges.ToArray();
//results in an array of ints whereas the same operation on the original
//PrimitiveDataFrameColumn<int> object would have resulted in an array of int?s
In your specific example, max would be an int. We only lose type information when APIs are called on the base DataFrameColumn objects. We're working on a couple of ways to improve this at the moment though:
- https://github.com/eerhardt/DotNetInteractiveExtension/pull/25 is enabling an extension that would make properties that return strongly typed columns on a `DataFrame` in Jupyter. So, now something like `df.Price` would return a `PrimitiveDataFrameColumn<float>`, as opposed to `df["Price"]`, which would return a weakly typed `DataFrameColumn`.
- Similar to what you suggested, https://github.com/dotnet/corefxlab/pull/2827 adds the following APIs on `DataFrame`: `GetPrimitiveDataFrameColumn<T>(ColumnName)`, `GetStringDataFrameColumn(ColumnName)`, `GetArrowStringDataFrameColumn(ColumnName)`
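If the second PR lands with the quoted names, usage might look like the following (a sketch of a proposed API, not a released one; file and column names are made up for illustration):

```csharp
using Microsoft.Data.Analysis;

// Sketch: the proposed typed accessors would avoid the weakly typed indexer.
DataFrame df = DataFrame.LoadCsv("iris.csv");

// Proposed API from the PR (names as quoted above): returns
// PrimitiveDataFrameColumn<float>, so later operations can stay typed.
var petalLength = df.GetPrimitiveDataFrameColumn<float>("PetalLength");
var species = df.GetStringDataFrameColumn("Species");
```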
I think this solves most of our type inference problems: option 1 for Jupyter and option 2 for desktop users.
Also, have you looked at FillNulls? It exists on all the columns types and replaces null values with a specified value. It returns the same column type though, so the resulting column still has the ability to contain nulls. Out of curiosity, do you have examples of when a NonNullablePrimitiveDataFrameColumn<int> would be useful?
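For example (a minimal sketch; FillNulls takes the replacement value and an optional inPlace flag):

```csharp
using Microsoft.Data.Analysis;

// Sketch: replace nulls in an int column with 0.
var ages = new PrimitiveDataFrameColumn<int>("ages", new int?[] { 21, null, 35 });

// FillNulls returns a new column by default (pass inPlace: true to mutate).
// As noted above, the result is still a column type that can admit nulls.
var filled = ages.FillNulls(0);
```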
Thanks Prashanth. :-) It's possible things have moved on in the API - I'm using the current nuget release (0.2.0). In that release, Max and similar functions on a PrimitiveDataFrameColumn return object.
> Also, have you looked at FillNulls?

Thanks - I'd missed that - looks very useful.

> Out of curiosity, do you have examples of when a NonNullablePrimitiveDataFrameColumn<int> would be useful?
I can't claim it's a killer requirement :-) It mainly derives from wanting to avoid having to write all my filter/mutation operations as functions accepting a nullable type (I find all those '?'s a bit ugly) and a general desire to separate the processing pipeline into a 'cleanup' phase where nulls are dropped/replaced with meaningful data and a 'processing' phase where special-cases (nulls) can be ignored. I'm probably being a bit over-fastidious on this though!
Why not just `DataFrame<T>` where T can be any primitive System type or even an arbitrary user-supplied type?
This stems from a desire to support the Apache Arrow format. The Arrow format lays out the memory for different primitive types and going from Arrow -> DataFrame or vice-versa is zero-copy. It also comes with the advantage that we can support hardware intrinsics much better in the future.
This has been one of my biggest annoyances when working with DataFrames and I'd love to see it fixed. Currently, I have to unbox every value with a cast. It's not clear to me why this is necessary when the data type of the value is stored in the DataFrameColumn, which should be able to do the unboxing for me.
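A minimal C# sketch of the unboxing this forces today (column name and contents made up):

```csharp
using Microsoft.Data.Analysis;

// Sum() is declared on the base DataFrameColumn and returns object,
// so each use of the result needs an unboxing cast.
var prices = new PrimitiveDataFrameColumn<float>("price", new float?[] { 1.5f, 2.5f });

float total = (float)prices.Sum();   // unbox from object
float mean = total / prices.Length;  // only now can we do arithmetic
```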
It seems like the two PRs have stalled, so is there any progress on this? If there are problems with Apache Arrow support, would it be possible to focus on native types first?