Handle null/empty values better for training and consumption
System Information (please complete the following information):
- Model Builder Version: ml.net CLI 16.1.1
- Visual Studio Version: 8.7.6 for Mac
Describe the bug: I tried to change the model input to use nullable types to fit my data set, but ConsumeModel throws this error:
"System.ArgumentOutOfRangeException has been thrown, "Could not determine an IDataView type for member Flash_point (Parameter 'rawType')""
My inputs have some missing values (empty cells, which are NOT zero).
To Reproduce Steps to reproduce the behavior:
- Train a model on data with missing values
- Try to run new data with missing values through the model
- Change the ModelInput float properties to float?
- See error
Expected behavior: Nullable types should be handled, since my data includes some empty cells.
Screenshots NA
Additional context NA
Thanks for reporting this issue! I believe this is a limitation of ML.NET's PredictionEngine, but I'll follow up with the ML.NET team to confirm.
Internal to Model Builder we use the following code for our Evaluate tab ... I think it might be helpful for predicting with some missing values. I tried to grab all the relevant code ... please report back if something is missing. I'll work on simplifying this code next week.
public async Task<KeyValuePair<string, float>> PredictRegressionAsync(string modelPath, IDictionary<string, object> values, string labelColumnName, string scoreColumnName = "Score")
{
    return await Task.Run(() =>
    {
        (var model, var schema) = this.LoadModelAndInputSchemaFromFile(modelPath);

        // Build a single-row IDataView from the dictionary instead of a typed ModelInput.
        var inputDataView = this.IDictionaryToIDataView(values, schema);
        var resultDataView = model.Transform(inputDataView);
        var score = this.ExtractColumnListFromIDataView<float>(resultDataView, scoreColumnName).First();
        return new KeyValuePair<string, float>(labelColumnName, score);
    });
}

private (ITransformer, DataViewSchema) LoadModelAndInputSchemaFromFile(string modelPath)
{
    // Reuse a previously loaded model for the same path.
    if (this.modelCache.ContainsKey(modelPath))
    {
        return this.modelCache[modelPath];
    }

    var model = this.mlContext.Model.Load(modelPath, out var schema);
    this.modelCache.Add(modelPath, (model, schema));
    return this.modelCache[modelPath];
}

private IDataView IDictionaryToIDataView(IDictionary<string, object> dictionary, DataViewSchema schema)
{
    // Only schema columns that have a matching dictionary key become DataFrame columns.
    return new DataFrame(schema.AsEnumerable().Where(x => dictionary.ContainsKey(x.Name)).Select(x => x.ToDataFrameColumn(dictionary[x.Name])));
}

private IList<T> ExtractColumnListFromIDataView<T>(IDataView dataView, string columnName)
{
    var column = dataView.Schema[columnName];
    return this.GetColumnValueAsList<T>(dataView, column);
}

private List<T> GetColumnValueAsList<T>(IDataView res, DataViewSchema.Column column)
{
    return res.GetColumn<T>(column).ToList();
}
Attached is a simplified version. There is still room for more simplification but you should just need to call into PredictRegression with the ModelPath, Key Value pairs in a dictionary representing the name and value for inputs, and then specify the label column (Column to predict).
Simplified Version:
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Data.Analysis;
using Microsoft.ML;
using Microsoft.ML.Data;

public class ConsumeModel
{
    public static KeyValuePair<string, float> PredictRegression(string modelPath, IDictionary<string, string> values, string labelColumnName, string scoreColumnName = "Score")
    {
        (var model, var schema) = ConsumeModel.LoadModelAndInputSchemaFromFile(modelPath);

        // Build a single-row IDataView from the dictionary instead of a typed ModelInput.
        var inputDataView = ConsumeModel.IDictionaryToIDataView(values, schema);
        var resultDataView = model.Transform(inputDataView);
        var score = ConsumeModel.ExtractColumnListFromIDataView<float>(resultDataView, scoreColumnName).First();
        return new KeyValuePair<string, float>(labelColumnName, score);
    }

    private static (ITransformer, DataViewSchema) LoadModelAndInputSchemaFromFile(string modelPath)
    {
        MLContext mlContext = new MLContext();
        var model = mlContext.Model.Load(modelPath, out var schema);
        return (model, schema);
    }

    private static IDataView IDictionaryToIDataView(IDictionary<string, string> dictionary, DataViewSchema schema)
    {
        // Only schema columns that have a matching dictionary key become DataFrame columns.
        return new DataFrame(schema.AsEnumerable().Where(x => dictionary.ContainsKey(x.Name)).Select(x => ToDataFrameColumn(x, dictionary[x.Name])));
    }

    private static IList<T> ExtractColumnListFromIDataView<T>(IDataView dataView, string columnName)
    {
        var column = dataView.Schema[columnName];
        return ConsumeModel.GetColumnValueAsList<T>(dataView, column);
    }

    private static List<T> GetColumnValueAsList<T>(IDataView res, DataViewSchema.Column column)
    {
        return res.GetColumn<T>(column).ToList();
    }

    // Converts one string input into a one-row DataFrame column of the schema's type.
    // For bool, int, and float columns, a null or whitespace string is appended as null (missing).
    private static DataFrameColumn ToDataFrameColumn(DataViewSchema.Column column, string value)
    {
        if (column.Type is TextDataViewType)
        {
            var columns = new StringDataFrameColumn(column.Name, 0);
            columns.Append(value);
            return columns;
        }
        else if (column.Type.RawType == typeof(bool))
        {
            var primitiveColumn = new BooleanDataFrameColumn(column.Name);
            try
            {
                primitiveColumn.Append(!string.IsNullOrWhiteSpace(value) ? Convert.ToBoolean(value) : (bool?)null);
            }
            catch
            {
                throw new InvalidCastException(string.Format("Input string for {0} is not in the correct format.", column.Name));
            }
            return primitiveColumn;
        }
        else if (column.Type.RawType == typeof(int))
        {
            var primitiveColumn = new Int32DataFrameColumn(column.Name);
            try
            {
                primitiveColumn.Append(!string.IsNullOrWhiteSpace(value) ? Convert.ToInt32(value) : (int?)null);
            }
            catch
            {
                throw new InvalidCastException(string.Format("Input string for {0} is not in the correct format.", column.Name));
            }
            return primitiveColumn;
        }
        else if (column.Type.RawType == typeof(float))
        {
            var primitiveColumn = new SingleDataFrameColumn(column.Name);
            try
            {
                primitiveColumn.Append(!string.IsNullOrWhiteSpace(value) ? Convert.ToSingle(value) : (float?)null);
            }
            catch
            {
                throw new InvalidCastException(string.Format("Input string for {0} is not in the correct format.", column.Name));
            }
            return primitiveColumn;
        }
        else
        {
            // Other column types are not handled by this sample.
            throw new NotImplementedException();
        }
    }
}
I believe the above code requires these NuGet packages:
<PackageReference Include="Microsoft.Data.Analysis" Version="0.4.0" />
<PackageReference Include="Microsoft.ML.DataView" Version="1.5.1" />
@LittleLittleCloud @harishsk @justinormont @eerhardt - Is there an easier way to predict with nullable inputs?
tl;dr: Generally, for numeric columns I would use a float type with Single.NaN as the missing value.
ML.NET had nullable types for int, uint, long, short, byte, bool, etc. until late 2018, when they were removed. This simplified the code a bit. See: https://github.com/dotnet/machinelearning/issues/673
Previously, these values all were nullable and used that to represent a missing value. Other types have a value built-in to represent a missing value, like Single.NaN; those remain.
The main types that can represent missing input values are: float (Single.NaN), double (Double.NaN), Key (internal value of 0), string (String.Empty).
The main negative impact I found from not supporting nullable types is that it was detrimental for binary classification. Previously, binary classification allowed label types of { Key, Single, and bool }. This was reduced to only bool, and the nullable portion of bool was removed. Together, these two changes meant that any dataset with missing label values could not be run. Such datasets are rather common (this is called semi-supervised learning), and this is the root of why I recommend running datasets as multi-class instead of binary classification.
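As a rough illustration of the float-with-NaN convention described above, the sketch below maps a nullable value read from a data source onto a non-nullable float field. The ModelInput class shown here and the ToModelFloat helper are hypothetical; only the Flash_point member name comes from the error message in this issue. A real input class would normally also carry ML.NET's column attributes, which are omitted to keep the sketch dependency-free.

```csharp
using System;

// Hypothetical input class: a plain float field instead of float?.
public class ModelInput
{
    public float Flash_point;
}

public static class MissingValues
{
    // Map a nullable value to the float convention: null becomes NaN.
    public static float ToModelFloat(float? value) => value ?? float.NaN;
}

public static class Program
{
    public static void Main()
    {
        float? fromEmptyCell = null; // e.g. an empty cell in the source data
        var input = new ModelInput { Flash_point = MissingValues.ToModelFloat(fromEmptyCell) };
        Console.WriteLine(float.IsNaN(input.Flash_point)); // prints "True"
    }
}
```

The input type stays float, which avoids the "Could not determine an IDataView type" error, while the empty cell remains distinguishable from zero.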
Related issues for binary classification: (sorry for the private repo links)
- https://github.com/dotnet/machinelearning-automl/issues/390
- https://github.com/dotnet/machinelearning-automl/issues/420
- https://github.com/dotnet/machinelearning-automl/issues/386
- https://github.com/dotnet/machinelearning-automl/issues/255
- https://github.com/dotnet/machinelearning-automl/issues/91
- https://github.com/dotnet/machinelearning-tools/pull/534#discussion_r386627165
There's a trick that lets you preserve the missing values of ints (and other types). Bring the values in as text, which lets you run IndicateMissingValues() to record that a value is missing (or otherwise operate on the missing value), then run ConvertType() to convert the text to int, leaving you with one int column and one missing-indicator column.
Thanks @justinormont! I’ll take a look at that approach and see how it compares to the above approach (my initial approach) ... for both ease of use and performance. I’m actually not sure if my approach is doing what I think it is. It seems to work better than replacing floats with 0s, but I don’t know the code path in ML.NET very well.
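The end result of that trick, one int column plus one missing-indicator column, can be illustrated without ML.NET at all. The sketch below is plain C# and does not call IndicateMissingValues() or ConvertType(); the ParseWithIndicator helper and its use of 0 as the placeholder value are assumptions made purely for illustration.

```csharp
using System;
using System.Globalization;

public static class MissingIndicator
{
    // Parse a raw text cell into an int value plus a flag recording whether the cell was empty.
    public static (int Value, bool IsMissing) ParseWithIndicator(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw))
        {
            // Placeholder value; the flag carries the "missing" information.
            return (0, true);
        }
        return (int.Parse(raw, CultureInfo.InvariantCulture), false);
    }
}

public static class Program
{
    public static void Main()
    {
        // Each text cell yields a value and a missing indicator.
        foreach (var cell in new[] { "42", "", "7" })
        {
            var (value, isMissing) = MissingIndicator.ParseWithIndicator(cell);
            Console.WriteLine($"{value}\t{isMissing}");
        }
    }
}
```

Keeping the indicator alongside the value is what lets a downstream model tell a genuine 0 apart from an empty cell.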
Found a few more users hitting this issue:
https://stackoverflow.com/questions/60663885/does-ml-net-accepts-null
https://stackoverflow.com/questions/62235411/ml-net-cant-use-nullable-types-in-idataview-how-do-i-handle-missing-data
This should be moved to the ML repo. @JakeRadMSFT
@michaelgsharp Can you help move this over?
Any updates about handling null values?