machinelearning Handle null/empty values better for training and consumption

System Information (please complete the following information):

Model Builder Version: ML.NET CLI 16.1.1
Visual Studio Version: 8.7.6 for Mac

Describe the bug I changed the model input to use nullable types to match my data set, and ConsumeModel throws an error:

System.ArgumentOutOfRangeException has been thrown: "Could not determine an IDataView type for member Flash_point (Parameter 'rawType')"

I have inputs, but some values are missing (empty cells, which are NOT zero).

To Reproduce Steps to reproduce the behavior:

Train a model on data with missing values
Try to run new data with missing values through the model
Change the ModelInput float properties to float?
See the error

Expected behavior Handle nullable types since my data includes some empty cells

Screenshots NA

Additional context NA

Jul 26 '20 06:07 bizbizzz

Thanks for reporting this issue! I believe this is a limitation of ML.NET's PredictionEngine, but I'll follow up with the ML.NET team to confirm.

Internal to Model Builder we use the following code for our Evaluate tab ... I think it might be helpful for predicting with some missing values. I tried to grab all the relevant code ... please report back if something is missing. I'll work on simplifying this code next week.

// Note: this.mlContext and this.modelCache are fields of the containing class (not shown here).
public async Task<KeyValuePair<string, float>> PredictRegressionAsync(string modelPath, IDictionary<string, object> values, string labelColumnName, string scoreColumnName = "Score")
{
    // Run the prediction off the calling thread.
    return await Task.Run(() =>
    {
        (var model, var schema) = this.LoadModelAndInputSchemaFromFile(modelPath);
        var inputDataView = this.IDictionaryToIDataView(values, schema);
        var resultDataView = model.Transform(inputDataView);
        var score = this.ExtractColumnListFromIDataView<float>(resultDataView, scoreColumnName).First();

        return new KeyValuePair<string, float>(labelColumnName, score);
    });
}

// Loads the model and its input schema, caching the pair per model path.
private (ITransformer, DataViewSchema) LoadModelAndInputSchemaFromFile(string modelPath)
{
    if (this.modelCache.TryGetValue(modelPath, out var cached))
    {
        return cached;
    }

    var model = this.mlContext.Model.Load(modelPath, out var schema);
    this.modelCache.Add(modelPath, (model, schema));

    return (model, schema);
}

// Builds a single-row DataFrame from the schema columns that have an entry in the dictionary.
// ToDataFrameColumn is an extension method on DataViewSchema.Column (not shown here).
private IDataView IDictionaryToIDataView(IDictionary<string, object> dictionary, DataViewSchema schema)
{
    return new DataFrame(schema.AsEnumerable().Where(x => dictionary.ContainsKey(x.Name)).Select(x => x.ToDataFrameColumn(dictionary[x.Name])));
}

// Reads all values of the named column from the data view.
private IList<T> ExtractColumnListFromIDataView<T>(IDataView dataView, string columnName)
{
    var column = dataView.Schema[columnName];
    return this.GetColumnValueAsList<T>(dataView, column);
}

private List<T> GetColumnValueAsList<T>(IDataView res, DataViewSchema.Column column)
{
    return res.GetColumn<T>(column).ToList();
}

Jul 26 '20 17:07 JakeRadMSFT

Attached is a simplified version. There is still room for more simplification, but you should just need to call PredictRegression with the model path, a dictionary of key-value pairs giving the name and value of each input, and the label column (the column to predict).

Simplified Version:

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Data.Analysis;
using Microsoft.ML;
using Microsoft.ML.Data;

public class ConsumeModel
    {
        public static KeyValuePair<string, float> PredictRegression(string modelPath, IDictionary<string, string> values, string labelColumnName, string scoreColumnName = "Score")
        {
            (var model, var schema) = ConsumeModel.LoadModelAndInputSchemaFromFile(modelPath);
            var inputDataView = ConsumeModel.IDictionaryToIDataView(values, schema);
            var resultDataView = model.Transform(inputDataView);
            var score = ConsumeModel.ExtractColumnListFromIDataView<float>(resultDataView, scoreColumnName).First();

            return new KeyValuePair<string, float>(labelColumnName, score);
        }

        private static (ITransformer, DataViewSchema) LoadModelAndInputSchemaFromFile(string modelPath)
        {
            MLContext mlContext = new MLContext();
            var model = mlContext.Model.Load(modelPath, out var schema);
            return (model, schema);
        }

        private static IDataView IDictionaryToIDataView(IDictionary<string, string> dictionary, DataViewSchema schema)
        {
            return new DataFrame(schema.AsEnumerable().Where(x => dictionary.ContainsKey(x.Name)).Select(x => ToDataFrameColumn(x,dictionary[x.Name])));
        }

        private static IList<T> ExtractColumnListFromIDataView<T>(IDataView dataView, string columnName)
        {
            var column = dataView.Schema[columnName];
            return ConsumeModel.GetColumnValueAsList<T>(dataView, column);
        }

        private static List<T> GetColumnValueAsList<T>(IDataView res, DataViewSchema.Column column)
        {
            return res.GetColumn<T>(column).ToList();
        }

        private static DataFrameColumn ToDataFrameColumn(DataViewSchema.Column column, string value)
        {
            if (column.Type is TextDataViewType)
            {
                var columns = new StringDataFrameColumn(column.Name, 0);
                columns.Append(value);
                return columns;
            }
            else if (column.Type.RawType == typeof(bool))
            {
                var primitiveColumn = new BooleanDataFrameColumn(column.Name);
                try
                {
                    primitiveColumn.Append(!string.IsNullOrWhiteSpace(value) ? Convert.ToBoolean(value) : (bool?)null);
                }
                catch
                {
                    throw new InvalidCastException(string.Format("Input string for {0} is not in the correct format.", column.Name));
                }

                return primitiveColumn;
            }
            else if (column.Type.RawType == typeof(int))
            {
                var primitiveColumn = new Int32DataFrameColumn(column.Name);
                try
                {
                    primitiveColumn.Append(!string.IsNullOrWhiteSpace(value) ? Convert.ToInt32(value) : (int?)null);
                }
                catch
                {
                    throw new InvalidCastException(string.Format("Input string for {0} is not in the correct format.", column.Name));
                }

                return primitiveColumn;
            }
            else if (column.Type.RawType == typeof(float))
            {
                var primitiveColumn = new SingleDataFrameColumn(column.Name);
                try
                {
                    primitiveColumn.Append(!string.IsNullOrWhiteSpace(value) ? Convert.ToSingle(value) : (float?)null);
                }
                catch
                {
                    throw new InvalidCastException(string.Format("Input string for {0} is not in the correct format.", column.Name));
                }

                return primitiveColumn;
            }
            else
            {
                throw new NotImplementedException();
            }
        }
    }

I believe the above code requires these NuGets:

<PackageReference Include="Microsoft.Data.Analysis" Version="0.4.0" />
<PackageReference Include="Microsoft.ML.DataView" Version="1.5.1" />

Jul 26 '20 23:07 JakeRadMSFT

@LittleLittleCloud @harishsk @justinormont @eerhardt - Is there an easier way to predict with nullable inputs?

Jul 26 '20 23:07 JakeRadMSFT

TL;DR: for numeric columns, I would generally use a float type with Single.NaN as the missing value.

ML.NET had nullable types for int, uint, long, short, byte, bool, etc. until late 2018, when they were removed. This simplified the code a bit. See: https://github.com/dotnet/machinelearning/issues/673

Previously, all of these types were nullable and used null to represent a missing value. Other types have a built-in value that represents a missing value, like Single.NaN; those remain.

The main types that can represent missing input values are: float (Single.NaN), double (Double.NaN), Key (internal value of 0), string (String.Empty).
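As a rough sketch of the float-with-NaN suggestion: the input class keeps a plain float property, and nullable or empty values are mapped to Single.NaN before prediction. MissingValues, ToFloatOrNaN, ParseOrNaN and the ModelInput shape below are hypothetical names made up for illustration (only Flash_point comes from the error message in the report), and the snippet uses nothing beyond the base class library.

```csharp
using System;
using System.Globalization;

// Hypothetical helpers (not part of ML.NET) that turn "no value" into the NaN sentinel.
public static class MissingValues
{
    // A null nullable float becomes NaN; any other value passes through unchanged.
    public static float ToFloatOrNaN(float? value) => value ?? float.NaN;

    // An empty or whitespace-only cell becomes NaN; anything else is parsed
    // with the invariant culture (throws FormatException on malformed input).
    public static float ParseOrNaN(string cell) =>
        string.IsNullOrWhiteSpace(cell)
            ? float.NaN
            : float.Parse(cell, CultureInfo.InvariantCulture);
}

// Illustrative input class: the property stays a plain float rather than float?.
public class ModelInput
{
    public float Flash_point { get; set; }
}
```

With this in place, an input with a missing cell would be built as new ModelInput { Flash_point = MissingValues.ParseOrNaN("") }, and float.IsNaN(input.Flash_point) can be used to check for the missing value.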

The main negative impact I found from not supporting nullable types is on binary classification. Previously, binary classification allowed label types of { Key, Single, and bool }. This was reduced to only bool, and the nullable portion of bool was removed. Together, these two changes meant that any dataset with missing label values could not be run. This is rather common (called semi-supervised learning), and is the root of why I recommend running datasets as multi-class instead of binary classification.

Related issues for binary classification: (sorry for the private repo links)

https://github.com/dotnet/machinelearning-automl/issues/390
https://github.com/dotnet/machinelearning-automl/issues/420
https://github.com/dotnet/machinelearning-automl/issues/386
https://github.com/dotnet/machinelearning-automl/issues/255
https://github.com/dotnet/machinelearning-automl/issues/91
https://github.com/dotnet/machinelearning-tools/pull/534#discussion_r386627165

There's a trick that lets you preserve the missing values of ints (and others). Bring the values in as text, which lets you run IndicateMissingValues() to note that a value is missing (or otherwise operate on the missing value), then run ConvertType() to convert the text to int, leaving you with one int column and one missing-indicator column.
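A hedged sketch of the indicator-column idea, shown here for a float column rather than the text-to-int route described above (that route is not verified here). It assumes the Microsoft.ML NuGet package; Row, MissingValueSketch and the output column names are made up for illustration. NaN marks the missing cell, IndicateMissingValues adds a bool flag column, and ReplaceMissingValues writes a filled copy using the mean replacement mode.

```csharp
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;

// Illustrative row type; Flash_point is the column name from the original report.
public class Row
{
    public float Flash_point { get; set; }
}

public static class MissingValueSketch
{
    public static (bool[] Missing, float[] Filled) Run()
    {
        var mlContext = new MLContext(seed: 0);

        // The second row is missing, represented as NaN.
        var rows = new[]
        {
            new Row { Flash_point = 10f },
            new Row { Flash_point = float.NaN },
            new Row { Flash_point = 30f },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // One column flags missing values, another holds an imputed copy.
        var pipeline = mlContext.Transforms
            .IndicateMissingValues("Flash_point_missing", "Flash_point")
            .Append(mlContext.Transforms.ReplaceMissingValues(
                "Flash_point_filled",
                "Flash_point",
                MissingValueReplacingEstimator.ReplacementMode.Mean));

        IDataView transformed = pipeline.Fit(data).Transform(data);

        var missing = transformed.GetColumn<bool>("Flash_point_missing").ToArray();
        var filled = transformed.GetColumn<float>("Flash_point_filled").ToArray();
        return (missing, filled);
    }
}
```

Keeping the indicator next to the filled column lets a trainer still see which rows were originally missing, instead of silently treating them as ordinary values.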

Jul 27 '20 01:07 justinormont

Thanks @justinormont! I’ll take a look at that approach and see how it compares to the above approach (my initial approach), for both ease of use and performance. I’m actually not sure if my approach is doing what I think it is. It seems to work better than replacing floats with 0s, but I don’t know the code path in ML.NET very well.

Jul 27 '20 04:07 JakeRadMSFT

Found a few more users hitting this issue:

https://stackoverflow.com/questions/60663885/does-ml-net-accepts-null
https://stackoverflow.com/questions/62235411/ml-net-cant-use-nullable-types-in-idataview-how-do-i-handle-missing-data

Jul 27 '20 18:07 JakeRadMSFT

This should be moved to the ML repo. @JakeRadMSFT

Sep 02 '21 19:09 beccamc

@michaelgsharp Can you help move this over?

Sep 30 '21 18:09 JakeRadMSFT

Any updates about handling null values?

Apr 26 '24 11:04 piotrkazmierczak2323