
Error with unseen values of metadata during inference

Open DataFactory-Verlingue opened this issue 4 years ago • 0 comments

Hi!

We encountered an error while testing Melusine.

When we train the pipeline with metadata, the file "metadata_engineering.py" fits sklearn label encoders on the metadata values present in our training dataset of emails.

Each encoder maps a string to a numerical value. For example, the attachment type "JPG" might be mapped to the numerical value "4".

When we reuse the metadata pipeline for inference, it calls the function "transform", which in turn calls the function "encode_extension". In this function, if a value was not seen when the label encoders were fitted, it is replaced by the value "other".

So, if the value was already encountered when the label encoders were fitted, the associated numerical value is returned. However, if it is a new metadata value, unseen in the training dataset, we get errors like this: [image]

This happens because the value "other" was itself never used to fit the label encoder, so no numerical value is associated with it.
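The failure can be reproduced outside Melusine with a few lines of sklearn. This is only a minimal sketch: the training values and the helper `encode_with_fallback` are hypothetical stand-ins for what the metadata pipeline does, not Melusine's actual code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical attachment types seen in the training emails.
train_attachment_types = ["jpg", "pdf", "png"]

encoder = LabelEncoder()
encoder.fit(train_attachment_types)  # "other" is NOT part of the fitted classes


def encode_with_fallback(value, encoder):
    """Hypothetical helper: replace unseen values by "other", then encode."""
    if value not in list(encoder.classes_):
        value = "other"
    return encoder.transform([value])[0]


# A value seen during fitting is encoded normally.
seen_code = encode_with_fallback("pdf", encoder)

# An unseen value becomes "other", which the encoder does not know either,
# so LabelEncoder.transform raises a ValueError.
try:
    encode_with_fallback("docx", encoder)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(seen_code, error_message)
```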

The problem affects both the email address extension and the attachment type.

To fix this error, we need to add the value "other" to the list of metadata values used to fit the label encoders. [image]
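A minimal sketch of that fix, again with plain sklearn: the training values, the `FALLBACK` constant and the `encode` helper are hypothetical and only illustrate the idea; the actual change is the one in the PR.

```python
from sklearn.preprocessing import LabelEncoder

FALLBACK = "other"

# Hypothetical email address extensions seen in the training emails.
train_extensions = ["gmail", "outlook", "yahoo"]

# Fit the encoder on the training values plus the fallback value,
# so that "other" always has a numerical code.
encoder = LabelEncoder()
encoder.fit(train_extensions + [FALLBACK])


def encode(value):
    """Hypothetical helper: encode a value, mapping unseen ones to the fallback."""
    if value not in list(encoder.classes_):
        value = FALLBACK
    return int(encoder.transform([value])[0])


# A seen value keeps its own code; an unseen one gets the code of "other".
codes = [encode("gmail"), encode("unknown-domain")]
print(codes)
```

With this change an unseen value no longer raises at inference time; every unseen value shares the single code assigned to "other".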

I will open a PR with our modifications.

Best regards,

Maxime

Python version: 3.9.7

Melusine version: 2.3.4

Operating System: Windows

DataFactory-Verlingue, Apr 12 '22 15:04