tribuo changes types between input dataset and prediction
see https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/tribuo.20prediction.20datatype.20does.20not.20match
(ns scicloj.ml.tribuo
  (:require
   [tech.v3.dataset :as ds]
   [tech.v3.dataset.modelling :as ds-model]
   [tech.v3.libs.tribuo :as tribuo])
  (:import
   (com.oracle.labs.mlrg.olcut.config DescribeConfigurable)
   (org.tribuo.classification.sgd.linear LogisticRegressionTrainer)))

(def logreg-trainer (LogisticRegressionTrainer.))

(def dummy-ds
  (->
   (ds/->dataset {:x [1 1] :y [0 1]})
   (ds-model/set-inference-target :y)))

(-> dummy-ds :y seq)
;; => (0 1)

(def m (tribuo/train-classification logreg-trainer dummy-ds))

(->
 (tribuo/predict-classification m dummy-ds)
 :prediction
 seq)
;; => ("0" "0")
This is problematic because accuracy is usually calculated by comparing the true values with the predictions, here [0 1] and ["0" "0"].
In automatic evaluations nobody may notice the different types,
so every pair is evaluated as "non matching" even where the values do match.
In metamorph.ml and its predict method I will try to fail in all such situations.
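The silent mismatch described above can be reproduced without Tribuo. The following is only a sketch in plain Clojure: `accuracy` and `check-same-types!` are hypothetical helper names used for illustration, not functions from metamorph.ml or tech.ml.dataset, and the guard shows just one possible way to "fail" on a type change.

```clojure
(defn accuracy
  "Fraction of positions where the true value equals the predicted value."
  [truth predicted]
  (/ (count (filter true? (map = truth predicted)))
     (count truth)))

;; The integer 0 is never equal to the string "0", so nothing matches.
(accuracy [0 1] ["0" "0"])
;; => 0

;; With matching types, the first prediction counts as correct.
(accuracy [0 1] [0 0])
;; => 1/2

(defn check-same-types!
  "Hypothetical guard: throws when the predictions contain a class that
  does not occur among the target values, otherwise returns the predictions."
  [truth predicted]
  (let [truth-types (set (map class truth))
        pred-types  (set (map class predicted))]
    (when-not (every? truth-types pred-types)
      (throw (ex-info "Prediction datatype does not match target datatype"
                      {:target-types     truth-types
                       :prediction-types pred-types})))
    predicted))
```

Calling `(check-same-types! [0 1] ["0" "0"])` throws instead of letting the comparison quietly produce an accuracy of 0.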
Any news on this issue? The fact that the Tribuo trainer changes the datatype of its predictions from a number to a string makes it a bad player among the different models. A model which is trained on [0 1 0 1 ..] (as int) should never predict "0" or "1". I think a well-behaved (classification) model should never predict anything it never saw in the training data. 0 and "0" are not the same in this context.
I agree
I think the "problem" here is that classification in Tribuo only foresees a training target column of type "Label" (which is fundamentally a String). In Java this is guaranteed by the type system and its generic types. In Clojure we circumvent this, which in some form exposes a "bug" in Tribuo, but it cannot happen when using Java code (so it's not a bug).
I went deeper. The issue is this line: https://github.com/techascent/tech.ml.dataset/blob/b0896cc6116ad6aa049fb7f1b955e9fe49b07ae8/src/tech/v3/libs/tribuo.clj#L126
in which the code converts keywords, numbers and Strings to "string" and forgets the initial type. We probably need to "remember" the original type in some way and convert back after prediction.
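One way the "remember and convert back" idea could look is sketched below. It is not the tech.ml.dataset implementation: `label-lookup` and `restore-labels` are hypothetical names, and the sketch assumes the string form of a target value is what `str` produces, which may differ from what the linked line actually does (for keywords in particular).

```clojure
(defn label-lookup
  "Builds a map from the string form of each distinct target value
  back to the original value. Assumes `str` is the conversion used."
  [target-values]
  (into {} (map (juxt str identity)) (distinct target-values)))

(defn restore-labels
  "Replaces each predicted string with the original value it came from.
  Strings without an entry in the lookup are returned unchanged."
  [lookup predicted-strings]
  (mapv #(get lookup % %) predicted-strings))

;; Remember the original values before training ...
(def lookup (label-lookup [0 1]))
;; => {"0" 0, "1" 1}

;; ... and convert the string predictions back afterwards.
(restore-labels lookup ["0" "0"])
;; => [0 0]
```

A lookup of this kind breaks down if two distinct target values share the same string form (for example 0 and "0" in the same column), so a real fix would have to decide how to handle that case.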