tech.ml.dataset

tribuo changes types between input dataset and prediction

Open behrica opened this issue 2 years ago • 5 comments

see https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/tribuo.20prediction.20datatype.20does.20not.20match

(ns scicloj.ml.tribuo
  (:require
   [tech.v3.dataset :as ds]
   [tech.v3.dataset.modelling :as ds-model]
   [tech.v3.libs.tribuo :as tribuo])
  (:import
   (com.oracle.labs.mlrg.olcut.config DescribeConfigurable)
   (org.tribuo.classification.sgd.linear LogisticRegressionTrainer)))



(def logreg-trainer (LogisticRegressionTrainer.))

(def dummy-ds
  (->
   (ds/->dataset {:x [1 1] :y [0 1]})
   (ds-model/set-inference-target :y)))
(-> dummy-ds :y seq)
;; => (0 1)


(def m (tribuo/train-classification logreg-trainer dummy-ds))

(->
 (tribuo/predict-classification m dummy-ds)
 :prediction
 seq)
;; => ("0" "0")

behrica, Jan 30 '24 17:01

This is problematic because accuracy is usually calculated by comparing the true labels with the predictions, here [0 1] and ["0" "0"]. In automatic evaluations nobody may notice the differing types, so every pair is scored as "non matching", even where the values do match. In metamorph.ml and its predict method I will try to fail in all such situations.
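
A minimal sketch of that failure mode in plain Clojure. The helpers `accuracy` and `check-label-types!` are hypothetical names used only for illustration; they are not functions from metamorph.ml or tech.ml.dataset, and the real check there may look different.

```clojure
;; Hypothetical helper: fraction of positions where truth and
;; prediction are equal under Clojure's `=`.
(defn accuracy
  [truth predictions]
  (/ (count (filter true? (map = truth predictions)))
     (double (count truth))))

;; Long 0 and String "0" are never `=`, so nothing matches.
(accuracy [0 1] ["0" "0"])
;; => 0.0

;; With matching types the first prediction counts as correct.
(accuracy [0 1] [0 0])
;; => 0.5

;; Hypothetical guard: throw when the set of value types in the
;; predictions differs from the set of value types in the target.
(defn check-label-types!
  [truth predictions]
  (let [truth-types (set (map type truth))
        pred-types  (set (map type predictions))]
    (when-not (= truth-types pred-types)
      (throw (ex-info "Prediction datatype differs from target datatype"
                      {:target-types     truth-types
                       :prediction-types pred-types})))
    true))
```

Failing loudly, as in `check-label-types!`, surfaces the mismatch instead of silently reporting an accuracy of 0.0.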

behrica, Feb 04 '24 18:02

Any news on this issue? The fact that the tribuo trainer changes the "datatype" in its prediction from float to string makes it a bad player among the different models. A model which is trained on [0 1 0 1 ..] (as int) should never predict "0" or "1". I think a well-behaved (classification) model should never predict anything it never saw in the training data. 0 and "0" are not the same in this context.

behrica, Apr 07 '24 13:04

I agree

cnuernber, Apr 08 '24 12:04

I think the "problem" here is that Classification in Tribuo only anticipates a training target column of type "Label" (which is fundamentally a String). In Java this is guaranteed by the type system and its generic types. In Clojure we circumvent this, which in some form exposes a "bug" in Tribuo, but it cannot happen using Java code (so it's not a bug).

behrica, May 14 '24 16:05

I went deeper. The issue is this line: https://github.com/techascent/tech.ml.dataset/blob/b0896cc6116ad6aa049fb7f1b955e9fe49b07ae8/src/tech/v3/libs/tribuo.clj#L126

in which the code converts keywords, numbers and strings to "string", and forgets about the initial type. Probably we need to "remember" the original type in some way, and convert back after prediction.
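
One way the "remember and convert back" idea could look, sketched in plain Clojure. The names `label-lookup` and `restore-labels` are hypothetical, and the sketch assumes the string form of a label is what `str` produces; the conversion in tribuo.clj may differ (for keywords in particular), so this is not the library's implementation.

```clojure
;; Hypothetical helper: build a map from the string form of each
;; training label to its original value, e.g. {"0" 0, "1" 1}.
;; Assumption: labels are stringified with `str`.
(defn label-lookup
  [labels]
  (into {} (map (juxt str identity)) (distinct labels)))

;; Hypothetical helper: translate predicted strings back to the
;; original values; unknown strings are passed through unchanged.
(defn restore-labels
  [lookup predictions]
  (mapv #(get lookup % %) predictions))

(def lookup (label-lookup [0 1 0 1]))

(restore-labels lookup ["0" "0"])
;; => [0 0]

(restore-labels (label-lookup [:yes :no]) [":no" ":yes"])
;; => [:no :yes]
```

A lookup like this is ambiguous when the training column mixes values with the same string form, such as 0 and "0", so a real fix would have to reject or otherwise handle that case.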

behrica, May 14 '24 17:01