Add `fit_transform`?
See discussion at #16
I think for transformers it would make sense to require exactly two methods:
- `fit_transform`
- `transform`
No need for a separate fit implementation for transformers. When an ML pipeline is first fit, all of the transformers in the pipeline have to do a fit and a transformation, so I don't think fit needs to be separate. And of course using fit_transform allows for optimizations, as @davidbp mentioned.
The signature of fit_transform would probably look like this:
```julia
fitted_transformer, Xout = fit_transform(transformer, Xin)
```
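To make the optimization argument concrete, here is a hypothetical sketch of what an implementation might look like. The names (`Standardizer`, `FittedStandardizer`) are made up for illustration and are not part of any proposed API; the point is that `fit_transform` can reuse work that `transform(fit(...), ...)` would repeat:

```julia
using Statistics

# Illustrative standardizer: `fit_transform` computes the learned statistics
# and the transformed data in a single pass over `X`.
struct Standardizer end

struct FittedStandardizer
    mean::Float64
    std::Float64
end

transform(f::FittedStandardizer, X) = (X .- f.mean) ./ f.std

function fit_transform(::Standardizer, X)
    m = mean(X)
    s = std(X; corrected=false)
    transformed = (X .- m) ./ s   # computed once here; a separate
                                  # `transform` call would traverse `X` again
    return FittedStandardizer(m, s), transformed
end
```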
Thanks for that suggestion.
> No need for a separate fit implementation for transformers.
I dunno. It seems like a non-trivial complication to the API. No separate fit means some special-casing in model composition. In MLJ we expect every model to have a fit and this is pretty central to the learning network stuff. (If you can stomach it, see our paper here). Conceptually, predict and transform are treated very similarly - they're just functions depending on a learned parameter that you generate with fit.
If fit_transform is just sugar for transform(fit(...)) then I don't think it's justified in a basic interface. Every name we add to the namespace should work hard to justify its existence. Can you think of a use case where the optimisations gained are significant? I can imagine one avoids some data conversions (like DataFrame -> matrix -> DataFrame), but with the right "data front end" (which I'm still thinking about) this issue would have a workaround.
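For the record, "just sugar" would mean a single generic fallback suffices. A hypothetical sketch (the toy `Shift` transformer exists only to exercise the fallback):

```julia
# Generic fallback: any transformer implementing `fit` and `transform`
# gets `fit_transform` for free, with no per-transformer code.
function fit_transform(strategy, X)
    model = fit(strategy, X)
    return model, transform(model, X)
end

# Toy transformer (illustrative names, not part of any proposal):
struct Shift
    by::Float64
end
fit(s::Shift, X) = s                  # nothing to learn
transform(s::Shift, Xnew) = Xnew .+ s.by
```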
[I know some have argued for just one "operation" (predict, say) for added simplification, but this goes too far, in my view. In the current LearnAPI proposal, predict is distinguished by the fact that the output is a (proxy for a) "target", a general notion we make reasonably precise in the docs, and we enable dispatch on the type of proxy. transform need not have this interpretation, but can have an "inverse". As we see from sk-learn and MLJ, allowing algorithms to implement more than one operation (predict / transform / inverse_transform) is both natural and useful.]
Okay, here's a variation on your idea that doesn't require adding to the namespace. Each transformer implements one transform and one fit:
Case 1: static (non-generalizing) transformers
```julia
fit(strategy, X) -> model  # storing `transformed_X` and any inspectable byproducts of algorithm
transform(model) -> model.transformed_X
```
with a convenience fallback
```julia
transform(strategy, X) = transform(fit(strategy, X))
```
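A minimal sketch of Case 1 under this scheme, using a made-up static transformer (`LogTransform` and its field names are illustrative only):

```julia
# Static (non-generalizing) transformer: all the work happens in `fit`,
# which stores the transformed data; `transform` just retrieves it.
struct LogTransform end

struct LogModel
    transformed_X::Vector{Float64}
end

fit(::LogTransform, X) = LogModel(log.(X))
transform(model::LogModel) = model.transformed_X

# the convenience fallback from the text:
transform(strategy::LogTransform, X) = transform(fit(strategy, X))
```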
Case 2: generalizing transformers:
```julia
fit(strategy, X) -> model  # storing `transformed_X` and `learned_parameters` and any inspectable byproducts of algorithm
transform(model, Xnew) -> transformed_Xnew  # uses `model.learned_parameters`
```
with a convenience fallback
```julia
transform(strategy, X) = fit(strategy, X).transformed_X
```
I'm not sure we'd want to keep a reference to an intermediate transformed data set in a trained transformer. That would prevent the garbage collector from freeing that memory as long as the pipeline is still around.
It also feels conceptually a little muddy, but that's just a feeling that I haven't been able to put into more concrete terms yet. :)
Case 2: generalizing transformers:
```julia
fit(strategy, X) -> model  # storing `learned_parameters` and any inspectable byproducts of algorithm
transform(model, Xnew) -> transformed_Xnew  # uses `model.learned_parameters`
```
with a convenience fallback
```julia
transform(strategy, X) = transform(fit(strategy, X), X)
```
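A sketch of this revised Case 2, where the fitted model holds only learned parameters and no intermediate data set (so nothing pins the training output in memory). `Centerer` and its field are illustrative names, not part of any proposal:

```julia
using Statistics

# Generalizing transformer: `fit` learns parameters only;
# `transform` applies them to new data.
struct Centerer end

struct CentererModel
    mean::Float64   # the `learned_parameters`
end

fit(::Centerer, X) = CentererModel(mean(X))
transform(model::CentererModel, Xnew) = Xnew .- model.mean

# the convenience fallback from the text:
transform(strategy::Centerer, X) = transform(fit(strategy, X), X)
```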
Hmm, that form doesn't allow for optimizations, but, as you said, maybe there aren't really that many fit-then-transform cases that get a large benefit from optimizations.
In #30 an implementation can explicitly overload `transform(strategy, data)` to provide a one-shot method with no issues. I think providing a universal fallback is a bad idea, as it could lead to type instabilities, and confusion in debugging ("hidden knowledge").
On dev, a learner can implement `transform(learner, X)` as shorthand for `transform(fit(learner, X), X)`, or `transform(fit(learner), X)` for "static models".