[Roadmap] Multiple outputs.
Since XGBoost 1.6, we have been working on multi-output support for the tree model. In 2.0, we will have the initial implementation of the vector-leaf-based multi-output model. This issue serves as a tracker for future development and related discussion. The original feature request is here: https://github.com/dmlc/xgboost/issues/2087. The features listed here are for the vector leaf rather than for general multi-output.
Feel free to share your suggestions or make related feature requests in the comments.
Implementation Optimization
- [ ] Use f-order for the gradient. Currently, the gradient has one column for each target but is written in C-order. The transformation takes about one-fifth of the training time. (#9508)
- [x] Use f-order for the custom objective. (#9089)
- [x] Improve array type dispatching by moving the dispatch logic from per-element to per-array. This enables us to have a more efficient custom objective interface. (#9090)
Algorithmic Optimization
We are still looking for potential algorithmic optimizations for the vector leaf; here is the pool of candidates. We need to survey all available options. Feel free to share if you have ideas or paper recommendations.
- [ ] SketchBoost.
- [ ] https://arxiv.org/abs/2201.06239
- [ ] Extra tree. (#11798)
GPU Implementation
- [ ] Evaluation (#11781)
- [ ] Histogram (#11781)
- [x] Prediction (#11752)
- [ ] Prediction cache.
- [ ] Model (#11277)
- [ ] Partition. (#11789)
- [ ] Gradient sampling.
Documentation
- [ ] Derive the approximated Hessian in the context of boosting trees.
Multi-task
- [ ] Multi-task xgboost. This is not yet decided. I think it's wise to do at least some exploration before forging ahead with the rest of the implementation, since we will need a very different interface if multi-task is to be considered. Related: https://github.com/dmlc/xgboost/issues/7693 .
Features
- [ ] Tree SHAP
- [x] Plotting (https://github.com/dmlc/xgboost/pull/10093)
- [x] Model text dump (JSON, txt, graphviz) (#10093, #11747)
- [ ] Tree data frame.
- [ ] Categorical feature.
- [ ] Constraints
- [ ] Approx tree method
- [ ] Exact tree method
- [ ] Loss weight
- [ ] Feature importance (be careful with tree index) (https://github.com/dmlc/xgboost/pull/10700)
- [x] Intercept. (#11656)
Learning to rank
We could support a ranking model that considers multiple criteria. This might require multi-task support.
Quantile regression
- [ ] l1
- [ ] quantile
Distributed
- [ ] Dask
- [ ] PySpark
- [ ] Spark
- [ ] Flink?
- [ ] Federated (https://github.com/dmlc/xgboost/pull/9171)
Binding
- [ ] R (https://github.com/dmlc/xgboost/pull/9526)
- [ ] Scala
- [x] Python
- [ ] Java
- [ ] C
HPO
- [ ] Check compatibility with major HPO frameworks.
Other extensions
- [ ] Sparse label. (multi-label classification optimization)
- [ ] Missing label.
- [ ] Early stopping for each target?
Applications
- https://arxiv.org/abs/2210.06831
- [ ] FIL
Benchmarks
- [ ] Collection of datasets for future comparison.
Hi, great work on the initial multi-target implementation!
Given the roadmap, when can we expect GPU support for multi-output regression? When this support is added, will xgboost-ray also support it?
Hi @CarloLepelaars ,
- For model-per-target, it's already implemented.
- For the vector leaf, it will take some more work, but eventually yes. I don't have an ETA for when it will be available.
- I think there is ongoing work on xgboost-ray, but you will need to open an issue on that repository for concrete answers.
Hi, very nice work! I am wondering how SHAP should be used for multi-output models, e.g. how to explain links between the Ys, and how to interpret the effects of Xs - e.g., which Xs display common effects across the Ys, and which Xs display differential effects. Do you know a good example of using SHAP for a multi-output model?
For model-per-target, it's the same as single target. As for the vector leaf, I haven't looked into it yet, but off the top of my head there should be no significant difference.
I am currently toying with the multi-target approach, and I'm having a hard time defining a custom metric (I haven't tried a custom loss). Preds seems to be of size (len(y) x len(targets)), while y_true is of shape (len(y), len(targets)). I have managed to handle this internally in my metric to return one value, but now I get an error about an output being a tuple instead of a number. Is there a way to handle this properly, or is it too early?
Hi.
Did anybody train the multiple-output XGBoost model on a Mac arm64 machine?
On the recent stable version, I got this error:
XGBoostError('[...] Check failed: !trees.front()->IsMultiTarget(): Update tree leaf support for multi-target tree is not yet implemented.
On the latest nightly version, xgboost-2.1.0.dev0+a7226c02223246be78a59c3a4e8c32d1c68c1ff9, it did load the CPU, but there was no feedback in the terminal window.
Is the vector-leaf-based multi-output model still a work in progress? Also, which research paper is the splitting mechanism for the decision trees based on? @trivialfis
Yes, it's still a work in progress.
Hi @trivialfis,
I'm currently working on some models using XGBoostLSS, which as far as I understand is based on the multi-output feature of XGBoost. I wonder how monotonic constraints are handled in the multi-output case? It seems the constraints are shared among the trees built for each target; could you confirm?
Thanks for your work on this feature!