training-operator
Distributed ML Training and Fine-Tuning on Kubernetes
This is the tracking issue for the Training Operator 1.8 release. The feature freeze date for the next Kubeflow 1.9 release is April 15th. We are targeting the following features for...
**What this PR does / why we need it**: This fixes an issue caused by this [PR](https://github.com/kubeflow/common/pull/207); the minMember may be updated when the number of replicas is changed. However,...
This issue tracks the Kubeflow 1.9 documentation deliverables for the new Fine-Tune APIs for LLMs. - [ ] Write intro doc with high-level presentation and user stories. @StefanoFioravanzo Draft to...
**What this PR does / why we need it**: **Which issue(s) this PR fixes** _(optional, in `Fixes #, #, ...` format, will close the issue(s) when PR gets merged)_: Fixes...
**What this PR does / why we need it**: Fixed sorting issue in xgboost folder's imports **Which issue(s) this PR fixes** _(optional, in `Fixes #, #, ...` format, will close...
…g/api **What this PR does / why we need it**: sorted py files in training-operator/sdk/python/kubeflow/training/api **Which issue(s) this PR fixes** _(optional, in `Fixes #, #, ...` format, will close the...
…rsions < 3.10 **What this PR does / why we need it**: **Which issue(s) this PR fixes** _(optional, in `Fixes #, #, ...` format, will close the issue(s) when PR...
**what's the problem** Match case syntax is not compatible with earlier Python versions. https://github.com/kubeflow/training-operator/blob/f8f7363eb905757e7c05321ec8df81aed61cf6c6/sdk/python/kubeflow/storage_initializer/storage.py#L7-L28 **what's the fix?** Use if-else instead.
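As a rough illustration of the proposed fix, the sketch below shows a `match`/`case` dispatch rewritten as an `if`/`elif` chain, which parses on Python versions earlier than 3.10. The function name `select_provider` and the provider strings are hypothetical and are not taken from `storage.py`.

```python
# Hypothetical example: the names here are illustrative, not from storage.py.
#
# The match/case form below only parses on Python 3.10 and later:
#
#     match name:
#         case "s3":
#             return "S3 provider"
#         case "hf":
#             return "Hugging Face provider"
#         case _:
#             raise ValueError(f"unsupported provider: {name}")


def select_provider(name: str) -> str:
    """Equivalent dispatch written with if/elif, valid on older Python 3 versions."""
    if name == "s3":
        return "S3 provider"
    elif name == "hf":
        return "Hugging Face provider"
    else:
        # Mirrors the wildcard `case _` branch of the match statement.
        raise ValueError(f"unsupported provider: {name}")
```

The `if`/`elif` chain keeps the same branch order as the `match` statement, so behavior is unchanged for simple literal patterns like these.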
**what happened?**
```shell
$ mypy mypy sdk/python/kubeflow/training/constants/constants.py | grep -i ":87"
sdk/python/kubeflow/training/constants/constants.py:87: error: Module has no attribute "V1Volume"  [attr-defined]
```
**what did you expect?** No style errors. **is there a...
### **Steps to reproduce:**
1. Set the PyTorchJob restartPolicy: ExitCode
2. Set backoffLimit > 1
3. Have a container exit with a non-zero exit code greater than 128

### **Observed...