training-operator issues

[Release] Training Operator 1.8 Roadmap

19

This is the tracking issue for the Training Operator 1.8 release. The feature freeze date for the upcoming Kubeflow 1.9 release is April 15th. We are targeting the following features for...

andreyvelich

release/1.8

fix volcano podgroup update issue

6

**What this PR does / why we need it**: This fixes an issue caused by this [PR](https://github.com/kubeflow/common/pull/207): the minMember may be updated when the number of replicas changes. However,...

ckyuto

size/XS

Fine-Tune APIs for LLM Documentation

5

This issue tracks the Kubeflow 1.9 documentation deliverables for the new Fine-Tune APIs for LLMs. - [ ] Write intro doc with high-level presentation and user stories. @StefanoFioravanzo Draft to...

StefanoFioravanzo

area/docs

release/1.8

fixed import sort order on api_client.py using isort

3

**What this PR does / why we need it**: **Which issue(s) this PR fixes** _(optional, in `Fixes #, #, ...` format, will close the issue(s) when PR gets merged)_: Fixes...

anindyahepth

size/S

chore(fix): isort xgboost

2

**What this PR does / why we need it**: Fixed sorting issue in xgboost folder's imports **Which issue(s) this PR fixes** _(optional, in `Fixes #, #, ...` format, will close...

harshithbelagur

size/M

isorted all py files in training-operator/sdk/python/kubeflow/training/api

5

**What this PR does / why we need it**: sorted py files in training-operator/sdk/python/kubeflow/training/api **Which issue(s) this PR fixes** _(optional, in `Fixes #, #, ...` format, will close the...

miaozeyu

size/M

update match-case to if/else for backward compatibility for Python versions < 3.10

6

**What this PR does / why we need it**: **Which issue(s) this PR fixes** _(optional, in `Fixes #, #, ...` format, will close the issue(s) when PR...

2020ayao

size/M

do-not-merge/hold

fix(compatibility): match-case syntax only compatible with Python 3.10+

5

**what's the problem** The `match`/`case` syntax requires Python 3.10 or later, so it is a syntax error on earlier Python versions. https://github.com/kubeflow/training-operator/blob/f8f7363eb905757e7c05321ec8df81aed61cf6c6/sdk/python/kubeflow/storage_initializer/storage.py#L7-L28 **what's the fix?** Use if-else instead.

PantherHawk

release/1.8
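
A minimal sketch of the kind of rewrite the issue above proposes, replacing a `match`/`case` dispatch with an if/else chain that also runs on Python versions before 3.10. The class names, the `create_storage` function, and the provider strings are hypothetical and chosen only for illustration; the actual contents of `storage.py` may differ.

```python
# Hypothetical storage classes, used only to illustrate the rewrite.
class HuggingFaceStorage:
    pass


class S3Storage:
    pass


def create_storage(provider: str):
    """Return a storage object for the given provider name (hypothetical helper)."""
    # The match/case form, which only parses on Python 3.10+, would be:
    #
    #     match provider:
    #         case "hf":
    #             return HuggingFaceStorage()
    #         case "s3":
    #             return S3Storage()
    #         case _:
    #             raise ValueError(...)
    #
    # The if/else chain below behaves the same and parses on older versions.
    if provider == "hf":
        return HuggingFaceStorage()
    elif provider == "s3":
        return S3Storage()
    else:
        raise ValueError(f"unsupported provider: {provider!r}")
```

Because `match` is parsed at compile time, guarding it behind a runtime version check inside the same module would not help on older interpreters; the module would still fail to import, which is why a plain if/else chain is the simpler fix.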

chore(style): provide type for `STORAGE_INITIALIZER_VOLUME` constant

2

**what happened?**
```shell
$ mypy sdk/python/kubeflow/training/constants/constants.py | grep -i ":87"
sdk/python/kubeflow/training/constants/constants.py:87: error: Module has no attribute "V1Volume" [attr-defined]
```
**what did you expect?** No style errors. **is there a...

PantherHawk

PyTorchJob restartPolicy: ExitCode does not honor backoffLimit for retryable errors

9

### **Steps to reproduce:**
1. Set the PyTorchJob restartPolicy: ExitCode
2. Set backoffLimit > 1
3. Have a container exit with a non-zero exit code greater than 128

### **Observed...

kellyaa

kind/feature
