Kelly A
On reviewing this again, there are some possible solutions I can think of: 1. [On this code that checks if the backoff limit is exceeded](https://github.com/kubeflow/training-operator/blob/5b2c6c8943fe6a1f8803f268f71ca714316fa6bc/pkg/core/job.go#L95), have it look for job restart events...
If I set restartPolicy to OnFailure in the PyTorchJob, it restarts until backoffLimit is met. If I set restartPolicy to ExitCode in the PyTorchJob, it ignores the backoffLimit...
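For context, a minimal sketch of where those two fields sit in a PyTorchJob manifest (the job name, image, and replica count here are placeholders, not from the thread):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-example      # hypothetical job name
spec:
  runPolicy:
    backoffLimit: 3               # max restarts before the job is marked Failed
  pytorchReplicaSpecs:
    Worker:
      replicas: 2                 # placeholder replica count
      restartPolicy: ExitCode     # or OnFailure; the behavior difference described above
      template:
        spec:
          containers:
            - name: pytorch       # training-operator expects this container name
              image: my-training-image:latest   # placeholder image
              command: ["python", "train.py"]
```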
You'll want to prefix your model name with the provider. E.g., for [VLLM](https://docs.litellm.ai/docs/providers/vllm) and Llama 3 it would be: model="hosted_vllm/meta-llama/Llama-3.3-70B-Instruct"
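A minimal sketch of that in a LiteLLM call; the `api_base` URL is an assumption, so point it at your own vLLM server:

```python
# Calling a self-hosted vLLM server through LiteLLM with the provider prefix.
from litellm import completion

response = completion(
    model="hosted_vllm/meta-llama/Llama-3.3-70B-Instruct",  # provider prefix + model name
    api_base="http://localhost:8000/v1",                    # assumed vLLM server address
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```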