Kelly A
On reviewing this again, there are some possible solutions I can think of: 1. [On this code that checks if the backoff limit is exceeded](https://github.com/kubeflow/training-operator/blob/5b2c6c8943fe6a1f8803f268f71ca714316fa6bc/pkg/core/job.go#L95), have it look for job restart events...
If I set restartPolicy to OnFailure in the PyTorchJob, it restarts until backoffLimit is met. If I set restartPolicy to ExitCode in the PyTorchJob, it ignores the backoffLimit...
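For context, a minimal sketch of where those two fields sit in a PyTorchJob manifest (the job name, image, and replica count here are placeholders, not from the thread):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-example      # hypothetical job name
spec:
  runPolicy:
    backoffLimit: 3               # max restarts before the job is marked Failed
  pytorchReplicaSpecs:
    Worker:
      replicas: 2                 # placeholder replica count
      restartPolicy: ExitCode     # or OnFailure; the behavior difference described above
      template:
        spec:
          containers:
            - name: pytorch       # training-operator expects this container name
              image: my-training-image:latest   # placeholder image
              command: ["python", "train.py"]
```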
You'll want to prefix your model name with the provider. E.g., for [VLLM](https://docs.litellm.ai/docs/providers/vllm) and Llama 3 it would be: model="hosted_vllm/meta-llama/Llama-3.3-70B-Instruct"
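A minimal sketch of that in a LiteLLM call; the `api_base` URL is an assumption, so point it at your own vLLM server:

```python
# Calling a self-hosted vLLM server through LiteLLM with the provider prefix.
from litellm import completion

response = completion(
    model="hosted_vllm/meta-llama/Llama-3.3-70B-Instruct",  # provider prefix + model name
    api_base="http://localhost:8000/v1",                    # assumed vLLM server address
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```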