Michael Clifford
Not sure how you plan to implement this, but it sounds like it would require the addition of an ever-growing set of template repos (is that right?). Have you considered...
> wouldn't that single repo become too big/complex? e.g. the NLP stack alone already has 4 overlays. It's certainly a trade-off to consider: managing one complex repo vs. the complexity...
@Sara-KS yes, `DDPJobDefinition._dry_run(cluster)` will generate the dry_run output.
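For anyone following along, a minimal sketch of that call might look like the following. All names, namespaces, and script paths here are illustrative, and the import is guarded since it requires codeflare-sdk and cluster access:

```python
# Minimal sketch (hypothetical names/values) of getting dry-run output
# for a DDPJobDefinition against an existing Ray cluster.
job_kwargs = {
    "name": "demo-job",               # illustrative job name
    "script": "train.py",             # illustrative training script
    "script_args": ["--epochs", "1"], # illustrative args
}

try:
    # Requires codeflare-sdk and a reachable cluster; guarded so the
    # sketch can be read/run anywhere.
    from codeflare_sdk.cluster.cluster import Cluster
    from codeflare_sdk.cluster.config import ClusterConfiguration
    from codeflare_sdk.job.jobs import DDPJobDefinition

    cluster = Cluster(ClusterConfiguration(name="demo", namespace="default"))
    jobdef = DDPJobDefinition(**job_kwargs)
    print(jobdef._dry_run(cluster))  # emits the generated job spec
except Exception:
    pass  # SDK not installed or no cluster access; the call shape is the point
```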
There is a parameter in `DDPJobDefinition()` that allows you to define mounts. See https://github.com/project-codeflare/codeflare-sdk/blob/baec8585b2bd918becd030951bf43e3504d43ada/src/codeflare_sdk/job/jobs.py#L62C11-L62C11 And the syntax should be similar to how we handle `script_args`. So something like: ```...
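A rough sketch of what that might look like. The mount string format is assumed from TorchX-style mount specs and the paths are hypothetical, so double-check against the linked `jobs.py`:

```python
# Illustrative mounts syntax, mirroring how script_args is passed.
# Mount string format assumed (TorchX-style); paths are hypothetical.
script_args = ["--epochs", "10"]
mounts = ["type=bind,src=/data/imagenet,dst=/mnt/imagenet"]

try:
    from codeflare_sdk.job.jobs import DDPJobDefinition  # requires codeflare-sdk
    jobdef = DDPJobDefinition(
        name="resnet50",
        script="train.py",
        script_args=script_args,
        mounts=mounts,
    )
except Exception:
    pass  # SDK unavailable here; the parameter shape above is the point
```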
Never mind; sorry @dfeddema, I just fully read your last comment and see that you still got errors with that approach.
Since you are not using a Ray cluster for this, I think you need to do the following to see the dry_run output. ``` jobdef = DDPJobDefinition(name="resnet50", script="pytorch/pytorch_imagenet_resnet50.py", script_args=arg_list, scheduler_args={"namespace":...
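Filling in the shape of that flow as a hedged sketch: the namespace and image values below are placeholders, and the no-cluster dry-run method name is assumed from the SDK's `jobs.py`, so verify it against your installed version:

```python
# Sketch of a dry run without a Ray cluster (MCAD scheduler path).
# namespace/image are placeholders; _dry_run_no_cluster is assumed from jobs.py.
try:
    from codeflare_sdk.job.jobs import DDPJobDefinition  # requires codeflare-sdk
except Exception:
    DDPJobDefinition = None  # SDK not installed

scheduler_args = {"namespace": "default"}  # placeholder namespace

if DDPJobDefinition is not None:
    jobdef = DDPJobDefinition(
        name="resnet50",
        script="pytorch/pytorch_imagenet_resnet50.py",
        script_args=["--epochs", "1"],            # illustrative args
        scheduler_args=scheduler_args,
        image="quay.io/example/training:latest",  # placeholder image
    )
    print(jobdef._dry_run_no_cluster())  # assumed no-cluster dry-run helper
```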
Thanks for pointing this out, @tedhtchang. I recall we ran into this issue before; we could not determine a regex that provided stable results, so we added a list...
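The pattern described there (try a regex first, keep an explicit list as a backstop for the cases it can't classify reliably) can be sketched generically; the regex and list contents below are hypothetical, not the actual ones used:

```python
import re

# Hypothetical backstop list: names the regex misclassifies are pinned
# explicitly, since no single pattern proved stable for every case.
KNOWN_VALID = {"my-odd-name_v2", "legacy.image"}
PATTERN = re.compile(r"^[a-z][a-z0-9-]*$")  # illustrative regex

def is_valid(name: str) -> bool:
    # Explicit list wins first; the regex handles the common shape.
    return name in KNOWN_VALID or bool(PATTERN.match(name))

print(is_valid("my-job"))        # True: matches the regex
print(is_valid("legacy.image"))  # True: caught only by the list
print(is_valid("9bad"))          # False: fails both
```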
@slemeur We've already started documenting how to use our images with Continue: https://github.com/containers/ai-lab-recipes/blob/main/recipes/natural_language_processing/code_generation/llms-vscode-integration.md But a tighter integration with AI Lab would be nice.
Do you get an error like this in the model server pod?
```
result = context.run(func, *args)
         ^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/llama_cpp/llama_chat_format.py", line 247, in _convert_text_completion_chunks_to_chat
    for i, chunk in enumerate(chunks):
...
```