amazon-sagemaker-examples

Model parallel v2 llama finetuning notebook fixes

Open ArjunKrishnak opened this issue 1 year ago • 5 comments

Description of changes:

  • Updating the model parallel v2 README to clarify usage of shared-scripts directory
  • Disabling fp8 by default for backward compatibility
  • Updating the Llama finetuning example with inline comments for the FSx args and an upgrade command for pytest

Testing done: Ran smp-finetuning-llama-fsdp-tp.ipynb in a SageMaker notebook and ensured the SageMaker training job succeeded.

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

  • [x] I have read the CONTRIBUTING doc and adhered to the example notebook best practices
  • [x] I have updated any necessary documentation, including READMEs
  • [x] I have tested my notebook(s) and ensured it runs end-to-end
  • [x] I have linted my notebook(s) and code using black-nb -l 100 {path}/{notebook-name}.ipynb

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

ArjunKrishnak avatar Apr 29 '24 17:04 ArjunKrishnak

Are you using something like ON_SIT|...|LINKMSG|206|....? And does it happen with the V3.10 slave script?

AFAIK, we haven't touched the slave script for a long time (besides some compatibility changes for OpenSim), so I guess V3.00 and perhaps even V2.x are affected?

Can we test it with the "animesh slave script"? All (important) messages are queued there, so it shouldn't be affected ...

We have seen something very similar while changing poses in the new alpha system for animesh adjusters.

Can you give me a hint to reproduce this?

LeonaMorro avatar Mar 12 '19 13:03 LeonaMorro

I tested in V3.00, and yes, it does happen there. It is probably in V2.01 too, but I did not test that. I believe earlier versions would not have the issue, as card contents were not cached. Things now run much faster from cache.

With the new animesh issue, there were 4 sitters using SCHMO lines. The first 2 sitters were fine, and sometimes the 3rd was also fine, but not always. The 4th sitter just would not work properly. When poses were changed, the AVs would change to the new animation but would not move to the proper locations. Another user reports they have not seen this with their usage. I will see if I have that build in my inventory.

HowardBaxton avatar Mar 12 '19 16:03 HowardBaxton