seansong

Results: 44 comments of seansong

I also noticed that because of the all-reduce before the forward pass, it's not recommended to use FSDP for inference. Does this mean FSDP inference isn't supported so far by...

@mreso Thank you for looking into this issue. In the meantime, is there a workaround for using finetuned FSDP checkpoints for inference? Thanks.
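For anyone else stuck here, the usual workaround is to consolidate the sharded FSDP state dicts into a single full state dict and load that for inference. The sketch below shows only the save/reload round trip in plain single-process PyTorch (no actual FSDP sharding, toy model, hypothetical file name), to illustrate what the consolidated checkpoint needs to look like on the inference side:

```python
import torch
import torch.nn as nn

# Toy stand-in for the fine-tuned model architecture.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# With real FSDP, this dict would first be gathered across ranks
# (via FSDP's full-state-dict APIs); here it is already complete.
torch.save(model.state_dict(), "consolidated.pt")

# Inference side: rebuild the same architecture, then load the single file.
inference_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
inference_model.load_state_dict(torch.load("consolidated.pt"))
inference_model.eval()

# The reloaded weights reproduce the original outputs exactly.
x = torch.randn(2, 8)
with torch.no_grad():
    assert torch.equal(model(x), inference_model(x))
```

The key point is that the inference code must instantiate the exact same architecture before calling `load_state_dict`, otherwise the consolidated checkpoint will fail with shape or key mismatches.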

@mreso Is there an update on this? Thanks

Could we prioritize this? If the checkpoints don't work, how can we use the fine-tuned FSDP checkpoint for inference?

Hey @mreso, I found this only happens for Llama 3 and 3.1 models; inference with checkpoints from FSDP Llama 2 is OK. The architectures of Llama 3 and Llama 2 are pretty similar. Do...

Thanks @wukaixingxp for fixing this. For some reason I got this issue: `Processing dataset: 0%| | 0/49402 [00:00

@wukaixingxp I tried both meta-llama/Meta-Llama-3-8B-Instruct and meta-llama/Llama-3.1-8B-Instruct; both have the same issue as before. Here are my steps: ![image](https://github.com/user-attachments/assets/984a65c1-0871-424a-82f8-ee92454c63ab) ``` python ./src/llama_recipes/inference/checkpoint_converter_fsdp_hf.py --fsdp_checkpoint_path ./fsdp_fine_tune_results/fsdp_model_finetuned_1_8_8B/fine-tuned-Meta-Llama-3-8B-Instruct --consolidated_model_path ./fsdp_fine_tune_results/fsdp_model_finetuned_1_8_hf...
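For readers following along, the truncated conversion command above would in full form look roughly like the sketch below. The paths are placeholders and the `--HF_model_path_or_name` flag is an assumption about the llama-recipes converter script, not taken from the original comment:

```shell
# Consolidate sharded FSDP checkpoints into a single Hugging Face checkpoint.
# All paths below are hypothetical placeholders.
python ./src/llama_recipes/inference/checkpoint_converter_fsdp_hf.py \
  --fsdp_checkpoint_path ./fsdp_fine_tune_results/fine-tuned-checkpoint-dir \
  --consolidated_model_path ./consolidated_hf_model \
  --HF_model_path_or_name meta-llama/Meta-Llama-3-8B-Instruct
```

The base model name must match the architecture used for fine-tuning so the converter can rebuild the model before loading the sharded weights.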

@wukaixingxp Thanks for the updates. I can't find the `meta-llama/Meta-Llama-3.1-8B-Instruct` model card on Hugging Face, but there is `https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct`. I wonder if they are the same model with different names?

@wukaixingxp Thanks for the prompt reply. Here is the command I used (via Slurm): > srun -l docker exec -w /root/ fsdp torchrun --nnodes 1 --nproc_per_node 8 --rdzv_id 1599...