Antoni-Joan Solergibert
@XinDongol Why would you shuffle the dataset with that seed? Now that [Stateful DataLoaders](https://github.com/pytorch/torchtitan/pull/279) are about to merge, you won't be able to resume training properly after a crash because you...
Thanks for your answer @tianyu-l, it makes sense 😅 I was wondering, is there any way to avoid `.skip()` when resuming training? In my setup (& Colab), skipping 10000000 samples...
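For context, here is a minimal sketch of the kind of alternative I had in mind, using torchdata's `StatefulDataLoader` to checkpoint and restore the dataloader state instead of skipping consumed samples. This is not torchtitan's actual implementation; the dataset and file paths are placeholders.

```python
# Sketch (not torchtitan's code): resume a dataloader by restoring its state
# instead of calling .skip() over millions of already-seen samples.
import torch
from torch.utils.data import Dataset
from torchdata.stateful_dataloader import StatefulDataLoader

class RangeDataset(Dataset):  # toy stand-in for the real streaming dataset
    def __len__(self):
        return 10_000_000
    def __getitem__(self, i):
        return i

loader = StatefulDataLoader(RangeDataset(), batch_size=8, num_workers=2)

# ... consume some batches during training, then save the loader state
# alongside the model checkpoint:
torch.save(loader.state_dict(), "dataloader_state.pt")

# On resume, restore the state; iteration continues where it left off:
resumed = StatefulDataLoader(RangeDataset(), batch_size=8, num_workers=2)
resumed.load_state_dict(torch.load("dataloader_state.pt", weights_only=False))
```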
Any news regarding this issue? I'm experiencing the same behaviour during multi-node training with a shared filesystem. Toni
Check [4.](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features). I solved this issue by adding the `--no-build-isolation` flag!
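Concretely, the invocation that worked for me (installing the `flash-attn` package from PyPI, as the README suggests):

```
pip install flash-attn --no-build-isolation
```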
Hi @rlrs! Could you share the script you used to convert the weights from HF to DCP? Thanks!
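In case it helps anyone landing here, a rough sketch of what such a conversion could look like. This is a guess at the general shape, not the script asked about above; the repo id and output path are placeholders, and parameter names will likely need remapping to torchtitan's expected `state_dict` layout.

```python
# Rough sketch: load a HF Llama checkpoint and re-save it with
# torch.distributed.checkpoint (DCP). Key names may need remapping.
import torch
import torch.distributed.checkpoint as dcp
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)
state_dict = model.state_dict()

# dcp.save falls back to single-process mode when torch.distributed
# is not initialized.
dcp.save(state_dict, checkpoint_id="./llama3_8b_dcp")
```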
I've just found out that it works **if you install** the dependencies as in point 1 of [this post](https://www.philschmid.de/fine-tune-google-gemma). I ran the following to set up the environment: ``` pip install...
There is a `tokenizer.model` file in the Hugging Face checkpoints, under the `original/` folder; check https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/original/tokenizer.model
> > There is a tokenizer.model file in the Hugging Face Checkpoints under the /original folder, check https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/original/tokenizer.model
>
> This doesn't work directly because this model file can get...
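If it helps, a small sketch of fetching that file programmatically with `huggingface_hub` (the repo is gated, so you first need to accept the license and authenticate, e.g. via `huggingface-cli login`):

```python
# Sketch: download original/tokenizer.model from the gated Llama 3 repo.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    filename="original/tokenizer.model",
)
print(path)  # local cache path to tokenizer.model
```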
Hi @lk137095576, just adding my two cents here.

Regarding (3): The TE changes were merged a few months ago in [this PR](https://github.com/NVIDIA/TransformerEngine/pull/1653). They look quite similar to what's in [lhb's...
Hi @drisspg, thanks for your quick response! The error trace is as follows:
```
[rank0]:[titan] 2025-04-03 13:32:26,173 - root - INFO - Training starts at step 1.
[rank0]:/iopsstor/scratch/cscs/asolergi/mm/torchtitan/torchtitan/models/llama/model.py:260: UserWarning: Memory...
```