Antoni-Joan Solergibert
@XinDongol Why would you shuffle the dataset with that seed? Now that [Stateful DataLoaders](https://github.com/pytorch/torchtitan/pull/279) are about to merge, you won't be able to resume training properly after a crash because you...
Thanks for your answer @tianyu-l, it makes sense 😅 I was wondering, is there any way to avoid `.skip()` when resuming training? In my setup (& Colab), skipping 10000000 samples...
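For context, here is a minimal sketch of the kind of alternative I had in mind, using torchdata's `StatefulDataLoader` to checkpoint and restore the dataloader state instead of skipping consumed samples. This is not torchtitan's actual implementation; the dataset and file paths are placeholders.

```python
# Sketch (not torchtitan's code): resume a dataloader by restoring its state
# instead of calling .skip() over millions of already-seen samples.
import torch
from torch.utils.data import Dataset
from torchdata.stateful_dataloader import StatefulDataLoader

class RangeDataset(Dataset):  # toy stand-in for the real streaming dataset
    def __len__(self):
        return 10_000_000
    def __getitem__(self, i):
        return i

loader = StatefulDataLoader(RangeDataset(), batch_size=8, num_workers=2)

# ... consume some batches during training, then save the loader state
# alongside the model checkpoint:
torch.save(loader.state_dict(), "dataloader_state.pt")

# On resume, restore the state; iteration continues where it left off:
resumed = StatefulDataLoader(RangeDataset(), batch_size=8, num_workers=2)
resumed.load_state_dict(torch.load("dataloader_state.pt", weights_only=False))
```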
Any news regarding this issue? I'm experiencing the same behaviour during multi-node training with a shared filesystem. Toni
Check [4.](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features). I solved this issue by adding the `--no-build-isolation` flag!
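Concretely, the invocation that worked for me (installing the `flash-attn` package from PyPI, as the README suggests):

```
pip install flash-attn --no-build-isolation
```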
Hi @rlrs! Could you share the script you used to convert the weights from HF to DCP? Thanks!
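In case it helps anyone landing here, a rough sketch of what such a conversion could look like. This is a guess at the general shape, not the script asked about above; the repo id and output path are placeholders, and parameter names will likely need remapping to torchtitan's expected `state_dict` layout.

```python
# Rough sketch: load a HF Llama checkpoint and re-save it with
# torch.distributed.checkpoint (DCP). Key names may need remapping.
import torch
import torch.distributed.checkpoint as dcp
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)
state_dict = model.state_dict()

# dcp.save falls back to single-process mode when torch.distributed
# is not initialized.
dcp.save(state_dict, checkpoint_id="./llama3_8b_dcp")
```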
I've just found out that it works **if you install** the dependencies as in point 1 of [this post](https://www.philschmid.de/fine-tune-google-gemma). I ran the following to set up the environment: ``` pip install...
There is a `tokenizer.model` file in the Hugging Face checkpoints, under the `original/` folder; check https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/original/tokenizer.model
> > There is a tokenizer.model file in the Hugging Face Checkpoints under the /original folder, check https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/original/tokenizer.model
>
> This doesn't work directly because this model file can get...
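If it helps, a small sketch of fetching that file programmatically with `huggingface_hub` (the repo is gated, so you first need to accept the license and authenticate, e.g. via `huggingface-cli login`):

```python
# Sketch: download original/tokenizer.model from the gated Llama 3 repo.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    filename="original/tokenizer.model",
)
print(path)  # local cache path to tokenizer.model
```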
Hi @lk137095576, just adding my two cents here.

Regarding (3): The TE changes were merged a few months ago in [this PR](https://github.com/NVIDIA/TransformerEngine/pull/1653). They look quite similar to what's in [lhb's...
Hi @drisspg, thanks for your quick response! The error trace is as follows:
```
[rank0]:[titan] 2025-04-03 13:32:26,173 - root - INFO - Training starts at step 1.
[rank0]:/iopsstor/scratch/cscs/asolergi/mm/torchtitan/torchtitan/models/llama/model.py:260: UserWarning: Memory...
```