Changjiang GOU issues

Results: 6 issues of Changjiang GOU

It takes too long to compile the train and eval process when reproducing muNet on 8 A100 GPUs

Hi there, I am reproducing muNet on 8 A100 GPUs. Compared to running it on a Colab TPUv2 with 8 cores, it takes too long to compile each child model. XLA...

[BUG]: an instance of 'c10::CUDAError' initialization error

### 🐛 Describe the bug I encountered this problem when running examples/language/gpt/titans/train_gpt.py using the real data provided by the example. This problem only occurs when we set the argument 'num_workers'...

bug

Why does the second matrix of the MLP layer have the same shape as the first one?

It's more of a question than an issue. The tensor [w2](https://github.com/stanford-futuredata/megablocks/blob/main/megablocks/layers/mlp.py#L341C9-L341C50) of class SparseMLP has the same shape as w1. Is it because of the DSD operation? Like, it requires...

question

[BUG]: Cannot copy out of meta tensor; no data!

Hi dear torchrec developers. I found a fatal bug when using EmbeddingCollection. The full stack is ``` [rank0]: File "/home/admin/hippo/worker/slave/aop_418921_aop_launcher_job_temp_m_20250528093245_6524584_job.worker_0_57_12/train/test_ebd.py", line 44, in [rank0]: main() [rank0]: File "/home/admin/hippo/worker/slave/aop_418921_aop_launcher_job_temp_m_20250528093245_6524584_job.worker_0_57_12/train/test_ebd.py", line 36,...

[Documentation] More accurate description of the behavior of KeyedJaggedTensor

I found an interesting phenomenon that could be enhanced when using KeyedJaggedTensor. ``` import torch from torchrec.sparse.jagged_tensor import JaggedTensor, KeyedJaggedTensor values = [ torch.Tensor([1.0]), torch.Tensor(), torch.Tensor([7.0, 8.0]), torch.Tensor([10.0, 11.0, 12.0]),...

Training crashes at logging_steps when running on a single GPU

### System Info - `transformers` version: 4.57.3 - Platform: Linux-6.6.97+-x86_64-with-glibc2.35 - Python version: 3.11.11 - Huggingface_hub version: 0.36.0 - Safetensors version: 0.7.0 - Accelerate version: 1.12.0 - Accelerate config: not...

bug