[Bug] Training fails with Triton OutOfMemoryError (out of resource) on 8x L20
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
GPU memory usage isn't particularly high, but training still throws an OutOfMemoryError. What's causing this?
Error:
[rank3]:   self._make_launchers()
[rank3]:   File "/opt/miniconda/envs/sglang/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 573, in _make_launchers
[rank3]:     raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
[rank3]: torch._inductor.exc.InductorError: RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_tem_fused_0 Required: 107008, Hardware limit: 101376. Reducing block sizes or num_stages may help.
Reproduction
torchrun \
    --standalone \
    --nproc_per_node 8 \
    scripts/train_eagle3_online.py \
    --target-model-path /models/Qwen3-14B \
    --train-data-path dataset/opc_train_100000.jsonl \
    --output-dir outputs/Qwen3-14B-eagle3 \
    --num-epochs 10 \
    --batch-size 1 \
    --learning-rate 1e-4 \
    --max-length 2048 \
    --chat-template qwen \
    --cache-dir cache \
    --tp-size 8 \
    2>&1 | tee train.log
Environment
8x NVIDIA L20 GPUs
Can you try the following change? If this works on the L20, I will raise a PR to fix the kernel options.
# Smaller tile sizes so the fused flex-attention kernel stays under the
# shared-memory limit reported in the error (Required: 107008, Hardware limit: 101376).
kernel_options = {
    "BLOCK_M": 32,
    "BLOCK_N": 32,
    "BLOCK_M1": 32,
    "BLOCK_N1": 32,
    "BLOCK_M2": 32,
    "BLOCK_N2": 32,
}
attn_output = flex_attention_func(
    query=query_states,
    key=key_cache.contiguous(),
    value=value_cache.contiguous(),
    block_mask=block_mask,
    enable_gqa=True,
    kernel_options=kernel_options,
)
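
If you want to isolate the kernel change from the training script first, here is a minimal standalone sketch (illustrative head counts and dimensions, not the actual Qwen3-14B configuration; requires PyTorch 2.5+ with CUDA) that compiles flex_attention with the same reduced tile sizes and runs one forward/backward pass:

# Standalone sanity check with assumed shapes (not the SpecForge code path).
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H_Q, H_KV, S, D = 1, 8, 2, 2048, 128  # assumed head counts / dims
q = torch.randn(B, H_Q, S, D, device="cuda", dtype=torch.bfloat16, requires_grad=True)
k = torch.randn(B, H_KV, S, D, device="cuda", dtype=torch.bfloat16, requires_grad=True)
v = torch.randn(B, H_KV, S, D, device="cuda", dtype=torch.bfloat16, requires_grad=True)

block_mask = create_block_mask(causal, B, H_Q, S, S, device="cuda")
compiled_flex_attention = torch.compile(flex_attention)

kernel_options = {
    "BLOCK_M": 32, "BLOCK_N": 32,    # forward tiles
    "BLOCK_M1": 32, "BLOCK_N1": 32,  # backward tiles
    "BLOCK_M2": 32, "BLOCK_N2": 32,
}
out = compiled_flex_attention(
    q, k, v,
    block_mask=block_mask,
    enable_gqa=True,
    kernel_options=kernel_options,
)
out.sum().backward()  # exercises the backward kernel, which uses the BLOCK_*1 / BLOCK_*2 tiles
print(out.shape)      # torch.Size([1, 8, 2048, 128])
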
Yep! I tried it, and it runs properly now. Thank you.
[rank1]:W0830 09:42:25.606000 56327 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] Graph break: from user code at:
[rank1]:W0830 09:42:25.606000 56327 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] File "/opt/miniconda/envs/sglang/lib/python3.12/site-packages/specforge/core/eagle3.py", line 777, in _compute_metric_acc
[rank1]:W0830 09:42:25.606000 56327 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] ).sum().item() / (loss_mask.sum().item() + 1e-6)
[rank1]:W0830 09:42:25.606000 56327 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0]
[rank1]:W0830 09:42:25.606000 56327 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0]
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] Graph break from Tensor.item(), consider setting:
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] torch._dynamo.config.capture_scalar_outputs = True
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] or:
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] env TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] to include these operations in the captured graph.
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0]
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] Graph break: from user code at:
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] File "/opt/miniconda/envs/sglang/lib/python3.12/site-packages/specforge/core/eagle3.py", line 777, in _compute_metric_acc
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] ).sum().item() / (loss_mask.sum().item() + 1e-6)
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0]
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0]
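
Side note on the graph-break warnings above: they come from the .item() calls in _compute_metric_acc and only make torch.compile fall back to eager around that metric code, so they do not affect correctness. If you want to silence them, the warning itself suggests enabling scalar-output capture, e.g.:

# Optional: capture Tensor.item() inside the compiled graph instead of
# graph-breaking on it (as suggested by the Dynamo warning above).
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True
# Equivalently, set the environment variable before launching:
#   TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1 torchrun ...
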
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Train Epoch [1/10], position 0, Acc: 0.00
Train Epoch [1/10], position 1, Acc: 0.00
Train Epoch [1/10], position 2, Acc: 0.00
Train Epoch [1/10], position 3, Acc: 0.00
Train Epoch [1/10], position 4, Acc: 0.00
Train Epoch [1/10], position 5, Acc: 0.00
Train Epoch [1/10], position 6, Acc: 0.00
Train Epoch [1/10], position 0, pLoss: 8.15
Train Epoch [1/10], position 1, pLoss: 8.15
Train Epoch [1/10], position 2, pLoss: 8.15
Train Epoch [1/10], position 3, pLoss: 8.15
Train Epoch [1/10], position 4, pLoss: 8.15
Train Epoch [1/10], position 5, pLoss: 8.15
Train Epoch [1/10], position 6, pLoss: 8.15
Is this normal? The accuracy is 0 at every position.
@ggg-s I suspect the torch compile of flex attention has issues on the L20. Can you use --attention-backend sdpa for your training?
Okay. I will try it.
On the 4090, I successfully started training using the modification above, and --attention-backend sdpa also started successfully. Will these two methods have any impact on the results?
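
Regarding the impact on results: assuming --attention-backend sdpa maps to torch.nn.functional.scaled_dot_product_attention, both backends compute the same standard softmax attention, so results should match up to small numerical differences between kernels. A minimal sketch of the SDPA call with the same illustrative shapes as above (an assumption, not the actual SpecForge code path):

# Illustrative SDPA call, assuming the sdpa backend wraps
# torch.nn.functional.scaled_dot_product_attention.
import torch
import torch.nn.functional as F

B, H_Q, H_KV, S, D = 1, 8, 2, 2048, 128  # assumed shapes
q = torch.randn(B, H_Q, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H_KV, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H_KV, S, D, device="cuda", dtype=torch.bfloat16)

# enable_gqa (PyTorch 2.5+) shares the 2 KV heads across the 8 query heads.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
print(out.shape)  # torch.Size([1, 8, 2048, 128])
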