[Bug] Training fails with Triton OutOfMemoryError (out of resource) on 8x L20
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
GPU memory usage isn't particularly high, but training still throws an OutOfMemoryError. What's causing this?
Error:
[rank3]:   self._make_launchers()
[rank3]:   File "/opt/miniconda/envs/sglang/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 573, in _make_launchers
[rank3]:     raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
[rank3]: torch._inductor.exc.InductorError: RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_tem_fused_0 Required: 107008, Hardware limit: 101376. Reducing block sizes or num_stages may help.
Reproduction
torchrun \
    --standalone \
    --nproc_per_node 8 \
    scripts/train_eagle3_online.py \
    --target-model-path /models/Qwen3-14B \
    --train-data-path dataset/opc_train_100000.jsonl \
    --output-dir outputs/Qwen3-14B-eagle3 \
    --num-epochs 10 \
    --batch-size 1 \
    --learning-rate 1e-4 \
    --max-length 2048 \
    --chat-template qwen \
    --cache-dir cache \
    --tp-size 8 \
    2>&1 | tee train.log
Environment
8x NVIDIA L20 GPUs
Can you try the following change? If this works on the L20, I will raise a PR to fix the kernel options.
# Smaller tile sizes so the fused flex-attention kernel stays under the
# shared-memory limit reported in the error (Required: 107008, Hardware limit: 101376).
kernel_options = {
    "BLOCK_M": 32,
    "BLOCK_N": 32,
    "BLOCK_M1": 32,
    "BLOCK_N1": 32,
    "BLOCK_M2": 32,
    "BLOCK_N2": 32,
}
attn_output = flex_attention_func(
    query=query_states,
    key=key_cache.contiguous(),
    value=value_cache.contiguous(),
    block_mask=block_mask,
    enable_gqa=True,
    kernel_options=kernel_options,
)
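
If you want to isolate the kernel change from the training script first, here is a minimal standalone sketch (illustrative head counts and dimensions, not the actual Qwen3-14B configuration; requires PyTorch 2.5+ with CUDA) that compiles flex_attention with the same reduced tile sizes and runs one forward/backward pass:

# Standalone sanity check with assumed shapes (not the SpecForge code path).
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H_Q, H_KV, S, D = 1, 8, 2, 2048, 128  # assumed head counts / dims
q = torch.randn(B, H_Q, S, D, device="cuda", dtype=torch.bfloat16, requires_grad=True)
k = torch.randn(B, H_KV, S, D, device="cuda", dtype=torch.bfloat16, requires_grad=True)
v = torch.randn(B, H_KV, S, D, device="cuda", dtype=torch.bfloat16, requires_grad=True)

block_mask = create_block_mask(causal, B, H_Q, S, S, device="cuda")
compiled_flex_attention = torch.compile(flex_attention)

kernel_options = {
    "BLOCK_M": 32, "BLOCK_N": 32,    # forward tiles
    "BLOCK_M1": 32, "BLOCK_N1": 32,  # backward tiles
    "BLOCK_M2": 32, "BLOCK_N2": 32,
}
out = compiled_flex_attention(
    q, k, v,
    block_mask=block_mask,
    enable_gqa=True,
    kernel_options=kernel_options,
)
out.sum().backward()  # exercises the backward kernel, which uses the BLOCK_*1 / BLOCK_*2 tiles
print(out.shape)      # torch.Size([1, 8, 2048, 128])
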
Yep! I tried it, and it runs properly now. Thank you.
[rank1]:W0830 09:42:25.606000 56327 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] Graph break: from user code at:
[rank1]:W0830 09:42:25.606000 56327 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] File "/opt/miniconda/envs/sglang/lib/python3.12/site-packages/specforge/core/eagle3.py", line 777, in _compute_metric_acc
[rank1]:W0830 09:42:25.606000 56327 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] ).sum().item() / (loss_mask.sum().item() + 1e-6)
[rank1]:W0830 09:42:25.606000 56327 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0]
[rank1]:W0830 09:42:25.606000 56327 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0]
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] Graph break from Tensor.item(), consider setting:
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] torch._dynamo.config.capture_scalar_outputs = True
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] or:
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] env TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] to include these operations in the captured graph.
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0]
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] Graph break: from user code at:
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] File "/opt/miniconda/envs/sglang/lib/python3.12/site-packages/specforge/core/eagle3.py", line 777, in _compute_metric_acc
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0] ).sum().item() / (loss_mask.sum().item() + 1e-6)
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0]
[rank7]:W0830 09:42:25.667000 56333 site-packages/torch/_dynamo/variables/tensor.py:1047] [7/0]
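
Side note on the graph-break warnings above: they come from the .item() calls in _compute_metric_acc and only make torch.compile fall back to eager around that metric code, so they do not affect correctness. If you want to silence them, the warning itself suggests enabling scalar-output capture, e.g.:

# Optional: capture Tensor.item() inside the compiled graph instead of
# graph-breaking on it (as suggested by the Dynamo warning above).
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True
# Equivalently, set the environment variable before launching:
#   TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1 torchrun ...
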
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Training Epoch 0: 100%|██████████| 100000/100000 [5:00:17<00:00, 5.55it/s]
Train Epoch [1/10], position 0, Acc: 0.00
Train Epoch [1/10], position 1, Acc: 0.00
Train Epoch [1/10], position 2, Acc: 0.00
Train Epoch [1/10], position 3, Acc: 0.00
Train Epoch [1/10], position 4, Acc: 0.00
Train Epoch [1/10], position 5, Acc: 0.00
Train Epoch [1/10], position 6, Acc: 0.00
Train Epoch [1/10], position 0, pLoss: 8.15
Train Epoch [1/10], position 1, pLoss: 8.15
Train Epoch [1/10], position 2, pLoss: 8.15
Train Epoch [1/10], position 3, pLoss: 8.15
Train Epoch [1/10], position 4, pLoss: 8.15
Train Epoch [1/10], position 5, pLoss: 8.15
Train Epoch [1/10], position 6, pLoss: 8.15
Is this normal? The accuracy is 0 at every position.
@ggg-s I suspect the torch compile of flex attention has issues on the L20. Can you use --attention-backend sdpa for your training?
Okay. I will try it.
On the 4090, I successfully started training using the modification above, and --attention-backend sdpa also started successfully. Will these two methods have any impact on the results?
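
Regarding the impact on results: assuming --attention-backend sdpa maps to torch.nn.functional.scaled_dot_product_attention, both backends compute the same standard softmax attention, so results should match up to small numerical differences between kernels. A minimal sketch of the SDPA call with the same illustrative shapes as above (an assumption, not the actual SpecForge code path):

# Illustrative SDPA call, assuming the sdpa backend wraps
# torch.nn.functional.scaled_dot_product_attention.
import torch
import torch.nn.functional as F

B, H_Q, H_KV, S, D = 1, 8, 2, 2048, 128  # assumed shapes
q = torch.randn(B, H_Q, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H_KV, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H_KV, S, D, device="cuda", dtype=torch.bfloat16)

# enable_gqa (PyTorch 2.5+) shares the 2 KV heads across the 8 query heads.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
print(out.shape)  # torch.Size([1, 8, 2048, 128])
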