RL icon indicating copy to clipboard operation
RL copied to clipboard

fix: remove sft-qwen2.5-fsdp2tp8sp from nighlies

Open ahmadki opened this issue 5 months ago β€’ 1 comments

What does this PR do ?

sequence parallel + tp_size >1 is currently broken in torch==2.8.0, no point in running the test as it will raise an exception

Summary by CodeRabbit

  • Tests
    • Disabled a test pending resolution of a known compatibility issue.

✏️ Tip: You can customize this high-level summary in your review settings.

ahmadki avatar Nov 20 '25 21:11 ahmadki

πŸ“ Walkthrough

Walkthrough

A single test case for sft-qwen2.5-32b in the nightly test suite is disabled with a TODO note referencing a torch==2.8.0 compatibility issue. The actual test invocation line is removed and replaced with a commented explanation.

Changes

Cohort / File(s) Summary
Test suite maintenance
tests/test_suites/nightly.txt
Disabled functional 32-bit test for sft-qwen2.5-32b; removed test invocation and added TODO comment noting torch==2.8.0 blocker

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

  • Single-file change with minimal scope
  • Simple documentation/disabling of a test case
  • Clear explanation provided via TODO comment

Possibly related PRs

  • NVIDIA-NeMo/RL#1334: Overlapping edits to the same nightly test-suite file affecting the same 32B test entries.

Suggested labels

CI

Suggested reviewers

  • parthchadha
  • guyueh1
  • yfw

Pre-merge checks and finishing touches

βœ… Passed checks (4 passed)
Check name Status Explanation
Description Check βœ… Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check βœ… Passed The title 'fix: remove sft-qwen2.5-fsdp2tp8sp from nighlies' accurately describes the main change: disabling/removing a nightly test due to a torch 2.8.0 compatibility issue.
Docstring Coverage βœ… Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Test Results For Major Changes βœ… Passed PR removes a broken test in nightly suite due to known torch==2.8.0 issue. Minor maintenance change, not a major code change requiring test results documentation.
✨ Finishing touches
πŸ§ͺ Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment
  • [ ] Commit unit tests in branch ahmadki/disable_llm_sft_qwen2_5_32b_4n8g_fsdp2tp8sp_actckpt_v3

πŸ“œ Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 1c371a9b34a703ac66fdc2d80b536f5fad99e616 and 3e8526c433973a02e06a8786f5fe2078eb02f41a.

πŸ“’ Files selected for processing (1)
  • tests/test_suites/nightly.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
πŸ“š Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/nightly.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
πŸ”‡ Additional comments (1)
tests/test_suites/nightly.txt (1)

68-69: Clear documentation and correct pattern for test disablement.

The change correctly disables the sft-qwen2.5-32b test with a well-formatted TODO comment that explains the root cause and references the tracking issue. This follows the established pattern in the file for disabled tests (see lines 26–27, 48–49, 92–94).

The GitHub issue #652 exists and is open with the title "Qwen parallelizer with sequence parallelism," confirming the reference is valid and accessible.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❀️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Nov 20 '25 21:11 coderabbitai[bot]