Functional Tests for Ext-Profiler Plugin
Details
Do not mention proprietary info or link to internal work items in this PR.
Work item: "Internal"
What were the changes?
Added a set of functional tests for the Profiler plugin covering the following operations:
Collective Operations:
- AllReduce
- Broadcast
- Reduce
- ReduceScatter
P2P Operations:
- AllGather
- AllToAll
- SendRecv
Why were the changes made?
The tests are added to verify the behavior of the Profiler Plugin across different configurations, topologies, and operations.
How was the outcome achieved?
The tests cover the following scenarios:
- Profiler initialization and basic functionality
- Invalid event mask value
- Single-node detailed profiling
- Multi-node detailed profiling
Additional Documentation:
The test suite includes a detailed README with setup instructions, test execution commands, and pytest markers for selective test running.
Approval Checklist
Do not approve until these items are satisfied.
- [ ] Verify the CHANGELOG has been updated, if
  - there are any NCCL API version changes,
  - any changes impact library users, and/or
  - any changes impact any other ROCm library.
In the test files under test/ext-plugins/tests/ext-profiler, each test has a "# Remove any existing trace files" section. Would it be worth using a @pytest.fixture to keep that common code in one place for all tests in a file?
Looking at the validation steps that run after the function under test, there are enough variations that it may not be possible or easy to roll them into the fixture as a post-run step.
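To make the suggestion concrete, here is a minimal sketch of what the shared cleanup could look like. The names `TRACE_DIR`, `remove_trace_files`, and `clean_trace_dir`, as well as the `*.json` pattern, are hypothetical placeholders and are not taken from the PR; the real tests would substitute their own dump directory and file pattern. Only the pre-run cleanup moves into the fixture, and each test keeps its own validation.

```python
import glob
import os

import pytest

# Hypothetical location and pattern; replace with what the tests actually use.
TRACE_DIR = "/tmp/ext_profiler_traces"
TRACE_PATTERN = "*.json"


def remove_trace_files(trace_dir, pattern=TRACE_PATTERN):
    """Delete files matching `pattern` in `trace_dir` and return the removed paths."""
    removed = []
    for path in glob.glob(os.path.join(str(trace_dir), pattern)):
        os.remove(path)
        removed.append(path)
    return sorted(removed)


@pytest.fixture
def clean_trace_dir():
    """Run the common cleanup before a test and hand the directory to it."""
    remove_trace_files(TRACE_DIR)
    return TRACE_DIR


def test_example(clean_trace_dir):
    # The function under test would run here; validation stays per-test.
    assert glob.glob(os.path.join(clean_trace_dir, TRACE_PATTERN)) == []
```

Keeping the cleanup in a plain helper means it can also be called directly from tests that need a different directory or pattern.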
Regarding the P2PTests.cpp test, please check with @atulkulk: he has made extensive changes to this file and moved it to a different location to support the multi-instance MPI tests. The changes to this file in this PR are minimal, but you may want to keep an eye out for the major change that @atulkulk has made as part of that effort.