[Prototype] Run tests in parallel
## ✨ Description
Allows running tests in parallel across all available GPUs so we can run lots of tests fast. pytest-xdist already works reasonably well, but it puts everything on the first GPU(s), risking OOMs, port conflicts and other issues. I made a simple allocation and locking mechanism to prevent such issues, adapted from pytest-xdist-lock.
The system comes in a few steps:
- Tests request a certain amount of GPUs, GPU memory and ports through the `get_test_resources` mark or a specialized decorator such as `requires_cuda`.
- The lock adapter safely allocates the GPU(s). It sets the default device to the first allocated one and restricts GPU usage through `set_per_process_memory_fraction` (5 GB by default for requested devices, 0 for other GPUs), which is good enough for many tests.
- For simple tests nothing more is needed, but more complex ones need to know the allocated GPUs and ports, which they get through the `get_test_resources` fixture. This includes Fast-LLM runs and distributed configs, for which I added config options and the `get_distributed_config` fixture, and Megatron runs, which use `CUDA_VISIBLE_DEVICES`.
- Once a test is done, the lock adapter checks that the allocation was respected, ensures that the GPU memory is de-allocated, and unlocks the resources for other tests.
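The per-device memory cap in the second step amounts to translating a requested byte budget into the fraction that `torch.cuda.set_per_process_memory_fraction` expects. A hedged sketch (the helper name is invented; only the 5 GB default comes from the description):

```python
DEFAULT_REQUEST_BYTES = 5 * 1024**3  # 5 GB default for requested devices


def memory_fraction(total_bytes: int, requested_bytes: int = DEFAULT_REQUEST_BYTES) -> float:
    """Fraction of a GPU's total memory to grant the current process,
    suitable for torch.cuda.set_per_process_memory_fraction.
    Unallocated GPUs would simply get 0.0. Clamped to [0, 1]."""
    return min(max(requested_bytes / total_bytes, 0.0), 1.0)
```

For example, the 5 GB default on an 80 GiB device yields a fraction of 0.0625, while a request larger than the device clamps to the whole GPU.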
What remains is to ensure that dependencies between tests are respected (i.e. that pytest-xdist and pytest-depends are compatible enough), and that shared resource files (e.g. test datasets) are parallel-safe.
I got things to a relatively stable state with up to ~20 workers, but things start to break beyond that. It's still enough to reduce the slow tests from 8 minutes to ~2 minutes, most of which comes from parallel overhead (~1 minute) and the slowest test (~40 s), so it leaves room for lots of extra tests.
## 🔍 Type of change
Select all that apply:
- [ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
- [ ] 🚀 New feature (non-breaking change that adds functionality)
- [ ] ⚠️ Breaking change (a change that could affect existing functionality)
- [x] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
- [ ] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
- [ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
- [ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
- [x] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)