[Prototype] Run tests in parallel
## ✨ Description
Allows running tests in parallel across all available GPUs so we can run lots of tests fast. pytest-xdist already works reasonably well, but it puts everything on the first GPU(s), risking OOMs, port conflicts and other issues. I made a simple allocation and locking mechanism to prevent such issues, adapted from pytest-xdist-lock.
The system comes in a few steps:
- Tests request a certain amount of GPUs, GPU memory and ports through the `get_test_resources` mark or a specialized decorator such as `requires_cuda`.
- The lock adapter safely allocates the GPU(s). It sets the default device to the first allocated one and restricts GPU usage through `set_per_process_memory_fraction` (5 GB by default for requested devices, 0 for other GPUs), which is good enough for many tests.
- For simple tests nothing more is needed, but more complex ones need to know the allocated GPUs and ports, which they get through the `get_test_resources` fixture. This includes Fast-LLM runs and distributed configs, for which I added config options and the `get_distributed_config` fixture, and Megatron runs, which use `CUDA_VISIBLE_DEVICES`.
- Once a test is done, the lock adapter checks that the allocation was respected, ensures that the GPU memory is de-allocated, and unlocks the resources for other tests.
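The per-device memory cap in the second step amounts to translating a requested byte budget into the fraction that `torch.cuda.set_per_process_memory_fraction` expects. A hedged sketch (the helper name is invented; only the 5 GB default comes from the description):

```python
DEFAULT_REQUEST_BYTES = 5 * 1024**3  # 5 GB default for requested devices


def memory_fraction(total_bytes: int, requested_bytes: int = DEFAULT_REQUEST_BYTES) -> float:
    """Fraction of a GPU's total memory to grant the current process,
    suitable for torch.cuda.set_per_process_memory_fraction.
    Unallocated GPUs would simply get 0.0. Clamped to [0, 1]."""
    return min(max(requested_bytes / total_bytes, 0.0), 1.0)
```

For example, the 5 GB default on an 80 GiB device yields a fraction of 0.0625, while a request larger than the device clamps to the whole GPU.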
What remains is to ensure that dependencies between tests are respected (i.e. that pytest-xdist and pytest-depends are compatible enough), and that shared resource files (e.g. test datasets) are parallel-safe.
I got things to a relatively stable state with up to ~20 workers, but things start to break beyond that. It's still enough to reduce the slow tests from 8 minutes to ~2 minutes, most of which comes from parallel overhead (~1 minute) and the slowest test (~40 s), so it leaves room for lots of extra tests.
## 🔍 Type of change
Select all that apply:
- [ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
- [ ] 🚀 New feature (non-breaking change that adds functionality)
- [ ] ⚠️ Breaking change (a change that could affect existing functionality)
- [x] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
- [ ] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
- [ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
- [ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
- [x] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)