
[Prototype] Run tests in parallel

Open jlamypoirier opened this issue 9 months ago • 0 comments

✨ Description

This PR allows running tests in parallel across all available GPUs, so we can run many tests quickly. Pytest-xdist already works relatively well, but it puts everything on the first GPU(s), which risks OOMs, port conflicts and other issues. I made a simple allocation and locking mechanism to prevent such issues, adapted from pytest-xdist-lock.

The system works in a few steps:

  • Tests request a certain number of GPUs, amount of GPU memory and ports through the get_test_resources mark or a specialized decorator such as requires_cuda.
  • The lock adapter safely allocates the GPU(s). It sets the default device to the first allocated one and restricts GPU usage through set_per_process_memory_fraction (5 GB by default for requested devices, 0 for other GPUs), which is good enough for many tests.
  • For simple tests nothing more is needed, but more complex ones need to know the allocated GPUs and ports, which they get through the get_test_resources fixture. This includes fast-llm runs and distributed configs, for which I added config options and the get_distributed_config fixture, and Megatron runs, which use CUDA_VISIBLE_DEVICES.
  • Once the test is done, the lock adapter checks that the allocation was respected, ensures that the GPU memory is de-allocated, and unlocks the resources for other tests.
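
The allocate, restrict and unlock cycle described above could look roughly like the following. This is a minimal sketch, not the code in this PR: it assumes one lock file per GPU in a directory shared by all workers, and `allocate_gpus`, `memory_fractions` and `apply_memory_limits` are hypothetical names chosen for illustration. Only `torch.cuda.set_per_process_memory_fraction` is a real PyTorch call, and it is imported lazily so the rest runs without a GPU.

```python
import contextlib
import os
import tempfile

GIB = 2**30


@contextlib.contextmanager
def allocate_gpus(num_gpus, total_gpus, lock_dir):
    """Hypothetical helper: hold one lock file per claimed GPU index."""
    held = []  # (device index, lock file path) pairs owned by this worker
    try:
        for index in range(total_gpus):
            if len(held) == num_gpus:
                break
            path = os.path.join(lock_dir, f"gpu_{index}.lock")
            try:
                # O_CREAT | O_EXCL fails atomically if the lock file already exists.
                os.close(os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
            except FileExistsError:
                continue  # another worker holds this GPU
            held.append((index, path))
        if len(held) < num_gpus:
            # Assumption: a real adapter would wait and retry instead of failing.
            raise RuntimeError(f"only {len(held)} of {num_gpus} GPUs are free")
        yield [index for index, _ in held]
    finally:
        # Release the locks on success, test failure, or a failed allocation.
        for _, path in held:
            os.remove(path)


def memory_fractions(allocated, total_bytes_per_gpu, limit_bytes=5 * GIB):
    """Capped memory share for allocated GPUs, 0 for every other GPU."""
    return {
        index: min(1.0, limit_bytes / total) if index in allocated else 0.0
        for index, total in enumerate(total_bytes_per_gpu)
    }


def apply_memory_limits(fractions):
    """Apply the fractions with PyTorch; only meaningful on a CUDA machine."""
    import torch  # imported lazily so the rest of the sketch runs without PyTorch

    for index, fraction in fractions.items():
        torch.cuda.set_per_process_memory_fraction(fraction, index)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lock_dir:
        with allocate_gpus(2, 4, lock_dir) as first, allocate_gpus(2, 4, lock_dir) as second:
            print(first, second)  # [0, 1] [2, 3]
            print(memory_fractions(first, [80 * GIB] * 4))
```

Two concurrent requests for two GPUs each land on disjoint devices, and the `finally` clause releases the locks even when a test raises. Checking that the allocation was respected and that memory was freed is left out of the sketch.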

What remains is to ensure that dependencies between tests are respected (i.e. that pytest-xdist and pytest-depends are compatible enough), and that shared resource files (e.g. the test dataset) are parallel-safe.

I got things to a relatively stable state up to ~20 workers, but things start to break above that. It's still enough to reduce the slow tests from 8 minutes to ~2 minutes, most of which comes from parallel overhead (~1 minute) and the slowest test (~40 s), so it leaves room for lots of extra tests.

🔍 Type of change

Select all that apply:

  • [ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • [ ] 🚀 New feature (non-breaking change that adds functionality)
  • [ ] ⚠️ Breaking change (a change that could affect existing functionality)
  • [x] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • [ ] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • [ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • [ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
  • [x] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

jlamypoirier · May 16 '25 23:05