Fast-LLM
Parallel tests v2
✨ Description
A simplified version of #273, where resources are allocated statically to each worker. This works fine, with some big caveats:
- Multi-GPU tests and spawned processes run at unpredictable times and ignore memory limits, so OOMs are possible with many workers. On my 4-GPU runs the limit is around 20 workers.
- Tests with dependencies are skipped if they don't run in the same worker as their dependencies. I'm planning to fix this with smart scheduling in a follow-up PR. The skipped tests include most multi-GPU tests, which means the 20-worker limit above is likely an overestimate. [Edit: found a really simple solution, working on it.]
- Tests have a hard-coded memory limit of 5 GB (though spawned processes ignore it). All current tests seem OK with this, so it's fine for now.
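For illustration, here is a minimal sketch of what static per-worker allocation could look like under pytest-xdist. The `PYTEST_XDIST_WORKER` environment variable is real pytest-xdist behavior, but the round-robin GPU mapping, the `num_gpus` default, and the `resource`-based 5 GB cap are assumptions for the sketch, not this PR's actual implementation:

```python
import os
import resource  # POSIX-only

MEMORY_LIMIT = 5 * 2**30  # 5 GiB, matching the hard-coded limit above


def gpu_for_worker(worker_id: str, num_gpus: int) -> int:
    """Map a pytest-xdist worker id ("gw0", "gw1", ...) to a GPU index, round-robin."""
    index = int(worker_id.removeprefix("gw"))
    return index % num_gpus


def setup_worker(num_gpus: int = 4) -> None:
    """Statically pin this worker to one GPU and cap its host memory."""
    # pytest-xdist sets PYTEST_XDIST_WORKER to "gw0", "gw1", ... in each worker.
    worker_id = os.environ.get("PYTEST_XDIST_WORKER", "gw0")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_for_worker(worker_id, num_gpus))
    # Cap the worker's address space; spawned child processes may or may not
    # end up covered by this, depending on how they are launched.
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT, hard))
```

With 4 GPUs and more than 4 workers, several workers share each GPU, which is consistent with the OOM risk described above once the worker count grows.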
This PR isn't that useful on its own given the skipped tests, but it's a good step forward. I suggest merging right away to keep PRs small, and doing the rest in follow-up PRs.
🔍 Type of change
Select all that apply:
- [ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
- [x] 🚀 New feature (non-breaking change that adds functionality)
- [ ] ⚠️ Breaking change (a change that could affect existing functionality)
- [x] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
- [x] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
- [ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
- [ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
- [ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)