CI: frequency of hitting timeout/network errors has significantly increased recently
This can happen during:
- pip install
  - Ex: #483
- fetching artifacts from GitHub
  - Ex: https://github.com/NVIDIA/cuda-python/actions/runs/13623473149/job/38077154585#step:10:219
xref: https://github.com/NVIDIA/cuda-python/actions/runs/14048031704?pr=517
xref: https://github.com/NVIDIA/cuda-python/actions/runs/14087083558/job/39461464660?pr=503
It took 4 reruns until all tests passed.
The current situation is quite disruptive, especially when I need to weed out real failures. The generic timeout/network errors act as decoys.
We've observed no more network issues lately! According to @ajschmidt8:
> Most likely moving the V100s from RDS Lab to NVKS resolved the network issues. The NVKS cluster is in a different networking environment that seems much more stable than RDS Lab. Hopefully it stays that way!