cuda-python CI: frequency of hitting timeout/network errors has significantly increased recently

This can happen during:

- pip install
  - Ex: #483
- fetching artifacts from GitHub
  - Ex: https://github.com/NVIDIA/cuda-python/actions/runs/13623473149/job/38077154585#step:10:219

Mar 03 '25 04:03 leofang

xref: https://github.com/NVIDIA/cuda-python/actions/runs/14048031704?pr=517

Mar 25 '25 01:03 leofang

xref: https://github.com/NVIDIA/cuda-python/actions/runs/14087083558/job/39461464660?pr=503

It took 4 reruns until all tests passed.

The current situation is quite disruptive, especially when I need to weed out real failures: these generic network errors act as decoys.

Mar 26 '25 17:03 rwgk

We haven't seen any more network issues lately! According to @ajschmidt8:

> Most likely moving the V100s from RDS Lab to NVKS resolved the network issues. The NVKS cluster is in a different networking environment that seems much more stable than RDS Lab. Hopefully it stays that way!

Apr 22 '25 17:04 leofang