CI: frequency of hitting timeout/network errors has significantly increased recently

Open leofang opened this issue 10 months ago • 2 comments

These timeouts and network errors can happen during:

  • pip install
    • Ex: #483
  • fetching artifacts from GitHub
    • Ex: https://github.com/NVIDIA/cuda-python/actions/runs/13623473149/job/38077154585#step:10:219

leofang, Mar 03 '25 04:03

xref: https://github.com/NVIDIA/cuda-python/actions/runs/14048031704?pr=517

leofang, Mar 25 '25 01:03

xref: https://github.com/NVIDIA/cuda-python/actions/runs/14087083558/job/39461464660?pr=503

It took 4 reruns until all tests passed.

The current situation is quite disruptive, especially when I need to weed out real failures: the timeout and network errors act as decoys that hide them.

rwgk, Mar 26 '25 17:03

We've observed no more network issues lately! According to @ajschmidt8:

Most likely moving the V100s from RDS Lab to NVKS resolved the network issues. The NVKS cluster is in a different networking environment that seems much more stable than RDS Lab. Hopefully it stays that way!

leofang, Apr 22 '25 17:04