Job `cmake_*` broken due to dependency issue
### 🐛 Describe the bug
The cmake_windows_cpu job started failing on 20220912 with:
```
Specifications:

  - pytorch=1.13.0.dev20220912 -> python[version='>=3.10,<3.11.0a0|>=3.9,<3.10.0a0|>=3.8,<3.9.0a0']

Your python: python=3.7

If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.

The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package intel-openmp conflicts for:
numpy -> mkl[version='>=2021.4.0,<2022.0a0'] -> intel-openmp[version='2021.*|2022.*']
pytorch=1.13.0.dev20220912 -> mkl[version='>=2018'] -> intel-openmp[version='2021.*|2022.*']
pytorch=1.13.0.dev20220912 -> intel-openmp

Package ca-certificates conflicts for:
python=3.7 -> openssl[version='>=1.1.1n,<1.1.2a'] -> ca-certificates
numpy -> python[version='>=2.7,<2.8.0a0'] -> ca-certificates

Package mkl conflicts for:
numpy -> mkl[version='>=2018.0.0,<2019.0a0|>=2018.0.1,<2019.0a0|>=2018.0.2,<2019.0a0|>=2018.0.3,<2019.0a0|>=2019.1,<2021.0a0|>=2019.3,<2021.0a0|>=2019.4,<2021.0a0|>=2021.2.0,<2022.0a0|>=2021.3.0,<2022.0a0|>=2021.4.0,<2022.0a0|>=2019.4,<2020.0a0']
numpy -> mkl_random -> mkl[version='>=2020.1,<2021.0a0']
```
### Versions
Latest main for TorchVision, PyTorch Core 20220912
cc @seemethere
It was fixed by @atalman yesterday, but it broke again today. See the latest failure.
The cmake_windows_cpu job was fixed on 20220919, but unfortunately cmake_macos_cpu is now broken.
@atalman @malfet @seemethere I understand that this issue pops up when the Core nightly releases break. Can we do anything to avoid breaking our CI (perhaps keep the older nightly available)? It's been happening quite often over the last few weeks. Thanks!
@datumbox @malfet It looks like this issue is related to https://github.com/pytorch/pytorch/issues/85085, which is connected to our nightly failures. Here is an example: https://github.com/pytorch/pytorch/actions/runs/3088067813/jobs/4995149208
Error during conda upload:
```
Error: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Error: Process completed with exit code 1.
```
Hence, a missing binary makes conda resolve to the wrong package, or fail to find the package at all, as in this case. I think that to mitigate this issue we should implement retries for conda uploads.
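A minimal sketch of what such a retry wrapper could look like, assuming the upload goes through the `anaconda` CLI; the retry budget and backoff schedule are placeholders:

```python
import subprocess
import sys
import time

MAX_ATTEMPTS = 5  # hypothetical retry budget


def upload_with_retry(package_path: str) -> None:
    """Retry the conda upload with exponential backoff on transient failures."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = subprocess.run(
            ["anaconda", "upload", "--force", package_path],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return
        # Connection resets like ConnectionResetError(104, ...) are transient,
        # so wait and try again instead of failing the whole job.
        print(f"Upload attempt {attempt} failed: {result.stderr}", file=sys.stderr)
        time.sleep(2 ** attempt)
    raise RuntimeError(f"conda upload failed after {MAX_ATTEMPTS} attempts")


if __name__ == "__main__":
    upload_with_retry(sys.argv[1])
```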
Bumping priority on this issue, since it is required for nightly and CI/CD stability.
The problem was fixed again yesterday by @atalman, but it broke again today. See:
- https://app.circleci.com/pipelines/github/pytorch/vision/20449/workflows/af45d07b-293f-4d16-83c8-9ebbc8998d60/jobs/1663608
- https://app.circleci.com/pipelines/github/pytorch/vision/20449/workflows/af45d07b-293f-4d16-83c8-9ebbc8998d60/jobs/1663584
Today the following is failing: https://app.circleci.com/pipelines/github/pytorch/vision/20545/workflows/59f0125b-d7f1-4e30-962b-c2fbbd59086a/jobs/1671677
I think one simple solution is to relax the requirement on the TorchVision side to always have a nightly corresponding to the day CI runs, and to use older ones if necessary (but, say, throw an error if the nightly is older than 2 weeks).
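For concreteness, the staleness check could look roughly like this; a sketch only, assuming nightly version strings follow the `dev<YYYYMMDD>` pattern seen above (the function name and cutoff constant are illustrative):

```python
import re
from datetime import date, datetime, timedelta

MAX_NIGHTLY_AGE = timedelta(weeks=2)  # proposed cutoff from the comment above


def check_nightly_freshness(torch_version: str) -> None:
    """Fail only if the installed nightly is older than the allowed window."""
    match = re.search(r"dev(\d{8})", torch_version)
    if match is None:
        return  # not a nightly build, nothing to check
    build_date = datetime.strptime(match.group(1), "%Y%m%d").date()
    age = date.today() - build_date
    if age > MAX_NIGHTLY_AGE:
        raise RuntimeError(
            f"PyTorch nightly from {build_date} is {age.days} days old; "
            f"the maximum allowed age is {MAX_NIGHTLY_AGE.days} days."
        )


check_nightly_freshness("1.13.0.dev20220912")  # e.g. torch.__version__
```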
@malfet It's a tricky tradeoff. I think we might need the expertise of RelEng and Core devs on this choice.
99% of the problems recorded in this series of issues happen because some web request fails and dependencies can't be downloaded. It typically happens to 1 or 2 binaries at a time, with the rest working fine. So in our current situation, only a fraction of the binaries from Core are expected to fail at any point. Hence, if a real breaking change occurs, since our tests run across the entire matrix, we would be able to catch it even if one or two binaries are old.
There are associated risks, though:
- @YosuaMichael is moving towards reducing the number of tests we run on the matrix, to speed up the tests and reduce costs. If one of the skipped configurations gets broken, we risk not finding out about a significant change on Core that might be hard to revert 2 weeks later.
- It's possible that an exotic breakage exists that is visible only on a very specific configuration. One example of such a breakage was in the dispatching of TorchVision kernels on M1 (see #6152).
On the one hand, this proposal seems low risk given our current setup and would remove the burden of dealing with these issues every other week. On the other hand, it could lead to big breakages that might undo the aforementioned time savings. There is also the option of a hybrid approach: retry the dependency downloads multiple times within the day and, at the same time, do what you said with a reduced time window (3 days instead of 2 weeks). I would love to hear your thoughts on the pros/cons here.
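To make the hybrid option concrete, here is a rough sketch, assuming the nightlies come from the pytorch-nightly conda channel and keep the `dev<date>` version pattern; the channel name, version spec, and retry counts are all assumptions:

```python
import subprocess
from datetime import date, timedelta

FALLBACK_DAYS = 3    # the reduced window suggested above
RETRIES_PER_DAY = 3  # hypothetical per-nightly retry count


def install_nightly_with_fallback() -> str:
    """Try today's nightly first, retrying transient failures, then walk
    back day by day up to FALLBACK_DAYS before giving up."""
    for days_back in range(FALLBACK_DAYS + 1):
        stamp = (date.today() - timedelta(days=days_back)).strftime("%Y%m%d")
        spec = f"pytorch==1.13.0.dev{stamp}"  # hypothetical version spec
        for _ in range(RETRIES_PER_DAY):
            result = subprocess.run(
                ["conda", "install", "-y", "-c", "pytorch-nightly", spec]
            )
            if result.returncode == 0:
                return spec
    raise RuntimeError(
        f"No installable nightly found within the last {FALLBACK_DAYS} days"
    )
```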
@datumbox The regression regarding the job retry should be fixed by https://github.com/pytorch/pytorch/pull/85545
@atalman Indeed, it looks like it. Thanks a lot for the patch. Shall we close this issue and reopen it if it reoccurs?