
Job `cmake_*` broken due to dependency issue

Open datumbox opened this issue 3 years ago • 4 comments

🐛 Describe the bug

The cmake_windows_cpu job started failing on 20220912 with:

Specifications:

  - pytorch=1.13.0.dev20220912 -> python[version='>=3.10,<3.11.0a0|>=3.9,<3.10.0a0|>=3.8,<3.9.0a0']

Your python: python=3.7

If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.

The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package intel-openmp conflicts for:
numpy -> mkl[version='>=2021.4.0,<2022.0a0'] -> intel-openmp[version='2021.*|2022.*']
pytorch=1.13.0.dev20220912 -> mkl[version='>=2018'] -> intel-openmp[version='2021.*|2022.*']
pytorch=1.13.0.dev20220912 -> intel-openmp

Package ca-certificates conflicts for:
python=3.7 -> openssl[version='>=1.1.1n,<1.1.2a'] -> ca-certificates
numpy -> python[version='>=2.7,<2.8.0a0'] -> ca-certificates

Package mkl conflicts for:
numpy -> mkl[version='>=2018.0.0,<2019.0a0|>=2018.0.1,<2019.0a0|>=2018.0.2,<2019.0a0|>=2018.0.3,<2019.0a0|>=2019.1,<2021.0a0|>=2019.3,<2021.0a0|>=2019.4,<2021.0a0|>=2021.2.0,<2022.0a0|>=2021.3.0,<2022.0a0|>=2021.4.0,<2022.0a0|>=2019.4,<2020.0a0']
numpy -> mkl_random -> mkl[version='>=2020.1,<2021.0a0']

Versions

Latest main for TorchVision, PyTorch Core 20220912

cc @seemethere

datumbox avatar Sep 13 '22 15:09 datumbox

It was fixed by @atalman yesterday but it broke again today. See latest failure.

datumbox avatar Sep 14 '22 11:09 datumbox

The cmake_windows_cpu job was fixed on 20220919, but unfortunately cmake_macos_cpu is now broken.

@atalman @malfet @seemethere I understand that this issue pops up when the Core nightly releases break. Can we do anything to avoid breaking our CI (perhaps keep the older nightly available)? It's been happening quite often the last few weeks. Thanks!

datumbox avatar Sep 20 '22 11:09 datumbox

@datumbox @malfet looks like this issue is related to https://github.com/pytorch/pytorch/issues/85085, which is connected to our nightly failures; here is an example: https://github.com/pytorch/pytorch/actions/runs/3088067813/jobs/4995149208

error during conda upload:

Error:  ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Error: Process completed with exit code 1.

Hence a missing binary makes conda resolve to the wrong package, or fail to find the package at all, as in this case. I think that to mitigate this issue we should implement a retry for conda uploads.
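A minimal sketch of what such a retry could look like, assuming the upload step is wrapped in a callable that raises on failure (`upload_fn`, `max_attempts`, and `base_delay` are hypothetical names, not part of the actual CI scripts):

```python
import time


def upload_with_retry(upload_fn, max_attempts=3, base_delay=1.0):
    """Retry a flaky upload callable with exponential backoff.

    `upload_fn` stands in for whatever performs the conda upload
    (e.g. shelling out to the upload CLI); it must raise on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return upload_fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # back off 1s, 2s, 4s, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
```

This would absorb transient `Connection reset by peer` errors like the one above while still failing the job if the upload endpoint is persistently unreachable.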

atalman avatar Sep 21 '22 01:09 atalman

Bumping priority on this issue, since it is required for nightly and CI/CD stability.

atalman avatar Sep 21 '22 01:09 atalman

The problem was fixed again yesterday by @atalman but broke again today. See:

  • https://app.circleci.com/pipelines/github/pytorch/vision/20449/workflows/af45d07b-293f-4d16-83c8-9ebbc8998d60/jobs/1663608
  • https://app.circleci.com/pipelines/github/pytorch/vision/20449/workflows/af45d07b-293f-4d16-83c8-9ebbc8998d60/jobs/1663584

datumbox avatar Sep 22 '22 11:09 datumbox

Today the following is failing: https://app.circleci.com/pipelines/github/pytorch/vision/20545/workflows/59f0125b-d7f1-4e30-962b-c2fbbd59086a/jobs/1671677

datumbox avatar Sep 23 '22 16:09 datumbox

I think one simple solution is to relax the requirement on the TorchVision side: instead of always requiring the nightly corresponding to the day CI is running, fall back to older ones if necessary (but, say, throw an error if the newest available nightly is older than 2 weeks).
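A sketch of the staleness check described above, assuming the build date is already extracted from the nightly version string (the function name and threshold are illustrative, not part of any existing TorchVision CI script):

```python
from datetime import date


def check_nightly_age(nightly_date: date, today: date,
                      max_age_days: int = 14) -> None:
    """Fail the build if the newest available nightly is too old.

    `nightly_date` would come from the trailing YYYYMMDD of a version
    string such as '1.13.0.dev20220912'; this is a hypothetical helper.
    """
    age_days = (today - nightly_date).days
    if age_days > max_age_days:
        raise RuntimeError(
            f"Newest PyTorch nightly is {age_days} days old "
            f"(limit: {max_age_days} days); failing the build."
        )
```

Within the window, CI would proceed on the older nightly; past it, the job fails loudly instead of silently testing against stale Core binaries.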

malfet avatar Sep 26 '22 15:09 malfet

@malfet It's a tricky tradeoff. I think we might need the expertise of RelEng and Core devs on this choice.

99% of the problems recorded in this series of issues occur because some web request fails and we can't download dependencies. It typically happens to 1 or 2 binaries at a time, with the rest working fine. So in our current situation, only a fraction of the binaries from Core are expected to fail. Consequently, if a real breaking change occurs, because our tests run across the entire matrix, we would be able to catch it even if one or two binaries are old.

There are risks associated though:

  1. @YosuaMichael is moving towards reducing the number of tests we run on the matrix to speed up CI and reduce costs. If one of the remaining configurations gets broken, we risk not learning about a significant change on Core that might be hard to revert 2 weeks later.
  2. It's possible that an exotic breakage exists that is visible only on a very specific configuration. One example of such a breakage was on the dispatching of TorchVision kernels in M1 (see #6152).

On one hand, this proposal seems low risk given our current setup and would remove the burden of dealing with these issues every other week. On the other hand, it could lead to big breakages that might undo the aforementioned time savings. There is also the option of a hybrid approach: retry downloading the dependencies multiple times within the day and, at the same time, do what you said with a reduced time window (3 days instead of 2 weeks). I would love to hear your thoughts on the pros/cons here.
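Any of the fallback variants discussed above needs the build date of the nightly, which in the conda specs earlier in this issue appears as the trailing `YYYYMMDD` of the version string. A small sketch of extracting it (the helper name is hypothetical):

```python
import re
from datetime import date, datetime


def parse_nightly_date(version: str) -> date:
    """Extract the YYYYMMDD build date from a nightly version string
    such as '1.13.0.dev20220912' (format as seen in the conda specs)."""
    match = re.search(r"\.dev(\d{8})$", version)
    if match is None:
        raise ValueError(f"not a nightly version string: {version!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()
```

The resulting date could then feed either the 2-week or the 3-day staleness window without hard-coding "today's nightly must exist".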

datumbox avatar Sep 26 '22 16:09 datumbox

@datumbox The regression regarding the job retry should be fixed by https://github.com/pytorch/pytorch/pull/85545

atalman avatar Sep 26 '22 20:09 atalman

@atalman indeed, it looks like it. Thanks a lot for the patch. Shall we close this issue and reopen it if the problem recurs?

datumbox avatar Sep 26 '22 21:09 datumbox