
CI upgrade plan (base driver and ROCm)

Open atamazov opened this issue 4 years ago • 8 comments

Originated from https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1148#issuecomment-998255498

I recommend updating nodes to the most recent released ROCm version and the corresponding kernel driver. 5.0 is not released yet, so I would either wait or use 4.5.2.

NOTE :warning: ROCm used for testing (in the docker container) must be of the same version as the base installation; otherwise failures are possible. Using newer ROCm than base driver is allowed, but the opposite combination must be avoided. That is why this process includes updates of the Dockerfile.
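The rule in the note can be sketched as a small check. This is only an illustration: the function names `parse_version` and `check_combo` are hypothetical and not part of MIOpen or its CI scripts, and the sketch assumes plain dotted numeric version strings.

```python
def parse_version(text):
    """Turn a dotted version string such as '4.5.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def check_combo(base_version, docker_version):
    """Classify a base/docker ROCm pairing according to the note above.

    Hypothetical helper for illustration; not an existing CI tool.
    """
    base = parse_version(base_version)
    docker = parse_version(docker_version)
    if docker == base:
        return "ok"       # same version: the recommended setup
    if docker > base:
        return "allowed"  # newer ROCm in docker than on the base system
    return "avoid"        # older ROCm in docker than on the base system
```

For example, `check_combo("4.3.1", "4.5.2")` yields `"allowed"`, while swapping the arguments yields `"avoid"`.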

We must be careful to avoid CI malfunction due to, for example, failures of static checks.

Proposed plan:

  • [ ] Devote one Navi21 node (re-label it to rocmtest-5.0, for example)
  • [ ] Upgrade base installation of ROCm and kernel driver
  • [ ] Create a new "upgrade" branch (e.g. wip-rocmtest-5.0-upgrade). Run it as is to see whether it passes on that node.
    • :warning: Failures are possible (see note above)
  • [ ] Modify "upgrade" branch: update Dockerfile to use new ROCm version and make sure it passes all the tests
  • [ ] Modify "upgrade" branch: include static checks and fix MIOpen until it passes tests
  • [ ] Devote one gfx906 node, one MI100 node and one MI200 node to the CI upgrade process. Upgrade base installation of ROCm and kernel driver on these nodes.
  • [ ] Update vega10, vega20, mi100 and mi200 trial branches from wip-rocmtest-5.0-upgrade and make sure that all tests pass.
    • [Note] All fixes from these trial nodes must be integrated back to the "upgrade" branch (e.g. wip-rocmtest-5.0-upgrade).
  • [ ] :red_circle: Inform everyone and shut down CI (e.g., this can be done by re-labeling all the nodes with rocmtest-5.0).
  • [ ] Upgrade remaining nodes
  • [ ] Merge wip-rocmtest-5.0-upgrade into develop and make sure it passes CI.
  • [ ] Merge develop into all trial branches and test all nodes.
  • [ ] :green_circle: Ask everyone to merge develop into their development branches.
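Why re-labeling stops CI can be shown with a toy model of label-based job dispatch. This is a sketch under assumptions: the node names and the old label `rocmtest` are hypothetical placeholders, the functions are not a Jenkins API, and the only behavior modeled is that a job runs only on nodes carrying the label it requests.

```python
def eligible_nodes(nodes, required_label):
    """Return the names of nodes that carry the label a job asks for."""
    return sorted(name for name, labels in nodes.items()
                  if required_label in labels)


def relabel(nodes, names, old_label, new_label):
    """Return a copy of the node table with old_label swapped for new_label."""
    updated = {name: set(labels) for name, labels in nodes.items()}
    for name in names:
        updated[name].discard(old_label)
        updated[name].add(new_label)
    return updated


# Hypothetical node table: every node starts with the old label.
nodes = {
    "node-a": {"rocmtest", "navi21"},
    "node-b": {"rocmtest", "gfx906"},
}

# After re-labeling every node, jobs that still request the old label
# find no eligible node, so only the "upgrade" branch can run.
after = relabel(nodes, list(nodes), "rocmtest", "rocmtest-5.0")
```

In this model `eligible_nodes(after, "rocmtest")` is empty, which corresponds to the CI shutdown step, while `eligible_nodes(after, "rocmtest-5.0")` lists both nodes.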

/cc @junliume @JehandadKhan @pfultz2 @okakarpa @jbakhrai

atamazov avatar Jan 12 '22 22:01 atamazov

@atamazov 4.5.2 in dockerfiles seems to have no problems on the current CI pipelines (while the base OS ROCm is not updated): https://github.com/ROCmSoftwarePlatform/MIOpen/commits/jenkins-ci-rocm-4.5 http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/jenkins-ci-rocm-4.5/9/pipeline

junliume avatar Jan 13 '22 00:01 junliume

@junliume Thanks for sharing this! AFAICS jenkins-ci-rocm-4.5 ae6c67fe succeeded only after the 3rd attempt, so I wouldn't say that there are no problems at all. And yes, we can upgrade only the dockerfiles (like we often did in the past), but this is not what this ticket is about. The idea is to update the base driver because this may become important for the new GPUs (like Navi21 or MI200).

atamazov avatar Jan 13 '22 13:01 atamazov

Of course I am not against updating docker to 4.5.2. This does not require any kind of special plan.

atamazov avatar Jan 13 '22 13:01 atamazov

@jbakhrai could you align with @okakarpa for the best time window for this task? Thanks!

junliume avatar Jan 21 '22 03:01 junliume

ROCm used for testing (in the docker container) must be of the same version as the base installation; otherwise failures are possible

The ROCm versions should be backwards-compatible (except releases where the ABI is changed, like in ROCm 5.0). So we can use ROCm 4.5 in the docker container.

What is the rationale for needing to upgrade to 4.5 on the bare machines? Why can't we wait for the ROCm 5.0 release to upgrade all the CI nodes?

There is a pretty long process to upgrade all the nodes, and we will definitely need to do this for 5.0: since it might have ABI changes, we will need to use 5.0 in the docker containers. It doesn't make sense to do this now and then turn around and upgrade for 5.0. We might not even finish all this testing for 4.5 before we need to start the upgrades for 5.0.

pfultz2 avatar Jan 21 '22 18:01 pfultz2

@pfultz2

What is the rationale for needing to upgrade to 4.5 on the bare machines? Why can't we wait for the ROCm 5.0 release to upgrade all the CI nodes?

Some of the Navi21 nodes are already upgraded to 4.5, but docker stays at 4.3.1, which might lead to stability issues. But I think we can wait with this ticket (full CI upgrade) and only update docker to 4.5.2.

ROCm used for testing (in the docker container) must be of the same version as the base installation; otherwise failures are possible

The ROCm versions should be backwards-compatible (except releases where the ABI is changed, like in ROCm 5.0). So we can use ROCm 4.5 in the docker container.

Of course. That is why this ticket is saying, in the next statement:

Using newer ROCm than base driver is allowed...

Note that the main goal of this ticket is to inform everyone about the "full CI upgrade" plan, regardless of ROCm version.

atamazov avatar Jan 21 '22 19:01 atamazov

@atamazov Is this ticket still relevant? Thanks!

ppanchad-amd avatar Apr 16 '24 14:04 ppanchad-amd

@ppanchad-amd I believe so, as it proposes a secure and almost seamless process for updating CI nodes. However, let @junliume make the final decision (and re-label the ticket as appropriate).

atamazov avatar Apr 23 '24 13:04 atamazov