CI upgrade plan (base driver and ROCm)
Originated from https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1148#issuecomment-998255498
I recommend updating the nodes to the most recent released ROCm version and the corresponding kernel driver. ROCm 5.0 is not released yet, so I would either wait for it or use 4.5.2.
NOTE :warning: ROCm used for testing (in the docker container) must be of the same version as the base installation; otherwise failures are possible. Using newer ROCm than base driver is allowed; but an opposite combo must be avoided. That is why this process includes updates of the Dockerfile.
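The version rule in the note can be sketched as a small check. This is only an illustration, assuming plain dotted version strings; `parse_version` and `combo_is_allowed` are hypothetical helper names, not part of MIOpen or ROCm tooling.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '4.5.2' into a comparable tuple (4, 5, 2)."""
    return tuple(int(part) for part in text.split("."))


def combo_is_allowed(container_rocm: str, base_rocm: str) -> bool:
    """True when the ROCm in the docker container is the same as, or newer than,
    the base installation. An older container on a newer base is rejected."""
    return parse_version(container_rocm) >= parse_version(base_rocm)


# Same version in the container and on the base: the preferred combination.
print(combo_is_allowed("4.5.2", "4.5.2"))
# Newer ROCm in the container than on the base: allowed.
print(combo_is_allowed("4.5.2", "4.3.1"))
# Older ROCm in the container than on the base: the combination to avoid.
print(combo_is_allowed("4.3.1", "4.5.2"))
```

The check only compares version numbers, so it says nothing about whether a particular pair of releases actually works together.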
We must be careful to avoid CI malfunction due to, for example, failures of static checks.
Proposed plan:
- [ ] Devote one Navi21 node (re-label it to `rocmtest-5.0`, for example)
- [ ] Upgrade base installation of ROCm and kernel driver
- [ ] Create new "upgrade" branch (e.g. `wip-rocmtest-5.0-upgrade`). Try it to see if it is passing on that node, as is.
  - :warning: Failures are possible (see note above)
- [ ] Modify "upgrade" branch: update Dockerfile to use new ROCm version and make sure it passes all the tests
- [ ] Modify "upgrade" branch: include static checks and fix MIOpen until it passes tests
- [ ] Devote one gfx906 node, one MI100 node and one MI200 node to the CI upgrade process. Upgrade base installation of ROCm and kernel driver on these nodes.
- [ ] Update vega10, vega20, mi100 and mi200 trial branches from `wip-rocmtest-5.0-upgrade` and make sure that all tests pass.
  - [Note] All fixes from these trial nodes must be integrated back to the "upgrade" branch (e.g. `wip-rocmtest-5.0-upgrade`).
- [ ] :red_circle: Inform everyone and shut down CI (e.g., this can be done by re-labeling all the nodes with `rocmtest-5.0`).
- [ ] Upgrade remaining nodes
- [ ] Merge `wip-rocmtest-5.0-upgrade` into `develop` and make sure it passes CI.
- [ ] Merge `develop` into all trial branches and test all nodes.
- [ ] :green_circle: Ask everyone to merge `develop` into their development branches.
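The re-labeling steps in the plan work because a CI job only runs on nodes that carry the label it requests. A minimal sketch of that idea follows. The node names and the old label `rocmtest` are assumptions made for illustration, and `nodes_for_label` and `relabel` are hypothetical helpers, not Jenkins APIs.

```python
from typing import Dict, List


def nodes_for_label(nodes: Dict[str, str], label: str) -> List[str]:
    """Return the names of nodes whose label matches the one a job requests."""
    return sorted(name for name, node_label in nodes.items() if node_label == label)


def relabel(nodes: Dict[str, str], names: List[str], new_label: str) -> Dict[str, str]:
    """Return a copy of the node table with the given nodes moved to new_label."""
    updated = dict(nodes)
    for name in names:
        updated[name] = new_label
    return updated


# Hypothetical node table: node name -> current label.
ci_nodes = {"navi21-a": "rocmtest", "navi21-b": "rocmtest", "gfx906-a": "rocmtest"}

# Devote one Navi21 node to the upgrade; regular jobs no longer land on it.
trial = relabel(ci_nodes, ["navi21-a"], "rocmtest-5.0")
print(nodes_for_label(trial, "rocmtest"))      # nodes still serving regular CI
print(nodes_for_label(trial, "rocmtest-5.0"))  # nodes serving the "upgrade" branch

# Re-label every node: jobs that request the old label find no node to run on,
# which is one way to shut down CI while the remaining nodes are upgraded.
shutdown = relabel(ci_nodes, list(ci_nodes), "rocmtest-5.0")
print(nodes_for_label(shutdown, "rocmtest"))
```

Once `wip-rocmtest-5.0-upgrade` is merged into `develop`, jobs request the new label and all nodes become available again.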
/cc @junliume @JehandadKhan @pfultz2 @okakarpa @jbakhrai
@atamazov 4.5.2 in dockerfiles seems to have no problems on the current CI pipelines (while the base OS ROCm installations are not updated): https://github.com/ROCmSoftwarePlatform/MIOpen/commits/jenkins-ci-rocm-4.5 http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/jenkins-ci-rocm-4.5/9/pipeline
@junliume Thanks for sharing this! AFAICS jenkins-ci-rocm-4.5 ae6c67fe succeeded only after the 3rd attempt, so I wouldn't say that there are no problems at all. And yes, we can upgrade only the dockerfiles (like we often did in the past), but this is not what this ticket is about. The idea is to update the base driver, because this may become important for the new GPUs (like Navi21 or MI200).
Of course I am not against updating docker to 4.5.2. This does not require any kind of special plan.
@jbakhrai could you align with @okakarpa for the best time window for this task? Thanks!
> ROCm used for testing (in the docker container) must be of the same version as the base installation; otherwise failures are possible
The ROCm versions should be backwards-compatible (except releases where the ABI is changed, like in ROCm 5.0). So we can use ROCm 4.5 in the docker container.
What is the rationale for needing to upgrade to 4.5 on the bare machines? Why can't we wait for the ROCm 5.0 release to upgrade all the CI nodes?
There is a pretty long process to upgrade all the nodes, and we will definitely need to do this for 5.0: since it might have ABI changes, we need to use 5.0 in docker containers. It doesn't make sense to do this now and then turn around and upgrade for 5.0. We might not even finish all this testing for 4.5 by the time we need to start the upgrades for 5.0.
@pfultz2
> What is the rationale for needing to upgrade to 4.5 on the bare machines? Why can't we wait for the ROCm 5.0 release to upgrade all the CI nodes?
Some of the Navi21 nodes are already upgraded to 4.5, but docker stays at 4.3.1, which might lead to stability issues. But I think we can wait with this ticket (full CI upgrade) and only update docker to 4.5.2.
> > ROCm used for testing (in the docker container) must be of the same version as the base installation; otherwise failures are possible
> The ROCm versions should be backwards-compatible (except releases where the ABI is changed, like in ROCm 5.0). So we can use ROCm 4.5 in the docker container.
Of course. That is why this ticket is saying, in the next statement:
> Using newer ROCm than base driver is allowed...
Note that the main goal of this ticket is to inform everyone about "full CI upgrade" plan, regardless of ROCm version.
@atamazov Is this ticket still relevant? Thanks!
@ppanchad-amd I believe so, as it proposes a safe and almost seamless process for updating CI nodes. However, let @junliume make the final decision (and re-label the ticket as appropriate).