[Epic] Kubeflow 1.4 upgrade
closes #681 #1094
All tasks to do with the code will be added
Code Changes
Kubeflow - central dashboard
- Base work of commit on https://github.com/StatCan/daaas/issues/833#issuecomment-1016527311
- [x] ~~I18n is going to be an issue: upstream changed direction and went with the Angular i18n instead of the one used when they merged our PR. We need to make a decision as to which way we want to go.~~ Wrong language: the central-dashboard is written in pug, and that was not pushed upstream. The Angular i18n is actually for jupyter-web-app.
- [x] The ConfigMap will, as usual, need to be double-checked to make sure it uses our items.
- [x] Do we want tensorboard?
Jupyter-web-apps
- [ ] Image but about the overwriting of the logos
- [ ] Is there any feature we need to add to our backend? The commit for 1.3 was https://github.com/StatCan/jupyter-apis/commit/863a996748b37e94aae013ea9de3d640af530423, and something similar will need to be done.
Volume web apps?
Last time we seem to have ignored the volume-web-app in favor of our own customized volume table in jupyter-web-app, so we do not need to do any customization unless we decide to revert that decision.
Pipelines
- ~Our current release of pipelines seems to be 1.2. Recommended version with Kubeflow 1.4 is Pipelines 1.7~
- [ ] Decision about whether we want to upstream the i18n of pipelines. We might want to get someone on this and get it merged upstream, regardless of our 1.4 upgrade.
Manifest
Overview
This manifests folder exactly matches the upstream Kubeflow Manifests repository in its naming and folder hierarchy.
- Kubeflow Manifests
- [ ] Need to copy the manifest from the 1.4 branch of kubeflow (added tensorboard)
Note: We are pushing all of the work into the `aaw-dev-cc-00` branch for aaw-kubeflow-manifests.
Post Deploy Tasks
- Pipelines: Check the Cluster Roles are sufficient for Pipelines + Argo Workflow (Archive, Delete, Run, Experiments)
- Profiles: Check the new Access Management KFAM works without our KFAM adjustments
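
The Pipelines check in the list above could be scripted roughly as follows. This is only a sketch under stated assumptions: the service account name `ml-pipeline`, the namespace `kubeflow`, and the mapping of the UI actions (Archive, Delete, Run, Experiments) to verbs on Argo `workflows` are illustrative guesses, not taken from our manifests. The only real piece is the `kubectl auth can-i` command itself.

```python
import subprocess

# Assumed (hypothetical) verbs the Pipelines service account needs on Argo
# Workflow objects. Adjust to match the Cluster Roles actually deployed.
REQUIRED_VERBS = ["get", "list", "create", "delete"]
RESOURCE = "workflows.argoproj.io"


def can_i_commands(service_account, namespace, verbs=REQUIRED_VERBS, resource=RESOURCE):
    """Build one `kubectl auth can-i` command per verb, impersonating the service account."""
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return [
        ["kubectl", "auth", "can-i", verb, resource, "--as", subject, "-n", namespace]
        for verb in verbs
    ]


def missing_permissions(service_account, namespace, run=subprocess.run):
    """Return the verbs for which kubectl does not answer 'yes'."""
    missing = []
    for verb, cmd in zip(REQUIRED_VERBS, can_i_commands(service_account, namespace)):
        result = run(cmd, capture_output=True, text=True)
        if result.stdout.strip() != "yes":
            missing.append(verb)
    return missing
```

Calling `missing_permissions("ml-pipeline", "kubeflow")` against the dev cluster would return an empty list if every assumed verb is allowed. The `run` parameter exists only so the function can be exercised without a cluster.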
Common
| Component | Local Manifests Path | Upstream | Issue | AAW Sign-off | CNS Sign-off | Notes |
|---|---|---|---|---|---|---|
| kubeflow-namespace | common/kubeflow-namespace | v1.4.1 | #198 | | | No structural changes. |
| kubeflow-roles | common/kubeflow-roles | v1.4.1 | #199 | | | No structural changes. |
| oidc-authservice | common/oidc-authservice | v1.4.1 | #200 | | | No structural changes. |
| kubeflow-knative | common/knative | v1.4.1 | #201 | | | No structural changes. |

I think anything that is not a direct folder equivalent is in the knative folder.
Apps
| Component | Local Manifests Path | Upstream | Issue | AAW Sign-off | CNS Sign-off | Notes |
|---|---|---|---|---|---|---|
| admission-webhook | apps/admission-webhook | v1.4.1 | #202 | | | Was not in the list |
| central-dashboard | apps/centraldashboard | v1.4.1 | #203 | | | No structural changes. |
| jupyter-web-apps | apps/jupyter-web-app | v1.4.1 | #204 | | | Named jupyter upstream? Or equivalent to jupyter + volume + tensorboard |
| katib | apps/katib | v1.4.1 | #210 | | | No structural changes. |
| kfp-tekton | | v1.4.1 | | | | New |
| kfserving | apps/kfserving | v1.4.1 | #211 | | | No structural changes. |
| kubebench | | v1.4.1 | | | | New |
| mpi-job | apps/mpi-job | v1.4.1 | #212 | | | SAME, but is moved in 1.5.1 like the other *-job apps. |
| mxnet-job | apps/mxnet-job | v1.4.1 | | | | Changed Upstream see apps/training-operator |
| notebook-controller | apps/notebook-controller | v1.4.1 | #216 | | | Our custom controller? |
| pipeline | apps/pipeline | v1.4.1 | #221 | | | No structural changes. |
| profiles | apps/profiles | v1.4.1 | #213 | | | No structural changes. |
| pytorch-job | apps/pytorch-job | v1.4.1 | | | | Changed Upstream see apps/training-operator |
| tensorboard | | v1.4.1 | | | | New or different upstream - see jupyter-web-apps |
| tf-training | apps/tf-training | v1.4.1 | | | | Deleted or different upstream |
| training-operations | | v1.4.1 | | | | New! See Changed Upstream apps/training-operator |
| volume-web-apps | | v1.4.1 | | | | New or different upstream - see jupyter-web-apps |
Contrib
| Component | Local Manifests Path | Upstream | Issue | AAW Sign-off | CNS Sign-off |
|---|---|---|---|---|---|
| spark-operator | apps/spark-operator | v1.4.1 | | | |
| seldon | contrib/seldon | v1.4.1 | | | |

The following are in 1.4.1 and were also in 1.3.1, but we don't have them. Maybe we don't use them (to confirm):
- application
- basic-auth
- dex-auth
- experimental
- feast
- flink
- gatekeeper
- modeldb/base
- spark - Same as spark-operator???
- spartakus
- tektoncd
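
Because the local manifests folder mirrors the upstream hierarchy, a list like the one above can be produced by comparing folder names in two checkouts. The sketch below is one way to do that, not the method actually used for this epic. The function names are hypothetical, and it assumes each checkout has `common`, `apps` and `contrib` folders with one sub-folder per component.

```python
from pathlib import Path


def component_dirs(root, groups=("common", "apps", "contrib")):
    """Return component paths such as 'apps/katib' found one level under each group folder."""
    found = set()
    for group in groups:
        group_dir = Path(root) / group
        if not group_dir.is_dir():
            continue  # a checkout may not have every group folder
        for child in group_dir.iterdir():
            if child.is_dir():
                found.add(f"{group}/{child.name}")
    return found


def compare_trees(local_root, upstream_root):
    """Split components into those only upstream, only local, and present in both."""
    local = component_dirs(local_root)
    upstream = component_dirs(upstream_root)
    return {
        "upstream_only": sorted(upstream - local),
        "local_only": sorted(local - upstream),
        "shared": sorted(local & upstream),
    }
```

The `upstream_only` entries are the candidates for the "we don't have them" list, and `local_only` entries point at components that were renamed or removed upstream.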
Containers
We provide our own Kubeflow Notebooks that are updated continuously:
- `k8scc01covidacr.azurecr.io/rstudio:<sha>`
- `k8scc01covidacr.azurecr.io/jupyterlab-cpu:<sha>`
- `k8scc01covidacr.azurecr.io/jupyterlab-pytorch:<sha>`
- `k8scc01covidacr.azurecr.io/jupyterlab-tensorflow:<sha>`
- `k8scc01covidacr.azurecr.io/remote-desktop:<sha>`
The following are the Kubeflow components we override in order to add features such as i18n and improved performance:
| Container | Kubeflow Component | Comparison | AAW Sign-off | CNS Sign-off |
|---|---|---|---|---|
| oidc-authservice | oidc-authservice | [compare-oidc-authservice] | | |
| centraldashboard | centraldashboard | [compare-centraldashboard] | | |
| jupyter-apis | jupyter-web-app | [compare-jupyter-apis] | | |
| kubeflow-pipelines | ml-pipeline/frontend | [compare-kubeflow-pipelines] | | |
Previous Epic
EPIC Kubeflow Upgrade Planning v1.3.1
Final Stretch: Core Upgrade Checkpoint for 1.4
In the interest of time, we will focus on upgrading the core components of Kubeflow.
We will finish upgrading these components to 1.4 first:
- admission-webhook
- notebook-controller
- profiles
- volume-web-apps
At the same time we'll need to do some heavy lifting for Jupyter Web Apps:
- jupyter-web-apps
- Frontend: @wg102
- Backend: @Collinbrown95
Note: send PRs to kf-1.4-upgrade
Finally, once those tickets are complete, we can ask @sylus to review and apply manifests to dev cluster.
Then... upgrade to 1.6
If you can, please follow the exact method used for the Kubeflow upgrade to 1.3.x so it doesn't get out of hand like last time.
The more methodical approach worked really well ^_^
https://github.com/StatCan/aaw-kubeflow-manifests/issues/110
A useful link to compare the versions for 1.4 is: https://www.kubeflow.org/docs/releases/kubeflow-1.4/
Will @rohank07 be helping with this, since he knows how we rendered the manifests to check the delta etc.? ^_^
I won't be actively on the KF 1.4 upgrade. But if you want to view the output of the rendered manifest, the command `task stack:aaw:preview` in taskfile.yaml should do the trick and help with debugging.
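
To check the delta between two renders (for example, the output of `task stack:aaw:preview` saved before and after a change), a plain text diff may be enough. This is a minimal sketch that assumes each render has been saved as text. The default file names are hypothetical labels, and nothing here comes from the repository's Taskfile.

```python
import difflib


def manifest_delta(old_text, new_text, old_name="before.yaml", new_name="after.yaml"):
    """Return a unified diff between two rendered manifest strings ('' if identical)."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(diff)
```

Reading the two saved renders and printing `manifest_delta(old, new)` shows only the lines that changed, which keeps the review focused on the actual differences.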
Ideally someone that worked on it before with me would be active on it, bit confused about that.
Anyways, this is a 2-week task; if it looks like it might take longer, I'd bring in @rohank07, who worked on it previously, to speed things up :)
Closing since we are going directly to 1.6. See https://github.com/StatCan/daaas/issues/1337, which replaces this issue.