[Epic] Kubeflow 1.4 upgrade

Open chuckbelisle opened this issue 3 years ago • 5 comments

closes #681 #1094

All code-related tasks will be added to this issue.

Code Changes

Kubeflow - central dashboard

  • Base work of commit on https://github.com/StatCan/daaas/issues/833#issuecomment-1016527311
  • [x] ~~I18n is going to be an issue. Upstream changed direction and went with the Angular i18n instead of the one in place when they merged our PR. We need to make a decision as to which way we want to go.~~ Wrong language: the central-dashboard is written in pug, and that was not pushed upstream. This Angular i18n is actually for jupyter-web-app.
  • [x] The ConfigMap will, as usual, need to be double-checked to make sure it uses our items.
  • [x] Do we want tensorboard?

Jupyter-web-apps

  • [ ] Image bug about the overwriting of the logos
  • [ ] Is there any feature we need to add to our backend? The commit for 1.3 was https://github.com/StatCan/jupyter-apis/commit/863a996748b37e94aae013ea9de3d640af530423; something similar will need to be done.

Volume web apps?

Last time we seem to have ignored the volume-web-app in favor of our own customized volume table in jupyter-web-app, so we do not need to do any customization unless we decide to revert that decision.

Pipelines

Manifest

Overview

This manifests folder exactly matches the upstream Kubeflow Manifests repository in its naming and folder hierarchy.

  • Kubeflow Manifests
  • [ ] Need to copy the manifest from the 1.4 branch of kubeflow (added tensorboard)

Note: We are pushing all of the work into the aaw-dev-cc-00 branch for aaw-kubeflow-manifests.
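One way to check that the local manifests folder still matches upstream in naming and folder hierarchy is to compare the two directory trees. The following is a minimal sketch, not the team's actual tooling: the function names are hypothetical, and the demo builds two throwaway trees instead of using real checkouts of the upstream and local repositories.

```python
import tempfile
from pathlib import Path


def relative_dirs(root):
    """Collect every sub-directory under root as a POSIX-style relative path."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()}


def hierarchy_delta(upstream_root, local_root):
    """Return (only_upstream, only_local) as sorted lists of relative paths."""
    upstream = relative_dirs(upstream_root)
    local = relative_dirs(local_root)
    return sorted(upstream - local), sorted(local - upstream)


if __name__ == "__main__":
    # Throwaway trees so the sketch runs anywhere. In practice the two
    # arguments would be checkouts of the upstream and local manifests repos.
    with tempfile.TemporaryDirectory() as tmp:
        upstream, local = Path(tmp, "upstream"), Path(tmp, "local")
        for sub in ("common/knative", "apps/profiles", "apps/training-operator"):
            (upstream / sub).mkdir(parents=True)
        for sub in ("common/knative", "apps/profiles", "apps/tf-training"):
            (local / sub).mkdir(parents=True)
        only_upstream, only_local = hierarchy_delta(upstream, local)
        print("only upstream:", only_upstream)
        print("only local:", only_local)
```

Folders that appear in only one of the two lists are the candidates to review by hand, for example components that were renamed or merged upstream.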

Post Deploy Tasks

  • Pipelines: Check the Cluster Roles are sufficient for Pipelines + Argo Workflow (Archive, Delete, Run, Experiments)
  • Profiles: Check the new Access Management KFAM works without our KFAM adjustments

Common

| Component | Local Manifests Path | Upstream | Issue | AAW Sign-off | CNS Sign-off | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| kubeflow-namespace | common/kubeflow-namespace | v1.4.1 | #198 | | | No structural changes. |
| kubeflow-roles | common/kubeflow-roles | v1.4.1 | #199 | | | No structural changes. |
| oidc-authservice | common/oidc-authservice | v1.4.1 | #200 | | | No structural changes. |
| kubeflow-knative | common/knative | v1.4.1 | #201 | | | No structural changes. |

I think anything that is not a direct folder equivalent is in the knative folder.

Apps

| Component | Local Manifests Path | Upstream | Issue | AAW Sign-off | CNS Sign-off | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| admission-webhook | apps/admission-webhook | v1.4.1 | #202 | | | Was not in the list |
| central-dashboard | apps/centraldashboard | v1.4.1 | #203 | | | No structural changes. |
| jupyter-web-apps | apps/jupyter-web-app | v1.4.1 | #204 | | | Named jupyter upstream? Or equivalent to jupyter + volume + tensorboard |
| katib | apps/katib | v1.4.1 | #210 | | | No structural changes. |
| kfp-tekton | | v1.4.1 | | | | New |
| kfserving | apps/kfserving | v1.4.1 | #211 | | | No structural changes. |
| kubebench | | v1.4.1 | | | | New |
| mpi-job | apps/mpi-job | v1.4.1 | #212 | | | SAME, but is moved in 1.5.1 like the other \*-job apps. |
| mxnet-job | apps/mxnet-job | v1.4.1 | | | | Changed Upstream see apps/training-operator |
| notebook-controller | apps/notebook-controller | v1.4.1 | #216 | | | Our custom controller? |
| pipeline | apps/pipeline | v1.4.1 | #221 | | | No structural changes. |
| profiles | apps/profiles | v1.4.1 | #213 | | | No structural changes. |
| pytorch-job | apps/pytorch-job | v1.4.1 | | | | Changed Upstream see apps/training-operator |
| tensorboard | | v1.4.1 | | | | New or different upstream - see jupyter-web-apps |
| tf-training | apps/tf-training | v1.4.1 | | | | Deleted or different upstream |
| training-operations | | v1.4.1 | | | | New! See Changed Upstream apps/training-operator |
| volume-web-apps | | v1.4.1 | | | | New or different upstream - see jupyter-web-apps |

Contrib

| Component | Local Manifests Path | Upstream | Issue | AAW Sign-off | CNS Sign-off |
| --- | --- | --- | --- | --- | --- |
| spark-operator | apps/spark-operator | v1.4.1 | | | |
| seldon | contrib/seldon | v1.4.1 | | | |

The following are in 1.4.1 and were also in 1.3.1, but we don't have them. Maybe we don't use them. To be confirmed.

  • application
  • basic-auth
  • dex-auth
  • experimental
  • feast
  • flink
  • gatekeeper
  • modeldb/base
  • spark - Same as spark-operator???
  • spartakus
  • tektoncd

Containers

We provide our own Kubeflow Notebooks that are updated continuously:

  • k8scc01covidacr.azurecr.io/rstudio:<sha>
  • k8scc01covidacr.azurecr.io/jupyterlab-cpu:<sha>
  • k8scc01covidacr.azurecr.io/jupyterlab-pytorch:<sha>
  • k8scc01covidacr.azurecr.io/jupyterlab-tensorflow:<sha>
  • k8scc01covidacr.azurecr.io/remote-desktop:<sha>

The following are the Kubeflow components we override in order to add features such as i18n and improved performance:

| Container | Kubeflow Component | Comparison | AAW Sign-off | CNS Sign-off |
| --- | --- | --- | --- | --- |
| oidc-authservice | oidc-authservice | [compare-oidc-authservice] | | |
| centraldashboard | centraldashboard | [compare-centraldashboard] | | |
| jupyter-apis | jupyter-web-app | [compare-jupyter-apis] | | |
| kubeflow-pipelines | ml-pipeline/frontend | [compare-kubeflow-pipelines] | | |

Previous Epic

EPIC Kubeflow Upgrade Planning v1.3.1

Final Stretch: Core Upgrade Checkpoint for 1.4

In the interest of time, we will focus on upgrading the core components of Kubeflow.

We will finish upgrading these components to 1.4 first:

  • admission-webhook
  • notebook-controller
  • profiles
  • volume-web-apps

At the same time we'll need to do some heavy lifting for Jupyter Web Apps:

  • jupyter-web-apps
    • Frontend: @wg102
    • Backend: @Collinbrown95

Note: send PRs to kf-1.4-upgrade

Finally, once those tickets are complete, we can ask @sylus to review and apply manifests to dev cluster.

Then... upgrade to 1.6

chuckbelisle, Jun 22 '22 16:06

If you can, please follow the exact method used for the Kubeflow upgrade to 1.3.x so it doesn't get out of hand like last time.

The more methodical approach worked really well ^_^

https://github.com/StatCan/aaw-kubeflow-manifests/issues/110

sylus, Jun 29 '22 15:06

A useful link for comparing the versions in 1.4 is: https://www.kubeflow.org/docs/releases/kubeflow-1.4/

wg102, Jul 04 '22 14:07

Will @rohank07 be helping with this, since he knows how we rendered the manifests to check the delta, etc.? ^_^

sylus, Jul 07 '22 12:07

I won't be actively working on the KF 1.4 upgrade. But if you want to view the output of the rendered manifest, the command in the taskfile.yaml task stack:aaw:preview should do the trick to view the output manifest and help with debugging.

rohank07, Jul 07 '22 13:07

Ideally someone who worked on it before with me would be active on it; a bit confused about that.

Anyways, this is a 2-week task. If it looks like it might take longer, I'd bring in @rohank07, who worked on it previously, to speed things up :)

sylus, Jul 07 '22 13:07

Closing since we are going directly to 1.6. See https://github.com/StatCan/daaas/issues/1337, which replaces this issue.

wg102, Sep 12 '22 14:09