Test and fix KF 1.6 core components

Open rohank07 opened this issue 3 years ago • 4 comments

Test Central-dashboard, Pipelines and JWA. Document and open tickets for issues that arise.

rohank07 avatar Sep 21 '22 12:09 rohank07

Notebook Controller

  • Referencing the upstream notebook controller resolved the Go issues. Removed the custom notebook controller, which added a readiness probe. https://github.com/StatCan/aaw-kubeflow-manifests/commit/9cfdd6372fb789d549f6d5d2c19c63b2b02c94cc

CentralDashBoard

  • Modified the AuthorizationPolicy to remove labels. https://github.com/StatCan/aaw-kubeflow-manifests/commit/673479d579a4000508161495f5f3403b13deec90

Training-Operator

  • Adjusted the memory and CPU limits and requests. https://github.com/StatCan/aaw-kubeflow-manifests/commit/29e7bea365c4c07853a3bb4b8c8ea995c3d17c7e

rohank07 avatar Sep 21 '22 18:09 rohank07

The following were added manually. I think I will put them in aaw-kubeflow-manifests, but I am confirming first.

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: centraldashboard
  namespace: kubeflow
spec:
  host: centraldashboard.kubeflow.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: profiles-kfam
  namespace: kubeflow
spec:
  host: profiles-kfam.kubeflow.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL

In addition, these were the manual steps to get Knative upgraded:

 krm mutatingwebhookconfigurations.admissionregistration.k8s.io webhook.serving.knative.dev
 krm validatingwebhookconfigurations.admissionregistration.k8s.io validation.webhook.serving.knative.dev
 kubectl create -f ../../../serving-storage-version-migration.yaml (v0.18.3)
 kubectl create -f ../../../eventing-post-install.yaml (v0.23.0)
 // wait for job to finish
 kubectl delete -f ../../../eventing-post-install.yaml
 kubectl create -f ../../../eventing-post-install.yaml (v1.2.4)
 // wait for job to finish
 // re-sync ArgoCD
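
The manual steps above could be collected into a small script. This is only a sketch under several assumptions that are not confirmed in the thread: `krm` is taken to be a local alias for `kubectl delete`, the post-install job is assumed to run in the `knative-eventing` namespace, and the `run` and `wait_for_jobs` helpers are hypothetical names. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
set -eu

# Default to a dry run so nothing touches the cluster by accident.
DRY_RUN="${DRY_RUN:-1}"
# Assumption: the Knative Eventing post-install job runs in this namespace.
EVENTING_NS="${EVENTING_NS:-knative-eventing}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Block until every job in the namespace reports the Complete condition.
wait_for_jobs() {
  run kubectl wait --for=condition=complete job --all -n "$EVENTING_NS" --timeout=600s
}

knative_upgrade() {
  # Assumption: `krm` in the original steps is an alias for `kubectl delete`.
  run kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io webhook.serving.knative.dev
  run kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io validation.webhook.serving.knative.dev
  run kubectl create -f ../../../serving-storage-version-migration.yaml   # v0.18.3
  run kubectl create -f ../../../eventing-post-install.yaml               # v0.23.0
  wait_for_jobs
  run kubectl delete -f ../../../eventing-post-install.yaml
  # The file on disk must be switched to the v1.2.4 manifest before this step.
  run kubectl create -f ../../../eventing-post-install.yaml               # v1.2.4
  wait_for_jobs
  echo "Re-sync ArgoCD once both jobs have finished."
}
```

Calling `knative_upgrade` prints the planned commands for review; exporting `DRY_RUN=0` before sourcing the script makes it execute them instead.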

sylus avatar Sep 23 '22 14:09 sylus

Currently blocked by: https://github.com/StatCan/daaas/issues/1359 and https://github.com/StatCan/kubeflow-pipelines/issues/46

rohank07 avatar Sep 28 '22 14:09 rohank07

I only encountered this on a few namespaces (mine, jose-matsuda, wendy-gaultier), but it appears that servicerolebindings named user-x-x-clusterrole-edit exist in other namespaces. https://github.com/StatCan/daaas/issues/1381

rohank07 avatar Oct 12 '22 15:10 rohank07

Issues encountered post KF 1.6 upgrade

  • upstream connect error or disconnect/reset before headers. reset reason: connection failure - After workloads were restarted and node pools were upgraded, the vault agent pod started throwing connection errors because it was failing to authenticate. A simple restart of the deployment resolved it.
  • Installing Conda packages: Users were encountering connection issues (Server Connection Error) when installing Conda packages in their Jupyter notebooks. The JFrog pods were on a node that was very slow. To resolve this issue, CNS restarted the node. There were also issues with the vault-agent pod after it was upgraded. Conda downloads were using ephemeral storage (everything outside of /home/jovyan). The upgraded version of vault-agent set an environment variable, AGENT_INJECT_EPHEMERAL_LIMIT, whose value was too low. CNS rolled back to a version that did not have this variable.
  • GPU workloads were not starting up - The VMs would start but would not register with Kubernetes. CNS applied a minor K8s upgrade to fix a driver race condition. This worked temporarily, but then they started failing again. The VMs reported a failure. Azure Services fixed this for us.
  • Unable to attach or mount volumes: unmounted volumes - The Azure scale set was broken. It was stuck failing to delete Virtual Machines and kept requeueing. We opened an Azure support ticket and got this resolved.
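
For the first item in the list above, restarting a deployment and waiting for it to come back could look roughly like this. It is a sketch only: the thread does not name the namespace or deployment, so both are passed in as placeholders, and `run` and `restart_deployment` are hypothetical helper names. It defaults to a dry run that prints the commands.

```shell
#!/bin/sh
set -eu

# Default to a dry run so nothing touches the cluster by accident.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Restart a deployment and wait for the new pods to become ready.
# Both arguments are placeholders; the original comment does not name them.
restart_deployment() {
  namespace="$1"
  deployment="$2"
  run kubectl rollout restart "deployment/$deployment" -n "$namespace"
  run kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=300s
}
```

For example, `restart_deployment example-ns example-deployment` prints the two `kubectl rollout` commands it would run.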

rohank07 avatar Nov 15 '22 13:11 rohank07