Test and fix KF 1.6 core components
Test Central-dashboard, Pipelines and JWA. Document and open tickets for issues that arise.
Notebook Controller
- Referencing the upstream notebook controller resolved the Go issues. Removed the custom notebook controller, which added a readiness probe. https://github.com/StatCan/aaw-kubeflow-manifests/commit/9cfdd6372fb789d549f6d5d2c19c63b2b02c94cc
CentralDashBoard
- Modified the AuthorizationPolicy to remove labels. https://github.com/StatCan/aaw-kubeflow-manifests/commit/673479d579a4000508161495f5f3403b13deec90
Training-Operator
- Adjusted the memory and CPU limits and requests. https://github.com/StatCan/aaw-kubeflow-manifests/commit/29e7bea365c4c07853a3bb4b8c8ea995c3d17c7e
The following were added manually. I think I will put them in aaw-kubeflow-manifests, but I am just confirming first.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: centraldashboard
  namespace: kubeflow
spec:
  host: centraldashboard.kubeflow.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: profiles-kfam
  namespace: kubeflow
spec:
  host: profiles-kfam.kubeflow.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
In addition, these were the manual steps to get Knative upgraded:
krm mutatingwebhookconfigurations.admissionregistration.k8s.io webhook.serving.knative.dev
krm validatingwebhookconfigurations.admissionregistration.k8s.io validation.webhook.serving.knative.dev
kubectl create -f ../../../serving-storage-version-migration.yaml (v0.18.3)
kubectl create -f ../../../eventing-post-install.yaml (v0.23.0)
// wait for job to finish
kubectl delete -f ../../../eventing-post-install.yaml
kubectl create -f ../../../eventing-post-install.yaml (v1.2.4)
// wait for job to finish
// re-sync ArgoCD
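The two "wait for job to finish" steps above could be scripted instead of watched by hand. The following is only a minimal sketch: the `wait_for_jobs` helper name, the `knative-eventing` namespace in the example, and the 300s default timeout are assumptions and do not come from the steps above. It relies on `kubectl wait`, which blocks until the listed resources report the given condition.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original steps): block until every
# Job in a namespace reports the Complete condition, or the timeout expires.
# KUBECTL may be overridden, e.g. KUBECTL=echo for a dry run.
wait_for_jobs() {
  namespace="$1"
  timeout="${2:-300s}"   # assumed default; tune to how long the jobs take
  "${KUBECTL:-kubectl}" wait --for=condition=complete job --all \
    -n "$namespace" --timeout="$timeout"
}

# Example call (the namespace is an assumption):
#   wait_for_jobs knative-eventing 600s
```

Note that a failed Job never reaches the Complete condition, so in that case the call returns non-zero once the timeout expires rather than failing immediately.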
Currently Blocked by: https://github.com/StatCan/daaas/issues/1359 and https://github.com/StatCan/kubeflow-pipelines/issues/46
Only encountered this on a few namespaces (mine, jose-matsuda, wendy-gaultier), but it appears that servicerolebindings named user-x-x-clusterrole-edit exist in other namespaces. https://github.com/StatCan/daaas/issues/1381
Issues encountered post KF 1.6 upgrade
- "upstream connect error or disconnect/reset before headers. reset reason: connection failure": With the restart of workloads and the upgrades to node pools, the vault agent pod threw connection errors and was failing to authenticate. A simple restart of the deployment resolved it.
- Installing Conda packages: Users were encountering connection issues (Server Connection Error) when installing Conda packages in their Jupyter notebooks. The J-Frog pods were on a node that was very slow. To resolve this issue, CNS restarted the node. There were also issues with the vault-agent pod when it got upgraded. Conda downloads were using ephemeral storage (everything outside of /home/jovyan). The upgraded version of vault-agent set an ENV, AGENT_INJECT_EPHEMERAL_LIMIT, and its value was too low. CNS rolled back to a version that did not have this ENV.
- GPU workloads were not starting up: The VMs would start but would not get registered with Kubernetes. CNS applied a minor K8s upgrade to fix a driver race condition. This worked temporarily, but then they started failing again. The VMs reported a failure. Azure Services fixed this for us.
- "Unable to attach or mount volumes: unmounted volumes": The Azure scale set was broken. It was stuck failing on the deletion of Virtual Machines and kept requeueing. We opened an Azure support ticket and got this resolved.
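For the vault agent connection issue noted above, the "simple restart of the deployment" might look like the sketch below. The `restart_deployment` helper name, the deployment name `vault-agent-injector`, the namespace `vault`, and the 120s timeout are all assumptions for illustration; the actual names in the cluster were not recorded here.

```shell
#!/bin/sh
# Hypothetical helper (names are illustrative): restart a Deployment and
# wait for the new pods to roll out before returning.
# KUBECTL may be overridden, e.g. KUBECTL=echo for a dry run.
restart_deployment() {
  deployment="$1"
  namespace="$2"
  "${KUBECTL:-kubectl}" rollout restart "deployment/$deployment" -n "$namespace" &&
    "${KUBECTL:-kubectl}" rollout status "deployment/$deployment" -n "$namespace" --timeout=120s
}

# Example call (both names are assumptions):
#   restart_deployment vault-agent-injector vault
```

Waiting on `rollout status` makes the restart fail loudly if the replacement pods never become ready, which is easier to spot than checking pods by hand.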