
Notebook server extremely slow to shut down

Open · StanHatko opened this issue 3 years ago · 3 comments

Describe the bug

Occasionally I see a notebook server that, when terminated, takes an extremely long time (a day or more) to delete. The most recent case is a non-Protected B notebook server called npb1.
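
For reference, the stuck state can be confirmed by checking how long ago the pod's deletion was requested (a sketch using the namespace and pod name given below):

# Shows when deletion was requested; the pod has been Terminating since this time.
kubectl get pod npb1-0 -n greenhouse-detection -o jsonpath='{.metadata.deletionTimestamp}'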

Environment info

Namespace: greenhouse-detection

Notebook/server: npb1

Steps to reproduce

This is not reliably reproducible (as far as I know); it only occasionally happens when a notebook server is shut down. I don't know whether there is a specific condition that would make it reliably reproducible.

Expected behaviour

Shutting down a notebook server should take a few minutes at most, not hours or days.

Screenshots

[screenshot attached]

Additional context

Here is the output of kubectl describe pod npb1-0; much of it is redacted (I can send the full unredacted version to AAW maintainers if necessary):

Name:                      npb1-0
Namespace:                 greenhouse-detection
Priority:                  0
Node:                      REDACTED
Start Time:                Fri, 27 May 2022 15:29:31 +0000
Labels:                    access-ml-pipeline=true
                           controller-revision-hash=npb1-REDACTED
                           istio.io/rev=default
                           minio-mounts=true
                           notebook-name=npb1
                           security.istio.io/tlsMode=istio
                           service.istio.io/canonical-name=npb1
                           service.istio.io/canonical-revision=latest
                           statefulset=npb1
                           statefulset.kubernetes.io/pod-name=npb1-0
Annotations:               data.statcan.gc.ca/inject-boathouse: true
                           poddefault.admission.kubeflow.org/poddefault-access-ml-pipeline: REDACTED
                           poddefault.admission.kubeflow.org/poddefault-minio-mounts: REDACTED
                           prometheus.io/path: /stats/prometheus
                           prometheus.io/port: REDACTED
                           prometheus.io/scrape: true
                           sidecar.istio.io/status:
                             {"version":"REDACTED","initContainers":["istio-validation"],"containers":[REDACTED
Status:                    Terminating (lasts 18h)
Termination Grace Period:  30s
IP:                        REDACTED
Controlled By:             StatefulSet/npb1
Init Containers:
  istio-validation:
    REDACTED
Containers:
  npb1:
    Container ID:   
    Image:          REDACTED
    Image ID:       
    Port:           8888/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       PodInitializing
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  4Gi
    Requests:
      cpu:     1
      memory:  4Gi
    Environment:
      REDACTED
    Mounts:
      REDACTED
  istio-proxy:
    Container ID:  
    Image:         docker.io/istio/proxyv2:1.7.8
    Image ID:      
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      REDACTED
    State:          Waiting
      Reason:       PodInitializing
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   128Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=1s period=2s #success=1 #failure=30
    Environment:
      REDACTED
    Mounts:
      REDACTED
  vault-agent:
    Container ID:  
    Image:         vault:1.7.2
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -ec
    Args:
      echo ${VAULT_CONFIG?} | base64 -d > /home/vault/config.json && vault agent -config=/home/vault/config.json
    State:          Waiting
      Reason:       PodInitializing
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  128Mi
    Requests:
      cpu:     250m
      memory:  64Mi
    Environment:
      VAULT_LOG_LEVEL:   info
      VAULT_LOG_FORMAT:  standard
      VAULT_CONFIG:      REDACTED
    Mounts:
      /home/vault from home-sidecar (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from REDACTED
      /vault/secrets from vault-secrets (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  workspace-npb1:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  workspace-npb1
    ReadOnly:   false
  REDACTED
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  home-sidecar:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  vault-secrets:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      Memory
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     data.statcan.gc.ca/classification=unclassified:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 node.statcan.gc.ca/purpose=user:NoSchedule
                 node.statcan.gc.ca/use=general:NoSchedule
Events:          <none>
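
For reference, something like the following could help narrow down what is blocking deletion (a sketch only; the force delete just removes the API object and does not address the underlying cause):

# List any finalizers that could be holding the pod in Terminating.
kubectl get pod npb1-0 -n greenhouse-detection -o jsonpath='{.metadata.finalizers}'

# Last-resort workaround, not a fix: remove the pod object without waiting for the kubelet.
kubectl delete pod npb1-0 -n greenhouse-detection --grace-period=0 --force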

StanHatko · Jun 01 '22 14:06

Maybe this is failing to unmount the minio mounts.

For anyone who looks into this, consult the kubelet logs on the node for the particular pod.
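
Something along these lines should surface the relevant entries (a sketch; exact log access depends on the node setup):

# Find which node is hosting the stuck pod.
kubectl get pod npb1-0 -n greenhouse-detection -o wide

# On that node (via SSH or a node debug shell), filter kubelet logs for the pod,
# watching in particular for volume unmount or teardown errors.
journalctl -u kubelet | grep npb1-0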

sylus · Jun 03 '22 17:06

The notebook server npb1 is still shutting down, even now.

StanHatko · Jun 06 '22 19:06

@brendangadd @cboin1996

sylus · Jun 06 '22 19:06