[BUG] Demo cluster not starting on slow internet connection

Open nanohanno opened this issue 3 years ago • 7 comments

Describe the bug

Starting the demo cluster via flytectl demo start returned after ~5 min with this error message

+---------------------------------------------+---------------+-----------+
|                   SERVICE                   |    STATUS     | NAMESPACE |
+---------------------------------------------+---------------+-----------+
| flyte-kubernetes-dashboard-7fd989b99d-hgmqb | Pending       | flyte     |
+---------------------------------------------+---------------+-----------+
| minio-55b8c8f4bc-mvjz5                      | Pending       | flyte     |
+---------------------------------------------+---------------+-----------+
| postgres-bdb75f779-cngdp                    | Running       | flyte     |
+---------------------------------------------+---------------+-----------+
Error: Get "https://127.0.0.1:30086/api/v1/nodes": dial tcp 127.0.0.1:30086: connect: connection refused

Running flytectl demo exec -- kubectl describe pod -n flyte shows that the pending pods are still pulling their images when the command exits. It turned out that my internet connection was slow, but increasing $FLYTE_TIMEOUT as recommended in https://github.com/flyteorg/flyte/issues/2197 did not help.

Could it be that it fails while waiting for the deployments to be ready (https://github.com/flyteorg/flyte/blob/cf24edfbb8c55be5d29c96f7f6ba761ceb44003f/docker/sandbox-lite/flyte-entrypoint-dind.sh#L63) when it is still loading the image? The timeout --timeout=5m is not affected by changing $FLYTE_TIMEOUT I guess.
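
To illustrate the suspicion above, here is a minimal sketch of how a readiness wait could take its timeout from $FLYTE_TIMEOUT instead of a fixed 5m. This is not the actual content of flyte-entrypoint-dind.sh. The function names wait_timeout_arg and wait_for_deployments are hypothetical, and it assumes $FLYTE_TIMEOUT is a number of seconds.

```shell
#!/bin/sh
# Hypothetical helper (not from the Flyte repo): build the --timeout flag
# from FLYTE_TIMEOUT, assumed to be in seconds. Falls back to 300s, which
# equals the hard-coded 5m mentioned above.
wait_timeout_arg() {
    printf '%s\n' "--timeout=${FLYTE_TIMEOUT:-300}s"
}

# Hypothetical wrapper around the readiness wait. It is only defined here,
# not called, so sourcing this file does not require a running cluster.
wait_for_deployments() {
    kubectl wait --for=condition=available deployment --all \
        -n flyte "$(wait_timeout_arg)"
}
```

With FLYTE_TIMEOUT=1800 exported, wait_for_deployments would pass --timeout=1800s to kubectl wait, so a slow image pull would have up to 30 minutes rather than 5.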

Expected behavior

The demo cluster starts successfully.

Additional context to reproduce

  1. Slow internet connection :)
  2. flytectl demo start

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

nanohanno avatar Jul 26 '22 13:07 nanohanno

Thank you for opening your first issue here! 🛠

welcome[bot] avatar Jul 26 '22 13:07 welcome[bot]

Hmm thank you for the bug, I wonder if it is one of the dockerhub images. Cc @evalsocket

kumare3 avatar Jul 27 '22 04:07 kumare3

It's not a bug, I just tested the demo start and it worked.

@nanohanno Can you please follow the troubleshooting docs https://docs.flyte.org/en/latest/community/troubleshoot.html#troubleshooting-guide and let us know what's happening in the cluster?

yindia avatar Jul 27 '22 05:07 yindia

Today with a fast internet connection it works again. During yesterday's attempts to start it I did what is recommended in the troubleshooting guide. This report is from shortly before it exited with the message posted above, with the difference that in this attempt other pods were in pending or running status:

>>> flytectl demo exec -- kubectl describe pod -n flyte
Name:           postgres-bdb75f779-6cftr
Namespace:      flyte
Priority:       0
Node:           2d489bb89f76/172.17.0.2
Start Time:     Tue, 26 Jul 2022 13:28:03 +0000
Labels:         app.kubernetes.io/instance=flyte
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=postgres
                helm.sh/chart=flyte-deps-v1.1.0
                pod-template-hash=bdb75f779
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/postgres-bdb75f779
Containers:
  postgres:
    Container ID:   
    Image:          ecr.flyte.org/ubuntu/postgres:13-21.04_beta
    Image ID:       
    Port:           5432/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:     10m
      memory:  128Mi
    Environment:
      POSTGRES_HOST_AUTH_METHOD:  trust
      POSTGRES_DB:                flyteadmin
    Mounts:
      /var/lib/postgresql/data from postgres-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5rh7f (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  postgres-storage:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-5rh7f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m11s  default-scheduler  Successfully assigned flyte/postgres-bdb75f779-6cftr to 2d489bb89f76
  Normal  Pulling    2m4s   kubelet            Pulling image "ecr.flyte.org/ubuntu/postgres:13-21.04_beta"


Name:           flyte-kubernetes-dashboard-7fd989b99d-k4m8d
Namespace:      flyte
Priority:       0
Node:           2d489bb89f76/172.17.0.2
Start Time:     Tue, 26 Jul 2022 13:28:03 +0000
Labels:         app.kubernetes.io/component=kubernetes-dashboard
                app.kubernetes.io/instance=flyte
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=kubernetes-dashboard
                app.kubernetes.io/version=2.2.0
                helm.sh/chart=kubernetes-dashboard-4.0.2
                pod-template-hash=7fd989b99d
Annotations:    seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/flyte-kubernetes-dashboard-7fd989b99d
Containers:
  kubernetes-dashboard:
    Container ID:  
    Image:         kubernetesui/dashboard:v2.2.0
    Image ID:      
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --namespace=flyte
      --metrics-provider=none
      --enable-skip-login
      --enable-insecure-login
      --disable-settings-authorizer
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  200Mi
    Requests:
      cpu:        100m
      memory:     200Mi
    Liveness:     http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /certs from kubernetes-dashboard-certs (rw)
      /tmp from tmp-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wcp7g (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kubernetes-dashboard-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  flyte-kubernetes-dashboard-certs
    Optional:    false
  tmp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-wcp7g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m11s  default-scheduler  Successfully assigned flyte/flyte-kubernetes-dashboard-7fd989b99d-k4m8d to 2d489bb89f76
  Normal  Pulling    2m4s   kubelet            Pulling image "kubernetesui/dashboard:v2.2.0"


Name:         minio-55b8c8f4bc-r4zb9
Namespace:    flyte
Priority:     0
Node:         2d489bb89f76/172.17.0.2
Start Time:   Tue, 26 Jul 2022 13:28:03 +0000
Labels:       app.kubernetes.io/instance=flyte
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=minio
              helm.sh/chart=flyte-deps-v1.1.0
              pod-template-hash=55b8c8f4bc
Annotations:  <none>
Status:       Running
IP:           10.42.0.3
IPs:
  IP:           10.42.0.3
Controlled By:  ReplicaSet/minio-55b8c8f4bc
Containers:
  minio:
    Container ID:   docker://0e6d5d4c9f4ddbe008e4b44f7ab076d42b81b0405bd497fcc81a84929b81b190
    Image:          ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0
    Image ID:       docker-pullable://ecr.flyte.org/bitnami/minio@sha256:547a4d0fdc82d5213fef3f4f7215fee788398238c5184a8555837bd3f649525e
    Ports:          9000/TCP, 9001/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 26 Jul 2022 13:29:56 +0000
      Finished:     Tue, 26 Jul 2022 13:30:06 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 26 Jul 2022 13:29:45 +0000
      Finished:     Tue, 26 Jul 2022 13:29:55 +0000
    Ready:          False
    Restart Count:  1
    Limits:
      cpu:     200m
      memory:  512Mi
    Requests:
      cpu:     10m
      memory:  128Mi
    Environment:
      MINIO_ACCESS_KEY:       minio
      MINIO_SECRET_KEY:       miniostorage
      MINIO_DEFAULT_BUCKETS:  my-s3-bucket
    Mounts:
      /data from minio-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85j2l (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  minio-storage:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-85j2l:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  2m11s              default-scheduler  Successfully assigned flyte/minio-55b8c8f4bc-r4zb9 to 2d489bb89f76
  Normal   Pulling    2m4s               kubelet            Pulling image "ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0"
  Normal   Pulled     29s                kubelet            Successfully pulled image "ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0" in 1m34.861375507s
  Normal   Pulled     18s                kubelet            Container image "ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0" already present on machine
  Normal   Created    18s (x2 over 29s)  kubelet            Created container minio
  Normal   Started    18s (x2 over 29s)  kubelet            Started container minio
  Warning  BackOff    7s                 kubelet            Back-off restarting failed container

nanohanno avatar Jul 27 '22 10:07 nanohanno

@nanohanno can you also post the logs of the crashing pod i.e. minio.

yindia avatar Jul 27 '22 11:07 yindia

Unfortunately, I did not save the logs. I can try to reproduce it and save logs.

nanohanno avatar Jul 27 '22 12:07 nanohanno

I could now reproduce the issue by throttling the download speed using wondershaper. I did not observe the minio pod crashing, but the initial behaviour was the same. Even though the timeout was increased, the demo cluster could not be started:

flytectl demo start --env FLYTE_TIMEOUT=1800 returned after ~5 min with

+---------------------------------------------+---------------+-----------+
|                   SERVICE                   |    STATUS     | NAMESPACE |
+---------------------------------------------+---------------+-----------+
| minio-55b8c8f4bc-jdbnd                      | Pending       | flyte     |
+---------------------------------------------+---------------+-----------+
| postgres-bdb75f779-s8pt7                    | Pending       | flyte     |
+---------------------------------------------+---------------+-----------+
| flyte-kubernetes-dashboard-7fd989b99d-5xrbp | Pending       | flyte     |
+---------------------------------------------+---------------+-----------+
Error: Get "https://127.0.0.1:30086/api/v1/nodes": dial tcp 127.0.0.1:30086: connect: connection refused

Before the container exited, kubectl describe pod -n flyte gave

Name:           postgres-bdb75f779-fk5s7
Namespace:      flyte
Priority:       0
Node:           425cdc8ffa33/172.17.0.2
Start Time:     Fri, 29 Jul 2022 13:09:15 +0000
Labels:         app.kubernetes.io/instance=flyte
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=postgres
                helm.sh/chart=flyte-deps-v1.1.0
                pod-template-hash=bdb75f779
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/postgres-bdb75f779
Containers:
  postgres:
    Container ID:   
    Image:          ecr.flyte.org/ubuntu/postgres:13-21.04_beta
    Image ID:       
    Port:           5432/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:     10m
      memory:  128Mi
    Environment:
      POSTGRES_HOST_AUTH_METHOD:  trust
      POSTGRES_DB:                flyteadmin
    Mounts:
      /var/lib/postgresql/data from postgres-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vg4gt (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  postgres-storage:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-vg4gt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m46s  default-scheduler  Successfully assigned flyte/postgres-bdb75f779-fk5s7 to 425cdc8ffa33
  Normal  Pulling    3m39s  kubelet            Pulling image "ecr.flyte.org/ubuntu/postgres:13-21.04_beta"


Name:           minio-55b8c8f4bc-z4jtf
Namespace:      flyte
Priority:       0
Node:           425cdc8ffa33/172.17.0.2
Start Time:     Fri, 29 Jul 2022 13:09:15 +0000
Labels:         app.kubernetes.io/instance=flyte
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=minio
                helm.sh/chart=flyte-deps-v1.1.0
                pod-template-hash=55b8c8f4bc
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/minio-55b8c8f4bc
Containers:
  minio:
    Container ID:   
    Image:          ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0
    Image ID:       
    Ports:          9000/TCP, 9001/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  512Mi
    Requests:
      cpu:     10m
      memory:  128Mi
    Environment:
      MINIO_ACCESS_KEY:       minio
      MINIO_SECRET_KEY:       miniostorage
      MINIO_DEFAULT_BUCKETS:  my-s3-bucket
    Mounts:
      /data from minio-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2xcm9 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  minio-storage:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-2xcm9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m46s  default-scheduler  Successfully assigned flyte/minio-55b8c8f4bc-z4jtf to 425cdc8ffa33
  Normal  Pulling    3m39s  kubelet            Pulling image "ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0"


Name:           flyte-kubernetes-dashboard-7fd989b99d-qnjds
Namespace:      flyte
Priority:       0
Node:           425cdc8ffa33/172.17.0.2
Start Time:     Fri, 29 Jul 2022 13:09:15 +0000
Labels:         app.kubernetes.io/component=kubernetes-dashboard
                app.kubernetes.io/instance=flyte
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=kubernetes-dashboard
                app.kubernetes.io/version=2.2.0
                helm.sh/chart=kubernetes-dashboard-4.0.2
                pod-template-hash=7fd989b99d
Annotations:    seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/flyte-kubernetes-dashboard-7fd989b99d
Containers:
  kubernetes-dashboard:
    Container ID:  
    Image:         kubernetesui/dashboard:v2.2.0
    Image ID:      
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --namespace=flyte
      --metrics-provider=none
      --enable-skip-login
      --enable-insecure-login
      --disable-settings-authorizer
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  200Mi
    Requests:
      cpu:        100m
      memory:     200Mi
    Liveness:     http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /certs from kubernetes-dashboard-certs (rw)
      /tmp from tmp-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hp7zx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kubernetes-dashboard-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  flyte-kubernetes-dashboard-certs
    Optional:    false
  tmp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-hp7zx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m46s  default-scheduler  Successfully assigned flyte/flyte-kubernetes-dashboard-7fd989b99d-qnjds to 425cdc8ffa33
  Normal  Pulling    3m39s  kubelet            Pulling image "kubernetesui/dashboard:v2.2.0"

nanohanno avatar Jul 29 '22 13:07 nanohanno

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar Sep 03 '23 00:09 github-actions[bot]

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar Sep 12 '23 01:09 github-actions[bot]

There's only so much we can do to ease this pain, including making the images that single-binary relies on slimmer.

eapolinario avatar Dec 22 '23 19:12 eapolinario