[BUG] Demo cluster not starting on slow internet connection
Describe the bug
Starting the demo cluster via flytectl demo start returned after ~5 min with this error message:
+---------------------------------------------+---------------+-----------+
| SERVICE | STATUS | NAMESPACE |
+---------------------------------------------+---------------+-----------+
| flyte-kubernetes-dashboard-7fd989b99d-hgmqb | Pending | flyte |
+---------------------------------------------+---------------+-----------+
| minio-55b8c8f4bc-mvjz5 | Pending | flyte |
+---------------------------------------------+---------------+-----------+
| postgres-bdb75f779-cngdp | Running | flyte |
+---------------------------------------------+---------------+-----------+
Error: Get "https://127.0.0.1:30086/api/v1/nodes": dial tcp 127.0.0.1:30086: connect: connection refused
Running flytectl demo exec -- kubectl describe pod -n flyte shows that the pending pods are still pulling their images when the cluster exits. It also turned out that my internet connection was slow, but increasing $FLYTE_TIMEOUT as recommended in https://github.com/flyteorg/flyte/issues/2197 did not help.
Could it be that it fails while waiting for the deployments to become ready (https://github.com/flyteorg/flyte/blob/cf24edfbb8c55be5d29c96f7f6ba761ceb44003f/docker/sandbox-lite/flyte-entrypoint-dind.sh#L63) while the images are still being pulled? I guess the hard-coded --timeout=5m is not affected by changing $FLYTE_TIMEOUT.
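If the root cause is the hard-coded flag, the kind of change that might help could look roughly like this. This is only a sketch, not the actual entrypoint code: the ROLLOUT_TIMEOUT name and the 300-second default are my assumptions, and the kubectl call is shown commented out because it needs a running cluster.

```shell
# Hypothetical sketch: derive the rollout timeout from FLYTE_TIMEOUT instead of
# hard-coding --timeout=5m. Only FLYTE_TIMEOUT exists in the real setup; the
# other names and the default are made up for illustration.
FLYTE_TIMEOUT="${FLYTE_TIMEOUT:-300}"   # seconds; caller can override via --env
ROLLOUT_TIMEOUT="${FLYTE_TIMEOUT}s"     # kubectl expects a duration suffix
echo "rollout timeout: $ROLLOUT_TIMEOUT"
# kubectl rollout status deployment -n flyte --timeout="$ROLLOUT_TIMEOUT"
```

With FLYTE_TIMEOUT unset this falls back to 300s; flytectl demo start --env FLYTE_TIMEOUT=1800 would then stretch the wait to 1800s instead of being ignored.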
Expected behavior
The demo cluster starts successfully.
Additional context to reproduce
- Slow internet connection :)
- flytectl demo start
Screenshots
No response
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
Thank you for opening your first issue here! 🛠
Hmm, thank you for the bug report. I wonder if it is one of the Docker Hub images. Cc @evalsocket
It's not a bug; I just tested demo start and it worked.
@nanohanno Can you please follow the troubleshooting docs https://docs.flyte.org/en/latest/community/troubleshoot.html#troubleshooting-guide and let us know what's happening in the cluster?
Today, with a fast internet connection, it works again. During yesterday's attempts to start it I followed the recommendations in the troubleshooting guide. The report below is from shortly before the cluster exited with the message posted above; the difference is that in this attempt other pods were in Pending or Running status:
>>> flytectl demo exec -- kubectl describe pod -n flyte
Name: postgres-bdb75f779-6cftr
Namespace: flyte
Priority: 0
Node: 2d489bb89f76/172.17.0.2
Start Time: Tue, 26 Jul 2022 13:28:03 +0000
Labels: app.kubernetes.io/instance=flyte
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=postgres
helm.sh/chart=flyte-deps-v1.1.0
pod-template-hash=bdb75f779
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/postgres-bdb75f779
Containers:
postgres:
Container ID:
Image: ecr.flyte.org/ubuntu/postgres:13-21.04_beta
Image ID:
Port: 5432/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 1
memory: 512Mi
Requests:
cpu: 10m
memory: 128Mi
Environment:
POSTGRES_HOST_AUTH_METHOD: trust
POSTGRES_DB: flyteadmin
Mounts:
/var/lib/postgresql/data from postgres-storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5rh7f (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
postgres-storage:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-5rh7f:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m11s default-scheduler Successfully assigned flyte/postgres-bdb75f779-6cftr to 2d489bb89f76
Normal Pulling 2m4s kubelet Pulling image "ecr.flyte.org/ubuntu/postgres:13-21.04_beta"
Name: flyte-kubernetes-dashboard-7fd989b99d-k4m8d
Namespace: flyte
Priority: 0
Node: 2d489bb89f76/172.17.0.2
Start Time: Tue, 26 Jul 2022 13:28:03 +0000
Labels: app.kubernetes.io/component=kubernetes-dashboard
app.kubernetes.io/instance=flyte
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=kubernetes-dashboard
app.kubernetes.io/version=2.2.0
helm.sh/chart=kubernetes-dashboard-4.0.2
pod-template-hash=7fd989b99d
Annotations: seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/flyte-kubernetes-dashboard-7fd989b99d
Containers:
kubernetes-dashboard:
Container ID:
Image: kubernetesui/dashboard:v2.2.0
Image ID:
Port: 9090/TCP
Host Port: 0/TCP
Args:
--namespace=flyte
--metrics-provider=none
--enable-skip-login
--enable-insecure-login
--disable-settings-authorizer
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 200Mi
Requests:
cpu: 100m
memory: 200Mi
Liveness: http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/certs from kubernetes-dashboard-certs (rw)
/tmp from tmp-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wcp7g (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kubernetes-dashboard-certs:
Type: Secret (a volume populated by a Secret)
SecretName: flyte-kubernetes-dashboard-certs
Optional: false
tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-wcp7g:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m11s default-scheduler Successfully assigned flyte/flyte-kubernetes-dashboard-7fd989b99d-k4m8d to 2d489bb89f76
Normal Pulling 2m4s kubelet Pulling image "kubernetesui/dashboard:v2.2.0"
Name: minio-55b8c8f4bc-r4zb9
Namespace: flyte
Priority: 0
Node: 2d489bb89f76/172.17.0.2
Start Time: Tue, 26 Jul 2022 13:28:03 +0000
Labels: app.kubernetes.io/instance=flyte
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=minio
helm.sh/chart=flyte-deps-v1.1.0
pod-template-hash=55b8c8f4bc
Annotations: <none>
Status: Running
IP: 10.42.0.3
IPs:
IP: 10.42.0.3
Controlled By: ReplicaSet/minio-55b8c8f4bc
Containers:
minio:
Container ID: docker://0e6d5d4c9f4ddbe008e4b44f7ab076d42b81b0405bd497fcc81a84929b81b190
Image: ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0
Image ID: docker-pullable://ecr.flyte.org/bitnami/minio@sha256:547a4d0fdc82d5213fef3f4f7215fee788398238c5184a8555837bd3f649525e
Ports: 9000/TCP, 9001/TCP
Host Ports: 0/TCP, 0/TCP
State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 26 Jul 2022 13:29:56 +0000
Finished: Tue, 26 Jul 2022 13:30:06 +0000
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 26 Jul 2022 13:29:45 +0000
Finished: Tue, 26 Jul 2022 13:29:55 +0000
Ready: False
Restart Count: 1
Limits:
cpu: 200m
memory: 512Mi
Requests:
cpu: 10m
memory: 128Mi
Environment:
MINIO_ACCESS_KEY: minio
MINIO_SECRET_KEY: miniostorage
MINIO_DEFAULT_BUCKETS: my-s3-bucket
Mounts:
/data from minio-storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85j2l (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
minio-storage:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-85j2l:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m11s default-scheduler Successfully assigned flyte/minio-55b8c8f4bc-r4zb9 to 2d489bb89f76
Normal Pulling 2m4s kubelet Pulling image "ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0"
Normal Pulled 29s kubelet Successfully pulled image "ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0" in 1m34.861375507s
Normal Pulled 18s kubelet Container image "ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0" already present on machine
Normal Created 18s (x2 over 29s) kubelet Created container minio
Normal Started 18s (x2 over 29s) kubelet Started container minio
Warning BackOff 7s kubelet Back-off restarting failed container
@nanohanno can you also post the logs of the crashing pod, i.e. minio?
Unfortunately, I did not save the logs. I can try to reproduce it and save logs.
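For the next attempt, the logs could be captured with something like the sketch below. The label selector is taken from the describe output above; everything else (variable names, the --previous flag for the already-restarted container) is just my suggested invocation, not an official recipe.

```shell
# Sketch: capture the crashing minio pod's logs on the next reproduction.
# The label app.kubernetes.io/name=minio comes from the describe output above.
NS="flyte"
SELECTOR="app.kubernetes.io/name=minio"
LOG_CMD="flytectl demo exec -- kubectl logs -n $NS -l $SELECTOR --previous"
echo "$LOG_CMD"
# Running $LOG_CMD against the live demo cluster would print the logs of the
# previous (crashed) minio container instance.
```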
I could now reproduce the issue by throttling the download speed using wondershaper. I did not observe the minio pod crashing, but the initial behaviour was the same: even though the timeout was increased, the demo cluster could not start:
flytectl demo start --env FLYTE_TIMEOUT=1800 returned after ~5 min with
+---------------------------------------------+---------------+-----------+
| SERVICE | STATUS | NAMESPACE |
+---------------------------------------------+---------------+-----------+
| minio-55b8c8f4bc-jdbnd | Pending | flyte |
+---------------------------------------------+---------------+-----------+
| postgres-bdb75f779-s8pt7 | Pending | flyte |
+---------------------------------------------+---------------+-----------+
| flyte-kubernetes-dashboard-7fd989b99d-5xrbp | Pending | flyte |
+---------------------------------------------+---------------+-----------+
Error: Get "https://127.0.0.1:30086/api/v1/nodes": dial tcp 127.0.0.1:30086: connect: connection refused
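For reference, the throttling setup can be sketched roughly as below. The interface name and the rate are assumptions on my part (the original report does not state them), and the privileged commands are shown commented out.

```shell
# Sketch of reproducing a slow connection with wondershaper (classic syntax:
# wondershaper <iface> <down_kbps> <up_kbps>). Interface and rate are assumptions.
IFACE="${IFACE:-eth0}"
RATE_KBPS=1024   # ~1 Mbit/s in each direction
THROTTLE_CMD="wondershaper $IFACE $RATE_KBPS $RATE_KBPS"
echo "$THROTTLE_CMD"
# sudo $THROTTLE_CMD                 # throttle, then run: flytectl demo start
# sudo wondershaper clear "$IFACE"   # restore full speed afterwards
```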
Before the container exited, kubectl describe pod -n flyte gave
Name: postgres-bdb75f779-fk5s7
Namespace: flyte
Priority: 0
Node: 425cdc8ffa33/172.17.0.2
Start Time: Fri, 29 Jul 2022 13:09:15 +0000
Labels: app.kubernetes.io/instance=flyte
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=postgres
helm.sh/chart=flyte-deps-v1.1.0
pod-template-hash=bdb75f779
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/postgres-bdb75f779
Containers:
postgres:
Container ID:
Image: ecr.flyte.org/ubuntu/postgres:13-21.04_beta
Image ID:
Port: 5432/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 1
memory: 512Mi
Requests:
cpu: 10m
memory: 128Mi
Environment:
POSTGRES_HOST_AUTH_METHOD: trust
POSTGRES_DB: flyteadmin
Mounts:
/var/lib/postgresql/data from postgres-storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vg4gt (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
postgres-storage:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-vg4gt:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m46s default-scheduler Successfully assigned flyte/postgres-bdb75f779-fk5s7 to 425cdc8ffa33
Normal Pulling 3m39s kubelet Pulling image "ecr.flyte.org/ubuntu/postgres:13-21.04_beta"
Name: minio-55b8c8f4bc-z4jtf
Namespace: flyte
Priority: 0
Node: 425cdc8ffa33/172.17.0.2
Start Time: Fri, 29 Jul 2022 13:09:15 +0000
Labels: app.kubernetes.io/instance=flyte
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=minio
helm.sh/chart=flyte-deps-v1.1.0
pod-template-hash=55b8c8f4bc
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/minio-55b8c8f4bc
Containers:
minio:
Container ID:
Image: ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0
Image ID:
Ports: 9000/TCP, 9001/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 200m
memory: 512Mi
Requests:
cpu: 10m
memory: 128Mi
Environment:
MINIO_ACCESS_KEY: minio
MINIO_SECRET_KEY: miniostorage
MINIO_DEFAULT_BUCKETS: my-s3-bucket
Mounts:
/data from minio-storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2xcm9 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
minio-storage:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-2xcm9:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m46s default-scheduler Successfully assigned flyte/minio-55b8c8f4bc-z4jtf to 425cdc8ffa33
Normal Pulling 3m39s kubelet Pulling image "ecr.flyte.org/bitnami/minio:2021.10.13-debian-10-r0"
Name: flyte-kubernetes-dashboard-7fd989b99d-qnjds
Namespace: flyte
Priority: 0
Node: 425cdc8ffa33/172.17.0.2
Start Time: Fri, 29 Jul 2022 13:09:15 +0000
Labels: app.kubernetes.io/component=kubernetes-dashboard
app.kubernetes.io/instance=flyte
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=kubernetes-dashboard
app.kubernetes.io/version=2.2.0
helm.sh/chart=kubernetes-dashboard-4.0.2
pod-template-hash=7fd989b99d
Annotations: seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/flyte-kubernetes-dashboard-7fd989b99d
Containers:
kubernetes-dashboard:
Container ID:
Image: kubernetesui/dashboard:v2.2.0
Image ID:
Port: 9090/TCP
Host Port: 0/TCP
Args:
--namespace=flyte
--metrics-provider=none
--enable-skip-login
--enable-insecure-login
--disable-settings-authorizer
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 200Mi
Requests:
cpu: 100m
memory: 200Mi
Liveness: http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/certs from kubernetes-dashboard-certs (rw)
/tmp from tmp-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hp7zx (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kubernetes-dashboard-certs:
Type: Secret (a volume populated by a Secret)
SecretName: flyte-kubernetes-dashboard-certs
Optional: false
tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-hp7zx:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m46s default-scheduler Successfully assigned flyte/flyte-kubernetes-dashboard-7fd989b99d-qnjds to 425cdc8ffa33
Normal Pulling 3m39s kubelet Pulling image "kubernetesui/dashboard:v2.2.0"
Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏
Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏
There's only so much we can do to ease this pain; making the images slimmer relies on the single-binary work.