Apisix-Etcd-0 CrashLoopBackOff
Name and Version
targetRevision: v2.7.0
EKS version: 1.29
What architecture are you using?
amd64
What steps will reproduce the bug?
Deploy the chart
Are you using any custom parameters or values?
Yes
What is the expected behavior?
apisix-etcd-0, apisix-etcd-1, and apisix-etcd-2 in a Running state
What do you see instead?
CrashLoopBackOff
{"level":"warn","ts":"2024-07-08T10:17:39.664Z","caller":"etcdserver/server.go:1127","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2024-07-08T10:17:39.664Z","caller":"etcdserver/server.go:1128","msg":"data-dir used by this member must be removed"}
{"level":"warn","ts":"2024-07-08T10:17:39.665Z","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"dca459b91c9da974","local-member-attributes":"{Name:apisix-etcd-0 ClientURLs:[http://apisix-etcd-0.apisix-etcd-headless.api-gw.svc.cluster.local:2379 http://apisix-etcd.api-gw.svc.cluster.local:2379]}","request-path":"/0/members/dca459b91c9da974/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2024-07-08T10:17:39.665Z","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"dca459b91c9da974","local-member-attributes":"{Name:apisix-etcd-0 ClientURLs:[http://apisix-etcd-0.apisix-etcd-headless.api-gw.svc.cluster.local:2379 http://apisix-etcd.api-gw.svc.cluster.local:2379]}","request-path":"/0/members/dca459b91c9da974/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2024-07-08T10:17:39.665Z","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"dca459b91c9da974","local-member-attributes":"{Name:apisix-etcd-0 ClientURLs:[http://apisix-etcd-0.apisix-etcd-headless.api-gw.svc.cluster.local:2379 http://apisix-etcd.api-gw.svc.cluster.local:2379]}","request-path":"/0/members/dca459b91c9da974/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2024-07-08T10:17:39.665Z","caller":"etcdserver/server.go:2073","msg":"stopped publish because server is stopped","local-member-id":"dca459b91c9da974","local-member-attributes":"{Name:apisix-etcd-0 ClientURLs:[http://apisix-etcd-0.apisix-etcd-headless.api-gw.svc.cluster.local:2379 http://apisix-etcd.api-gw.svc.cluster.local:2379]}","publish-timeout":"7s","error":"etcdserver: server stopped"}
kubectl get all -n api-gw
NAME READY STATUS RESTARTS AGE
pod/apisix-6fdf6b9c66-64z2f 1/1 Running 0 17h
pod/apisix-6fdf6b9c66-brsjk 1/1 Running 0 17h
pod/apisix-6fdf6b9c66-jz2rk 1/1 Running 0 17h
pod/apisix-etcd-0 0/1 CrashLoopBackOff 203 (38s ago) 17h
pod/apisix-etcd-1 0/1 CrashLoopBackOff 202 (5m2s ago) 17h
pod/apisix-etcd-2 1/1 Running 0 17h
pod/apisix-ingress-controller-844c65bfdf-5v799 1/1 Running 0 17h
pod/apisix-ingress-controller-844c65bfdf-fzhvg 1/1 Running 0 22h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/apisix-admin ClusterIP 172.20.187.18 <none> 9180/TCP 22h
service/apisix-etcd ClusterIP 172.20.177.186 <none> 2379/TCP,2380/TCP 22h
service/apisix-etcd-headless ClusterIP None <none> 2379/TCP,2380/TCP 22h
service/apisix-gateway NodePort 172.20.111.3 <none> 80:31196/TCP,443:31359/TCP 22h
service/apisix-ingress-controller ClusterIP 172.20.246.190 <none> 80/TCP 22h
service/apisix-ingress-controller-apisix-gateway NodePort 172.20.219.206 <none> 80:30461/TCP,443:31371/TCP 22h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/apisix 3/3 3 3 22h
deployment.apps/apisix-ingress-controller 2/2 2 2 22h
NAME DESIRED CURRENT READY AGE
replicaset.apps/apisix-6fdf6b9c66 3 3 3 22h
replicaset.apps/apisix-ingress-controller-844c65bfdf 2 2 2 22h
NAME READY AGE
statefulset.apps/apisix-etcd 1/3 22h
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
horizontalpodautoscaler.autoscaling/apisix Deployment/apisix 7%/80%, 61%/80% 3 6 3 22h
kubectl describe pods -n api-gw apisix-etcd-0
Name: apisix-etcd-0
Namespace: api-gw
Priority: 0
Service Account: default
Node: ip-10-0-18-102.eu-north-1.compute.internal/10.0.18.102
Start Time: Sun, 07 Jul 2024 20:15:28 +0300
Labels: app.kubernetes.io/instance=apisix
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=etcd
apps.kubernetes.io/pod-index=0
controller-revision-hash=apisix-etcd-5d9864fd68
helm.sh/chart=etcd-8.7.7
statefulset.kubernetes.io/pod-name=apisix-etcd-0
Annotations: checksum/token-secret: 622d20823882c1300c1be66970c8a4304a57e6d674f4c7da8a29e8e8062bb7c1
Status: Running
IP: 10.0.18.31
IPs:
IP: 10.0.18.31
Controlled By: StatefulSet/apisix-etcd
Containers:
etcd:
Container ID: containerd://3e7d388fe249ab387b0f2af890addeffc3fe592b8ee8f4e47362d2e6dd33f13a
Image: docker.io/bitnami/etcd:3.5.7-debian-11-r14
Image ID: docker.io/bitnami/etcd@sha256:0825cafa1c5f0c97d86009f3af8c0f5a9d4279fcfdeb0a2a09b84a1eb7893a13
Ports: 2379/TCP, 2380/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 08 Jul 2024 13:27:59 +0300
Finished: Mon, 08 Jul 2024 13:28:04 +0300
Ready: False
Restart Count: 203
Liveness: exec [/opt/bitnami/scripts/etcd/healthcheck.sh] delay=60s timeout=5s period=30s #success=1 #failure=5
Readiness: exec [/opt/bitnami/scripts/etcd/healthcheck.sh] delay=60s timeout=5s period=10s #success=1 #failure=5
Environment:
BITNAMI_DEBUG: false
MY_POD_IP: (v1:status.podIP)
MY_POD_NAME: apisix-etcd-0 (v1:metadata.name)
MY_STS_NAME: apisix-etcd
ETCDCTL_API: 3
ETCD_ON_K8S: yes
ETCD_START_FROM_SNAPSHOT: no
ETCD_DISASTER_RECOVERY: no
ETCD_NAME: $(MY_POD_NAME)
ETCD_DATA_DIR: /bitnami/etcd/data
ETCD_LOG_LEVEL: info
ALLOW_NONE_AUTHENTICATION: yes
ETCD_AUTH_TOKEN: jwt,priv-key=/opt/bitnami/etcd/certs/token/jwt-token.pem,sign-method=RS256,ttl=10m
ETCD_ADVERTISE_CLIENT_URLS: http://$(MY_POD_NAME).apisix-etcd-headless.api-gw.svc.cluster.local:2379,http://apisix-etcd.api-gw.svc.cluster.local:2379
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://$(MY_POD_NAME).apisix-etcd-headless.api-gw.svc.cluster.local:2380
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_INITIAL_CLUSTER_TOKEN: etcd-cluster-k8s
ETCD_INITIAL_CLUSTER_STATE: new
ETCD_INITIAL_CLUSTER: apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.api-gw.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.api-gw.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.api-gw.svc.cluster.local:2380
ETCD_CLUSTER_DOMAIN: apisix-etcd-headless.api-gw.svc.cluster.local
NEW_RELIC_METADATA_KUBERNETES_CLUSTER_NAME: dev
NEW_RELIC_METADATA_KUBERNETES_NODE_NAME: (v1:spec.nodeName)
NEW_RELIC_METADATA_KUBERNETES_NAMESPACE_NAME: api-gw (v1:metadata.namespace)
NEW_RELIC_METADATA_KUBERNETES_POD_NAME: apisix-etcd-0 (v1:metadata.name)
NEW_RELIC_METADATA_KUBERNETES_CONTAINER_NAME: etcd
NEW_RELIC_METADATA_KUBERNETES_CONTAINER_IMAGE_NAME: docker.io/bitnami/etcd:3.5.7-debian-11-r14
Mounts:
/bitnami/etcd from data (rw)
/opt/bitnami/etcd/certs/token/ from etcd-jwt-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h8wk9 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-apisix-etcd-0
ReadOnly: false
etcd-jwt-token:
Type: Secret (a volume populated by a Secret)
SecretName: apisix-etcd-jwt-token
Optional: false
kube-api-access-h8wk9:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 2s (x4811 over 17h) kubelet Back-off restarting failed container etcd in pod apisix-etcd-0_api-gw(607771a9-2674-40b5-a6d5-4be0971f0706)
I also deploy and operate APISIX on EKS, and I have run into the same issue for a long time.
In my experience, it is caused by etcd losing quorum when Karpenter (which we run in EKS) reschedules the etcd pods onto different nodes.
I don't know Karpenter in detail, so I can't give you any advice on that side, but to reduce the chance of quorum loss during node rearrangement I increased the etcd replicaCount, and the cluster has been stable recently, which resolved the issue for me.
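If it helps, here is a hedged sketch of the change I mean, assuming a Helm-managed release named apisix in namespace api-gw (the etcd.replicaCount key comes from the bundled Bitnami etcd subchart, so double-check it against your chart version):

```bash
# Scale etcd to 5 members so the cluster tolerates two lost members during node churn.
# Release name, namespace and the etcd.replicaCount key are assumptions; adjust as needed.
helm upgrade apisix apisix/apisix -n api-gw \
  --reuse-values \
  --set etcd.replicaCount=5
```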
Hi @kworkbee ,
I am using Cluster Autoscaler, not Karpenter. For high availability I have configured etcd to mount EFS through a StorageClass, so the volumes stay accessible from every availability zone.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-apisix
provisioner: efs.csi.aws.com
allowVolumeExpansion: true
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-xxxxxxxx
  directoryPerms: "777"
  gidRangeStart: "1000"
  gidRangeEnd: "2000"
reclaimPolicy: Retain
mountOptions:
  - tls
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-apisix
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-apisix
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxxxxxxx
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks/nodeGroupSize
              operator: In
              values:
                - BIG
            - key: eks/efs
              operator: In
              values:
                - indeed
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-apisix-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-apisix
  resources:
    requests:
      storage: 10Gi
Using the following commands temporarily resolved my issue:
kubectl delete pvc -l app.kubernetes.io/name=etcd -n <namespace>
kubectl delete statefulset apisix-etcd -n <namespace>
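After the PVCs and the StatefulSet are gone, the chart has to recreate them with empty data directories. A sketch assuming a Helm-managed release (if the chart is managed by Argo CD, as the targetRevision above suggests, re-syncing the Application does the same thing):

```bash
# Recreate the etcd StatefulSet and fresh PVCs from the chart.
# Release name, namespace and chart repo are assumptions; adjust to your setup.
helm repo add apisix https://charts.apiseven.com
helm repo update
helm upgrade --install apisix apisix/apisix -n <namespace> --reuse-values
```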
Please check https://github.com/apache/apisix/issues/11338#issuecomment-3162166205, thanks