Deploy releases PV, causing pod to find no available nodes
Operator version: 0.13.0
CH version: 21.2.5.5
I am facing an issue where the PV backing one pod is consistently released on deploys.
The CH install is a cluster of 1 shard, 3 replicas with:
- local-storage persistent volumes, on AWS SSDs
- Each replica should be on a separate node, via ShardAntiAffinity
- Each replica has the same templated setup
On every cluster deploy, all replicas restart fine except for replica 0-2-0, which cannot re-use its old PV because, for some reason, the PV has been released:
NAME            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                                             STORAGECLASS    REASON   AGE
local-ch-pv-0   1000Gi     RWO            Retain           Released   clickhouse-system/data-storage-vc-template-chi-health-pv-0-2-0   local-storage            23h
local-ch-pv     1000Gi     RWO            Retain           Bound      clickhouse-system/data-storage-vc-template-chi-health-pv-0-1-0   local-storage            5d18h
local-ch-pv-1   1000Gi     RWO            Retain           Bound      clickhouse-system/data-storage-vc-template-chi-health-pv-0-0-0   local-storage            5d3h
The 0-2-0 pod then gets stuck in Pending because the scheduler cannot find a node with a PV that is not already claimed.
According to the K8s docs, a PV should only be released when its PVC is deleted. However, the PVC was not deleted (note its large age). The PVC now waits for the pod to come up, so we are in a deadlock: Pod -> PVC -> PV (which is still claimed by that PVC).
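This is roughly how we checked the state (PVC/PV names are taken from the listing above; adjust for your namespace):
# Confirm the PVC still exists, then look at the PV's claimRef,
# which still points at that PVC even though the PV status is Released.
kubectl -n clickhouse-system get pvc data-storage-vc-template-chi-health-pv-0-2-0
kubectl get pv local-ch-pv-0 -o jsonpath='{.spec.claimRef}'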
PVC:
Normal WaitForFirstConsumer 3m23s (x462 over 118m) persistentvolume-controller waiting for first consumer to be created before binding
Pod 0-2-0:
Warning FailedScheduling 11s (x5 over 3m51s) default-scheduler .....3 node(s) didn't find available persistent volumes to bind....
This happens to replica 2 consistently. Replicas 0 and 1 deploy fine and their PVs stay bound. We can temporarily fix the issue by deleting the PV or by removing the claim from the PV.
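Removing the claim is just a manual patch along these lines (a sketch; the PV name comes from the listing above):
# Clear the stale claimRef so the Retained PV goes back to Available and can be re-bound by the new PVC.
kubectl patch pv local-ch-pv-0 --type=merge -p '{"spec":{"claimRef":null}}'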
Any ideas why this would be happening on every deploy?
Hi @chanadian, could you attach the full CHI (mask sensitive data)? Also, please consider trying the new version 0.14.0; it works more reliably with k8s resources.
Thanks, we will try 0.14.0. We tried 0.13.5 and the same issue was happening.
Some more debugging info:
- EC2 instance is i3.2xl (https://aws.amazon.com/ec2/instance-types/i3/)
- volume claim is for 400Gi and PV is 1000Gi
- 1 shard, 3 replicas with shard anti-affinity to keep each replica on a different host
Here's a sample of our CHI:
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "ssdtest"
  namespace: clickhouse-system
spec:
  defaults:
    templates:
      podTemplate: default
  configuration:
    settings:
      http_port: 8123
      tcp_port: 9000
      interserver_http_port: 9009
    zookeeper:
      nodes:
        - host: sampleZookeeperURI
          port: 2181
    clusters:
      - name: "repl"
        layout:
          shardsCount: 1
          replicasCount: 3
  templates:
    podTemplates:
      - name: default
        labels:
          app: clickhouse-health-ssd
        podDistribution:
          - type: ShardAntiAffinity
        spec:
          containers:
            - name: clickhouse
              image: sampleImageURI
              ports:
                - name: http
                  containerPort: 8123
                - name: tcp
                  containerPort: 9000
                - name: interserver
                  containerPort: 9009
              volumeMounts:
                - name: data-storage-vc-template
                  mountPath: /var/lib/clickhouse
          nodeSelector:
            app/node-type: clickhouse
          tolerations:
            - key: "app/restricted"
              operator: Equal
              value: clickhouse
              effect: NoSchedule
    volumeClaimTemplates:
      - name: data-storage-vc-template
        spec:
          accessModes:
            - ReadWriteOnce
          storageClassName: local-storage
          resources:
            requests:
              storage: 400Gi
Here's the PV yaml:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ch-pv
  namespace: clickhouse-system
spec:
  capacity:
    storage: 1000Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/lib/stateful/service
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: app/node-type
              operator: In
              values:
                - clickhouse
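For reference, the local-storage StorageClass is the usual no-provisioner class with delayed binding (consistent with the WaitForFirstConsumer event on the PVC above), something like:
# Sketch of the local-storage StorageClass assumed here; the actual definition may differ.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer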