
Deploy releases PV causing pod not to find available nodes

chanadian opened this issue • 2 comments

Operator version: 0.13.0, ClickHouse version: 21.2.5.5

I am facing an issue where, on every deploy, the PV for one pod consistently ends up Released.

The CH install is a cluster of 1 shard and 3 replicas, with:

  • local-storage persistent volumes on AWS SSD
  • Each replica should be on a separate node with ShardAntiAffinity
  • Each replica has the same templated setup

On every cluster deploy, all replicas restart fine except for replica 0-2-0, which cannot re-use its old PV because, for some reason, the PV has been released:

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                                                                             STORAGECLASS                            REASON   AGE
local-ch-pv-0                              1000Gi     RWO            Retain           Released    clickhouse-system/data-storage-vc-template-chi-health-pv-0-2-0                    local-storage                                    23h
local-ch-pv                                1000Gi     RWO            Retain           Bound       clickhouse-system/data-storage-vc-template-chi-health-pv-0-1-0                    local-storage                                    5d18h
local-ch-pv-1                              1000Gi     RWO            Retain           Bound       clickhouse-system/data-storage-vc-template-chi-health-pv-0-0-0                    local-storage                                    5d3h

The 0-2-0 pod is then stuck in Pending because the scheduler cannot find a node with a PV that is not already claimed.

According to the K8s docs, a PV should not be released unless its PVC is deleted. However, the PVC did not get deleted (its age is large). The PVC now waits for the pod to come up, so we end up in a deadlock: Pod -> PVC -> PV (already claimed by a PVC).

PVC:
Normal  WaitForFirstConsumer  3m23s (x462 over 118m)  persistentvolume-controller  waiting for first consumer to be created before binding
Pod 0-2-0:
Warning  FailedScheduling   11s (x5 over 3m51s)  default-scheduler   .....3 node(s) didn't find available persistent volumes to bind....
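
A generic way to confirm this state (illustrative kubectl, not operator-specific; names are taken from the listing above): a Released PV keeps its spec.claimRef, and if the UID recorded there no longer matches the live PVC, the volume cannot re-bind on its own.

# UID recorded in the Released PV's claimRef
kubectl get pv local-ch-pv-0 \
  -o jsonpath='{.spec.claimRef.name} {.spec.claimRef.uid}{"\n"}'

# UID of the PVC that is still waiting for the pod
kubectl -n clickhouse-system get pvc data-storage-vc-template-chi-health-pv-0-2-0 \
  -o jsonpath='{.metadata.name} {.metadata.uid}{"\n"}'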

This happens to replica 2 consistently; replicas 0 and 1 deploy fine and their PVs stay bound. We can temporarily fix the issue by deleting the PV or by removing the claim from the PV.
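
Concretely, "removing the claim" means clearing spec.claimRef on the PV so it goes from Released back to Available (illustrative command, PV name taken from the listing above):

# Drop the stale claim reference; the volume can then be bound again
kubectl patch pv local-ch-pv-0 --type json \
  -p '[{"op": "remove", "path": "/spec/claimRef"}]'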

Any ideas why this would be happening on every deploy?

chanadian · Mar 10 '21 06:03

Hi @chanadian, could you attach the full CHI (mask sensitive data)? Also, please consider trying the new version 0.14.0; it works more reliably with k8s resources.

alex-zaitsev · Apr 27 '21 07:04

Thanks, we will try 0.14.0. We tried 0.13.5 and the same issue was happening.

Some more debugging info:

  • EC2 instance type is i3.2xlarge (https://aws.amazon.com/ec2/instance-types/i3/)
  • the volume claim is for 400Gi and the PV is 1000Gi
  • 1 shard, 3 replicas with shard anti-affinity to keep each replica on a different host

Here's a sample of our CHI:

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "ssdtest"
  namespace: clickhouse-system
spec:
  defaults:
    templates:
      podTemplate: default  
  configuration:
    settings:
      http_port: 8123
      tcp_port: 9000
      interserver_http_port: 9009
    zookeeper:
      nodes:
        - host: sampleZookeeperURI
          port: 2181
    clusters:
      - name: "repl"
        layout:
          shardsCount: 1
          replicasCount: 3
  templates:
    podTemplates:
      - name: default
        labels:
            app: clickhouse-health-ssd
        podDistribution:
          - type: ShardAntiAffinity
        spec:
          containers:
            - name: clickhouse
              image: sampleImageURI
              ports:
                - name: http
                  containerPort: 8123
                - name: tcp
                  containerPort: 9000
                - name: interserver
                  containerPort: 9009
              volumeMounts:
                - name: data-storage-vc-template
                  mountPath: /var/lib/clickhouse
          nodeSelector:
            app/node-type: clickhouse
          tolerations:
            - key: "app/restricted"
              operator: Equal
              value: clickhouse
              effect: NoSchedule
    volumeClaimTemplates:
      - name: data-storage-vc-template
        spec:
          accessModes:
            - ReadWriteOnce
          storageClassName: local-storage
          resources:
            requests:
              storage: 400Gi
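
For reference, the local-storage class is not shown here; given the statically created local PVs and the WaitForFirstConsumer event above, it is presumably along these lines (sketch, reconstructed rather than copied from the cluster):

# Assumed StorageClass for the local PVs (reconstructed, not from the cluster)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer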

Here's the PV yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ch-pv
  namespace: clickhouse-system
spec:
  capacity:
    storage: 1000Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/lib/stateful/service
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: app/node-type
          operator: In
          values:
          - clickhouse

chanadian · Apr 27 '21 18:04