pulsar-helm-chart

Node does not exist

Open nise-wg2 opened this issue 2 years ago • 3 comments

Describe the bug

The cluster never starts; some pods remain in the Init state:

NAME                      READY   STATUS     RESTARTS   AGE
pulsar-mini-bookie-0      0/1     Running    0          47s
pulsar-mini-broker-0      0/1     Init:0/2   0          47s
pulsar-mini-proxy-0       0/1     Init:0/2   0          47s
pulsar-mini-toolset-0     1/1     Running    0          47s
pulsar-mini-zookeeper-0   1/1     Running    0          47s

Error in init:

pulsar-mini-proxy-0 wait-zookeeper-ready WATCHER::
pulsar-mini-proxy-0 wait-zookeeper-ready
pulsar-mini-proxy-0 wait-zookeeper-ready WatchedEvent state:SyncConnected type:None path:null zxid: -1
pulsar-mini-proxy-0 wait-zookeeper-ready Node does not exist: /admin/clusters/pulsar-mini
pulsar-mini-proxy-0 wait-zookeeper-ready 2024-02-22T14:39:47,542+0000 [main] ERROR org.apache.zookeeper.util.ServiceUtils - Exiting JVM with code 1
pulsar-mini-proxy-0 wait-zookeeper-ready Connecting to pulsar-mini-zookeeper
pulsar-mini-proxy-0 wait-zookeeper-ready 2024-02-22T14:39:54,142+0000 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.9.1-40487256d9b9f274484798758699e49c26d91cda, built on 2023-10-02 15:06 UTC
pulsar-mini-proxy-0 wait-zookeeper-ready 2024-02-22T14:39:54,145+0000 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:host.name=pulsar-mini-proxy-0.pulsar-mini-proxy.cdr.svc.cluster.local
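
The log above suggests the init container keeps retrying a ZooKeeper lookup: one attempt exits with code 1 at 14:39:47 and a new connection starts at 14:39:54. A rough sketch of that polling pattern, for reproducing the check by hand, could look like this. `retry_until` is a hypothetical helper, and the `bin/pulsar zookeeper-shell` invocation in the comment is an assumption about what the container runs, not something taken from the chart.

```shell
# Retry a command until it succeeds or the attempts run out.
# usage: retry_until <max_attempts> <sleep_seconds> <command...>
# Note: plain POSIX sh, so max, delay and n are not function-local.
retry_until() {
  max=$1
  delay=$2
  shift 2
  n=1
  until "$@"; do
    # Give up once the allowed number of attempts has been used.
    [ "$n" -ge "$max" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Assumed invocation from inside a Pulsar container (not run here):
#   retry_until 20 3 bin/pulsar zookeeper-shell \
#     -server pulsar-mini-zookeeper get /admin/clusters/pulsar-mini
```

If the node is never created, as in this report, a loop like this only ends when the attempt limit is reached, which would match pods sitting in Init:0/2.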

values file mini-values.yaml:

---

components:
  ## pulsar-manager: disable
  pulsar_manager: false
  ## zookeeper
  zookeeper: true
  ## bookkeeper
  bookkeeper: true
  ## Disable bookkeeper - autorecovery
  autorecovery: false
  ## broker
  broker: true
  ## functions
  functions: true
  ## proxy
  proxy: true
  ## toolset
  toolset: true

## disable monitoring stack
kube-prometheus-stack:
  enabled: false
  prometheusOperator:
    enabled: false
  grafana:
    enabled: false
  alertmanager:
    enabled: false
  prometheus:
    enabled: false

# Disable persistence
volumes:
  persistence: false

zookeeper:
  replicaCount: 1
  externalZookeeperServerList: ""
  # Disable pod monitor since we're disabling CRD installation
  podMonitor:
    enabled: false

bookkeeper:
  replicaCount: 1
  configData:
    # minimal memory use for bookkeeper
    # https://bookkeeper.apache.org/docs/reference/config#db-ledger-storage-settings
    dbStorage_writeCacheMaxSizeMb: "32"
    dbStorage_readAheadCacheMaxSizeMb: "32"
    dbStorage_rocksDB_writeBufferSizeMB: "8"
    dbStorage_rocksDB_blockCacheSize: "8388608"
  # Disable pod monitor since we're disabling CRD installation
  podMonitor:
    enabled: false

broker:
  replicaCount: 1
  configData:
    ## Enable `autoSkipNonRecoverableData` since bookkeeper is running
    autoSkipNonRecoverableData: "true"
    # storage settings
    managedLedgerDefaultEnsembleSize: "1"
    managedLedgerDefaultWriteQuorum: "1"
    managedLedgerDefaultAckQuorum: "1"
  podMonitor:
    enabled: false

proxy:
  replicaCount: 1
  podMonitor:
    enabled: false

Then:

helm install --values mini-values.yaml --namespace cdr pulsar-mini apache/pulsar

Expected behavior

The cluster starts.

Additional context

Running ./scripts/pulsar/prepare_helm_release.sh -k pulsar-mini -n cdr before the helm install does not change the error.

✗ helm ls
NAME          NAMESPACE   REVISION   UPDATED                                STATUS     CHART          APP VERSION
pulsar-mini   cdr         1          2024-02-22 15:54:31.147885 +0100 CET   deployed   pulsar-3.2.0   3.0.2

✗ helm version
version.BuildInfo{Version:"v3.14.0", GitCommit:"3fc9f4b2638e76f26739cd77c7017139be81d0ea", GitTreeState:"clean", GoVersion:"go1.21.6"}

K8s cluster: v1.21.13

nise-wg2 avatar Feb 22 '24 15:02 nise-wg2

@nise-wg2 I don't see the init jobs running? Did they deploy OK?

frankjkelly avatar Feb 22 '24 15:02 frankjkelly

✗ k get jobs -l "app=pulsar"
NAME                      COMPLETIONS   DURATION   AGE
pulsar-mini-bookie-init   1/1           15s        55s
pulsar-mini-pulsar-init   0/1                      55s

And it fails because:

Warning FailedCreate 5s (x4 over 75s) job-controller Error creating: admission webhook "validation.gatekeeper.sh" denied the request: [wgtwo-reg] container <pulsar-mini-pulsar-init> has an invalid image repo <apachepulsar/pulsar-all:3.0.2>,

I had missed that.
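
For anyone hitting the same symptom, a quick way to spot an init job that never completed is to filter the `kubectl get jobs` table on its COMPLETIONS column. This is only a sketch: `incomplete_jobs` is a made-up helper name, and it assumes the default column layout shown above.

```shell
# Print the names of Jobs whose COMPLETIONS column (e.g. "0/1") shows
# fewer succeeded pods than desired. Reads `kubectl get jobs` output on stdin.
incomplete_jobs() {
  awk 'NR > 1 { split($2, c, "/"); if (c[1] + 0 < c[2] + 0) print $1 }'
}

# Usage against a live cluster (namespace and label selector from this thread):
#   kubectl get jobs -n cdr -l "app=pulsar" | incomplete_jobs
```

`kubectl describe job <name>` on whatever it prints then shows events such as the FailedCreate warning above.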

nise-wg2 avatar Feb 22 '24 15:02 nise-wg2

So there is a bug in the Helm chart: the image used by this Job is not present in the list of images at https://github.com/apache/pulsar-helm-chart/blob/master/charts/pulsar/values.yaml#L137.

It requires an override in the values, like:

pulsar_metadata:
  image:
    repository: registry.wgtwo.com/reg/cdr/testing/apachepulsar/pulsar-all
    tag: 3.1.2
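
One way to apply this, sketched here, is to keep the override in its own values file and pass it alongside mini-values.yaml. The file name `init-image-values.yaml` and the `install_pulsar_mini` wrapper are illustrative assumptions; the release name, namespace, and chart are the ones from the original install command.

```shell
# Write the image override to a separate values file (file name is an assumption).
cat > init-image-values.yaml <<'EOF'
pulsar_metadata:
  image:
    repository: registry.wgtwo.com/reg/cdr/testing/apachepulsar/pulsar-all
    tag: 3.1.2
EOF

# Install with both files; when several --values files are given,
# the rightmost one takes precedence on conflicting keys.
# helm install refuses to reuse a release name that is still in use,
# so an earlier failed attempt has to be removed first
# (helm uninstall pulsar-mini --namespace cdr).
install_pulsar_mini() {
  helm install pulsar-mini apache/pulsar \
    --namespace cdr \
    --values mini-values.yaml \
    --values init-image-values.yaml
}
```

Running `install_pulsar_mini` afterwards performs the actual install. The same keys could equally be added straight to mini-values.yaml.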

Then it seems to start.

 k get pods -w -l "app=pulsar"
NAME                      READY   STATUS    RESTARTS   AGE
pulsar-mini-bookie-0      1/1     Running   0          115s
pulsar-mini-broker-0      1/1     Running   0          115s
pulsar-mini-proxy-0       0/1     Running   0          115s
pulsar-mini-toolset-0     1/1     Running   0          115s
pulsar-mini-zookeeper-0   1/1     Running   0          115s

nise-wg2 avatar Feb 22 '24 15:02 nise-wg2