Pilosa k8s pod error "bind: cannot assign requested address"

Open • AnjanaAK opened this issue 4 years ago • 1 comment

What's going wrong?

The Pilosa pod in a k8s cluster enters the CrashLoopBackOff state.

What was expected?

The Pilosa pod should remain in the Running state without any issues.

Steps to reproduce the behavior

Create a k8s Deployment and Service with the following YAML files (these are the relevant files from my Helm chart, with values rendered):

# Source: pilosa/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pilosa
  labels:
    helm.sh/chart: pilosa-0.1.0
    app.kubernetes.io/name: pilosa
    app.kubernetes.io/instance: RELEASE-NAME
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: pilosa
      app.kubernetes.io/instance: RELEASE-NAME
  template:
    metadata:
      labels:
        app.kubernetes.io/name: pilosa
        app.kubernetes.io/instance: RELEASE-NAME
    spec:
      serviceAccountName: default
      securityContext:
        {}
      initContainers:
        - command:
          - /bin/sh
          - -c
          - |
            sysctl -w net.ipv4.tcp_keepalive_time=600
            sysctl -w net.ipv4.tcp_keepalive_intvl=60
            sysctl -w net.ipv4.tcp_keepalive_probes=3
          image: busybox
          name: init-sysctl
          securityContext:
            privileged: true
      containers:
        - name: pilosa
          securityContext:
            {}
          image: "pilosa/pilosa:v1.4.0"
          imagePullPolicy: IfNotPresent
          args:
            - server
            - --data-dir
            - /data
            - --max-writes-per-request
            - "20000"
            - --bind
            - http://pilosa:10101
            - --cluster.coordinator=true
            - --gossip.seeds=pilosa:14000
            - --handler.allowed-origins="*"
          ports:
            - name: http
              containerPort: 10101
              protocol: TCP
          livenessProbe:
            tcpSocket:
              port: http
          readinessProbe:
            tcpSocket:
              port: http
          volumeMounts:
            - name: "pilosa-pv-storage"
              mountPath: /data
          resources:
            {}
      volumes:
      - name: pilosa-pv-storage
        persistentVolumeClaim:
          claimName: pilosa-pv-claim

Service yaml:

# Source: pilosa/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: pilosa
  labels:
    helm.sh/chart: pilosa-0.1.0
    app.kubernetes.io/name: pilosa
    app.kubernetes.io/instance: RELEASE-NAME
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  ports:
  - port: 10101
    targetPort:  10101
    protocol: TCP
    name: http
  selector:
    app.kubernetes.io/name: pilosa
    app.kubernetes.io/instance: RELEASE-NAME

Check the pod status:

$ kubectl get pods
NAME                      READY   STATUS             RESTARTS   AGE
pilosa-69574564bc-5f25l   0/1     CrashLoopBackOff   2          71s

Check service:

$ kubectl get svc

NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
pilosa       ClusterIP   10.96.141.79   <none>        10101/TCP   9m14s

Check the pilosa logs:

$ kubectl logs pilosa-69574564bc-5f25l

2021/02/24 14:19:45 Pilosa v1.4.0, build time 2019-09-17T23:29:35+0000
Error: running server: setting up server: getting listener: net.Listen: listen tcp 10.96.141.79:10101: bind: cannot assign requested address
Usage:
  pilosa server [flags]

Flags:
      --advertise string                     Address to advertise externally.
      --anti-entropy.interval duration       Interval at which to run anti-entropy routine. (default 10m0s)
  -b, --bind string                          Default URI on which pilosa should listen. (default ":10101")
      --cluster.coordinator                  Host that will act as cluster coordinator during startup and resizing.
      --cluster.disabled                     Disabled multi-node cluster communication (used for testing)
      --cluster.hosts strings                Comma separated list of hosts in cluster. Only used for testing.
      --cluster.long-query-time duration     Duration that will trigger log and stat messages for slow queries. (default 1m0s)
      --cluster.replicas int                 Number of hosts each piece of data should be stored on. (default 1)
  -d, --data-dir string                      Directory to store pilosa data files. (default "~/.pilosa")
      --gossip.advertise-host string         Host on which memberlist should advertise.
      --gossip.advertise-port string         Port on which memberlist should advertise.
      --gossip.interval duration             Interval between sending messages that need to be gossiped that haven't piggybacked on probing messages. (default 200ms)
running server: setting up server: getting listener: net.Listen: listen tcp 10.96.141.79:10101: bind: cannot assign requested address
      --gossip.key string                    The path to file of the encryption key for gossip. The contents of the file should be either 16, 24, or 32 bytes to select AES-128, AES-192, or AES-256.
      --gossip.nodes int                     Number of random nodes to send gossip messages to per GossipInterval. (default 3)
      --gossip.port string                   Port to which pilosa should bind for internal state sharing. (default "14000")
      --gossip.probe-interval duration       Interval between random node probes. (default 1s)
      --gossip.probe-timeout duration        Timeout to wait for an ack from a probed node before assuming it is unhealthy. (default 500ms)
      --gossip.push-pull-interval duration   Interval between complete state syncs. (default 30s)
      --gossip.seeds strings                 Host with which to seed the gossip membership.
      --gossip.stream-timeout duration       Timeout for establishing a stream connection with a remote node for a full state sync. (default 10s)
      --gossip.suspicion-mult int            Multiplier for determining the time an inaccessible node is considered suspect before declaring it dead. (default 4)
      --gossip.to-the-dead-time duration     Interval after which a node has died that we will still try to gossip to it. (default 30s)
      --handler.allowed-origins strings      Comma separated list of allowed origin URIs (for CORS/WebUI).
  -h, --help                                 help for server
      --log-path string                      Log path
      --max-file-count uint                  Soft limit on the maximum number of fragment files Pilosa keeps open simultaneously. (default 1000000)
      --max-map-count uint                   Limits the maximum number of active mmaps. Pilosa will fall back to reading files once this is exhausted. Set below your system's vm.max_map_count. (default 1000000)
      --max-writes-per-request int           Number of write commands per request. (default 5000)
      --metric.diagnostics                   Enabled diagnostics reporting. (default true)
      --metric.host string                   URI to send metrics when metric.service is statsd.
      --metric.poll-interval duration        Polling interval metrics.
      --metric.service string                Where to send stats: can be expvar (in-memory served at /debug/vars), statsd or none. (default "none")
      --profile.block-rate int               Sampling rate for goroutine blocking profiler. One sample per <rate> ns. (default 10000000)
      --profile.mutex-fraction int           Sampling fraction for mutex contention profiling. Sample 1/<rate> of events. (default 100)
      --tls.certificate string               TLS certificate path (usually has the .crt or .pem extension
      --tls.key string                       TLS certificate key path (usually has the .key extension
      --tls.skip-verify                      Skip TLS certificate verification (not secure)
      --tracing.agent-host-port string       Jaeger agent host:port.
      --tracing.sampler-param float          Jaeger sampler parameter. (default 0.001)
      --tracing.sampler-type string          Jaeger sampler type or 'off' to disable tracing completely. (default "remote")
      --translation.map-size int             Size in bytes of mmap to allocate for key translation.
      --translation.primary-url string       DEPRECATED: URL for primary translation node for replication.
      --verbose                              Enable verbose logging

Global Flags:
  -c, --config string   Configuration file to read from.

In the logs, you can see the error: Error: running server: setting up server: getting listener: net.Listen: listen tcp 10.96.141.79:10101: bind: cannot assign requested address

Information about your environment (OS/architecture, CPU, RAM, cluster/solo, configuration, etc.)

It is a k8s cluster in OCI, created with the 'quick create' option.

Kubernetes Version :  v1.18.10
Shape : VM.Standard1.4
Image Name : Oracle-Linux-7.9-2020.11.10-1
Total Worker Nodes : 3

Extra Info

If I change the bind argument to --bind http://0.0.0.0:10101, the pod reaches the Running state, but the log still shows an error:

$ kubectl logs pilosa-555589659f-tms9c

2021/02/24 14:27:51 Pilosa v1.4.0, build time 2019-09-17T23:29:35+0000
2021/02/24 14:27:51 load NodeID: /data/.id
2021/02/24 14:28:03 retrying after error: 1 error occurred:
        * Failed to join 10.96.141.79: dial tcp 10.96.141.79:14000: i/o timeout
2021/02/24 14:28:15 retrying after error: 1 error occurred:
        * Failed to join 10.96.141.79: dial tcp 10.96.141.79:14000: i/o timeout

If I also change the gossip.seeds argument to --gossip.seeds=0.0.0.0:14000, this error disappears as well. The log is now:

$ kubectl logs pilosa-75b95f695-q6sh6

2021/02/24 14:43:07 Pilosa v1.4.0, build time 2019-09-17T23:29:35+0000
2021/02/24 14:43:07 load NodeID: /data/.id
2021/02/24 14:43:07 open server
2021/02/24 14:43:07 open holder path: /data
2021/02/24 14:43:07 opening index: lost+found
2021/02/24 14:43:07 ERROR opening index: lost+found, err=validating name: 'lost+found': invalid index or field name, must match [a-z][a-z0-9_-]* and contain at most 64 characters
2021/02/24 14:43:07 open holder: complete
2021/02/24 14:43:07 received state READY (4a53d79d-cd0d-45cf-8bc8-94a4d7e4aca8)
2021/02/24 14:43:07 change cluster state from STARTING to NORMAL on 4a53d79d-cd0d-45cf-8bc8-94a4d7e4aca8
2021/02/24 14:43:07 listening as http://0.0.0.0:10101
2021/02/24 14:43:07 diagnostics disabled
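
For reference, the container args with both changes applied would look roughly like the fragment below. It is a sketch reconstructed from the Deployment above and the two flag changes just described; nothing else in the manifest is assumed to differ.

```yaml
# Container args after changing --bind and --gossip.seeds as described above.
# All other fields of the Deployment are unchanged.
args:
  - server
  - --data-dir
  - /data
  - --max-writes-per-request
  - "20000"
  - --bind
  - http://0.0.0.0:10101          # was http://pilosa:10101
  - --cluster.coordinator=true
  - --gossip.seeds=0.0.0.0:14000  # was pilosa:14000
  - --handler.allowed-origins="*"
```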

Why does binding to the matching Service's IP not work? Is there any problem in a k8s cluster if I specify 0.0.0.0, which tells Pilosa to listen on all available interfaces?

Feb 24 '21 14:02 AnjanaAK

I can also see that --bind http://localhost:10101 and --gossip.seeds=localhost:14000 (which is the default value) work fine. But then the configured liveness and readiness probes fail with the following error, and the pod restarts repeatedly.

  Normal   Created    57s (x2 over 89s)  kubelet            Created container pilosa
  Normal   Started    57s (x2 over 88s)  kubelet            Started container pilosa
  Warning  Unhealthy  29s (x6 over 79s)  kubelet            Liveness probe failed: dial tcp 10.1.0.32:10101: connect: connection refused
  Normal   Killing    29s (x2 over 59s)  kubelet            Container pilosa failed liveness probe, will be restarted
  Normal   Pulled     28s (x3 over 90s)  kubelet            Container image "pilosa/pilosa:v1.4.0" already present on machine
  Warning  Unhealthy  28s (x6 over 78s)  kubelet            Readiness probe failed: dial tcp 10.1.0.32:10101: connect: connection refused

This probe connection problem doesn't appear if bind and gossip.seeds are configured to 0.0.0.0.

Mar 04 '21 12:03 AnjanaAK
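
The probe events above show the kubelet dialing the pod IP (10.1.0.32), not localhost and not the Service's ClusterIP. One possible alternative to 0.0.0.0, sketched here as an untested assumption, is to inject the pod IP through the Kubernetes downward API and pass it to --bind. The POD_IP variable name is hypothetical, and whether Pilosa v1.4.0 behaves correctly with these values has not been verified in this issue.

```yaml
# Sketch only: bind Pilosa to the pod's own IP instead of the Service name.
# POD_IP is a hypothetical variable name; status.podIP is filled in by Kubernetes.
containers:
  - name: pilosa
    image: "pilosa/pilosa:v1.4.0"
    env:
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
    args:
      - server
      - --data-dir
      - /data
      - --bind
      - http://$(POD_IP):10101          # $(VAR) is expanded by Kubernetes in args
      - --gossip.seeds=$(POD_IP):14000  # assumption: single node seeding itself
```

With this shape the tcpSocket probes on port 10101 would target the same address the server listens on; the remaining args from the original Deployment would be carried over unchanged.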