
Failed to deploy on OCP 3.11

hongkailiu opened this issue on Oct 25, 2018 • 8 comments

  1. Need SCC setup for the hostPath plugin: followed this doc to work around it (see the sketch after the diff below).

  2. The OCP cluster is on AWS: the node names look like ip-172-31-10-185.us-west-2.compute.internal, which causes problems for the StatefulSet. Created bz1643191.

To work around this issue, use a simpler name for the StatefulSet:

# git diff deploy-gcs.yml tasks/create-gd2-manifests.yml templates/gcs-manifests/gcs-gd2.yml.j2
diff --git a/deploy/deploy-gcs.yml b/deploy/deploy-gcs.yml
index 5efd085..fc0bac1 100644
--- a/deploy/deploy-gcs.yml
+++ b/deploy/deploy-gcs.yml
@@ -42,8 +42,10 @@
 
         - name: GCS Pre | Manifests | Create GD2 manifests
           include_tasks: tasks/create-gd2-manifests.yml
           loop: "{{ groups['gcs-node'] }}"
           loop_control:
+            index_var: index
             loop_var: gcs_node
 
   post_tasks:
diff --git a/deploy/tasks/create-gd2-manifests.yml b/deploy/tasks/create-gd2-manifests.yml
index d9a2d2d..4c015ef 100644
--- a/deploy/tasks/create-gd2-manifests.yml
+++ b/deploy/tasks/create-gd2-manifests.yml
@@ -3,6 +3,7 @@
 - name: GCS Pre | Manifests | Create GD2 manifests for {{ gcs_node }} | Set fact kube_hostname
   set_fact:
     kube_hostname: "{{ gcs_node }}"
+    gcs_node_index: "{{ index }}"
 
 - name: GCS Pre | Manifests | Create GD2 manifests for {{ gcs_node }} | Create gcs-gd2-{{ gcs_node }}.yml
   template:
diff --git a/deploy/templates/gcs-manifests/gcs-gd2.yml.j2 b/deploy/templates/gcs-manifests/gcs-gd2.yml.j2
index fe48b35..3376b11 100644
--- a/deploy/templates/gcs-manifests/gcs-gd2.yml.j2
+++ b/deploy/templates/gcs-manifests/gcs-gd2.yml.j2
@@ -2,7 +2,7 @@
 kind: StatefulSet
 apiVersion: apps/v1
 metadata:
-  name: gluster-{{ kube_hostname }}
+  name: gluster-{{ gcs_node_index }}
   namespace: {{ gcs_namespace }}
   labels:
     app.kubernetes.io/part-of: gcs
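
For reference, the SCC piece from item 1 amounts to letting the gluster pods use hostPath volumes. A sketch only, assuming an SCC named hostpath (matching the openshift.io/scc=hostpath annotation in the pod description below) granted to the default service account in the gcs namespace; the SCC itself, with allowHostDirVolumePlugin: true, is created as described in that doc:

# oc adm policy add-scc-to-user hostpath -z default -n gcs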

  3. Then it failed on "Wait for glusterd2-cluster to become ready":
TASK [GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready] **********************************************************
Thursday 25 October 2018  18:39:50 +0000 (0:00:00.083)       0:00:54.274 ****** 
FAILED - RETRYING: GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready (50 retries left).

# oc get pod
NAME                             READY     STATUS    RESTARTS   AGE
etcd-6jmbmv6sw7                  1/1       Running   0          23m
etcd-mvwq6c2w6f                  1/1       Running   0          23m
etcd-n92rtb9wfr                  1/1       Running   0          23m
etcd-operator-54bbdfc55d-mdvd9   1/1       Running   0          24m
gluster-0-0                      1/1       Running   7          23m
gluster-1-0                      1/1       Running   7          23m
gluster-2-0                      1/1       Running   7          23m

# oc describe pod  gluster-1-0
Name:               gluster-1-0
Namespace:          gcs
Priority:           0
PriorityClassName:  <none>
Node:               ip-172-31-59-125.us-west-2.compute.internal/172.31.59.125
Start Time:         Thu, 25 Oct 2018 18:39:49 +0000
Labels:             app.kubernetes.io/component=glusterfs
                    app.kubernetes.io/name=glusterd2
                    app.kubernetes.io/part-of=gcs
                    controller-revision-hash=gluster-1-598d756667
                    statefulset.kubernetes.io/pod-name=gluster-1-0
Annotations:        openshift.io/scc=hostpath
Status:             Running
IP:                 172.21.0.15
Controlled By:      StatefulSet/gluster-1
Containers:
  glusterd2:
    Container ID:   docker://0433446ecbd7a25d5aa9f51f0bd5c3226090850b18d0e63d58d07e47c6fdd039
    Image:          docker.io/gluster/glusterd2-nightly:20180920
    Image ID:       docker-pullable://docker.io/gluster/glusterd2-nightly@sha256:7013c3de3ed2c8b9c380c58b7c331dfc70df39fe13faea653b25034545971072
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 25 Oct 2018 19:00:48 +0000
      Finished:     Thu, 25 Oct 2018 19:03:48 +0000
    Ready:          False
    Restart Count:  7
    Liveness:       http-get http://:24007/ping delay=10s timeout=1s period=60s #success=1 #failure=3
    Environment:
      GD2_ETCDENDPOINTS:  http://etcd-client.gcs:2379
      GD2_CLUSTER_ID:     dd68cd6b-b828-4c13-86a4-35c492b5d4c2
      GD2_CLIENTADDRESS:  gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24007
      GD2_PEERADDRESS:    gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24008
      GD2_RESTAUTH:       false
    Mounts:
      /dev from gluster-dev (rw)
      /run/lvm from gluster-lvm (rw)
      /sys/fs/cgroup from gluster-cgroup (ro)
      /usr/lib/modules from gluster-kmods (ro)
      /var/lib/glusterd2 from glusterd2-statedir (rw)
      /var/log/glusterd2 from glusterd2-logdir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-hvj7w (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  gluster-dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  
  gluster-cgroup:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:  
  gluster-lvm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/lvm
    HostPathType:  
  gluster-kmods:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/lib/modules
    HostPathType:  
  glusterd2-statedir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/glusterd2
    HostPathType:  DirectoryOrCreate
  glusterd2-logdir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/glusterd2
    HostPathType:  DirectoryOrCreate
  default-token-hvj7w:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-hvj7w
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/compute=true
Tolerations:     <none>
Events:
  Type     Reason     Age                From                                                  Message
  ----     ------     ----               ----                                                  -------
  Normal   Scheduled  25m                default-scheduler                                     Successfully assigned gcs/gluster-1-0 to ip-172-31-59-125.us-west-2.compute.internal
  Normal   Pulling    25m                kubelet, ip-172-31-59-125.us-west-2.compute.internal  pulling image "docker.io/gluster/glusterd2-nightly:20180920"
  Normal   Pulled     24m                kubelet, ip-172-31-59-125.us-west-2.compute.internal  Successfully pulled image "docker.io/gluster/glusterd2-nightly:20180920"
  Normal   Created    16m (x4 over 24m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal  Created container
  Normal   Started    16m (x4 over 24m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal  Started container
  Normal   Killing    16m (x3 over 22m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal  Killing container with id docker://glusterd2:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Pulled     16m (x3 over 22m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal  Container image "docker.io/gluster/glusterd2-nightly:20180920" already present on machine
  Warning  Unhealthy  4m (x21 over 24m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal  Liveness probe failed: Get http://172.21.0.15:24007/ping: dial tcp 172.21.0.15:24007: connect: connection refused


hongkailiu commented Oct 25 '18 19:10

Forgot to mention: I also needed to override the path of kubectl:

# ansible-playbook -i ~/aaa/gcs.yml deploy-gcs.yml --extra-vars "kubectl=/usr/bin/kubectl" -v

hongkailiu commented Oct 25 '18 19:10

Nice. Thanks for trying it out. :+1: Since you've changed the SS name, I think you'll also need to change GD2_CLIENTADDRESS and GD2_PEERADDRESS. They should end up w/ the name of the pod that gets spawned by the SS that you renamed... i.e., <the_ss_name>-0. I think this will fix the problem of gd2 being unhealthy.

JohnStrunk commented Oct 25 '18 19:10

@JohnStrunk I changed the name using the new var; the old var kube_hostname was kept intact:

-  name: gluster-{{ kube_hostname }}
+  name: gluster-{{ gcs_node_index }}

# oc get sts gluster-ip-172-31-47-15-us-west-2-compute-internal -o yaml | grep ESS -A1 | grep -v Set
  creationTimestamp: 2018-10-25T19:31:43Z
--
        - name: GD2_CLIENTADDRESS
          value: gluster-ip-172-31-47-15.us-west-2.compute.internal-0.glusterd2.gcs:24007
        - name: GD2_PEERADDRESS
          value: gluster-ip-172-31-47-15.us-west-2.compute.internal-0.glusterd2.gcs:24008

So that should not be the cause. Did I miss something?

hongkailiu commented Oct 25 '18 19:10

From above:

   Environment:
      GD2_ETCDENDPOINTS:  http://etcd-client.gcs:2379
      GD2_CLUSTER_ID:     dd68cd6b-b828-4c13-86a4-35c492b5d4c2
      GD2_CLIENTADDRESS:  gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24007
      GD2_PEERADDRESS:    gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24008
      GD2_RESTAUTH:       false

Client and peer addresses need to aim at the pod's address. I think you need to update those ENV vars in the template in addition to changing the name field. The pod's name is now gluster-1-0, but client & peer still point to the old name.
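
For example, the env section of gcs-gd2.yml.j2 could end up along these lines; a sketch only, reusing the gcs_node_index fact from the diff above and the existing gcs_namespace variable, and assuming the headless service stays glusterd2 (not the exact template):

        - name: GD2_CLIENTADDRESS
          value: "gluster-{{ gcs_node_index }}-0.glusterd2.{{ gcs_namespace }}:24007"
        - name: GD2_PEERADDRESS
          value: "gluster-{{ gcs_node_index }}-0.glusterd2.{{ gcs_namespace }}:24008"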

JohnStrunk commented Oct 25 '18 19:10

Yes. Now I understand.

New issue:

TASK [GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready] **********************************************************
Thursday 25 October 2018  19:51:00 +0000 (0:00:00.097)       0:00:54.215 ****** 
FAILED - RETRYING: GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready (50 retries left).
fatal: [master]: FAILED! => {"msg": "The conditional check 'result.status is defined and (result.status == 200 and result.json|length == groups['kube-node']|length)' failed. The error was: error while evaluating conditional (result.status is defined and (result.status == 200 and result.json|length == groups['kube-node']|length)): 'dict object' has no attribute 'kube-node'"}

My inv file:

# cat ~/aaa/gcs.yml 
master ansible_host=ip-172-31-43-164.us-west-2.compute.internal

ip-172-31-47-15.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'
ip-172-31-59-125.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'
ip-172-31-60-208.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'

[kube-master]
master

[gcs-node]
ip-172-31-47-15.us-west-2.compute.internal
ip-172-31-59-125.us-west-2.compute.internal
ip-172-31-60-208.us-west-2.compute.internal

I think I know the problem. It might be groups['gcs-node'] instead of groups['kube-node']. Will know soon.
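
For illustration, the wait task's retry condition would then presumably read as follows; a sketch based on the conditional quoted in the error above, not the exact task:

    until: result.status is defined and
           (result.status == 200 and result.json|length == groups['gcs-node']|length)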

hongkailiu commented Oct 25 '18 20:10

Now a new issue:

TASK [GCS | GD2 Cluster | Add devices | Set facts] *****************************************************************************
Thursday 25 October 2018  20:15:04 +0000 (0:00:00.041)       0:01:30.000 ****** 
ok: [master] => {"ansible_facts": {"kube_hostname": "ip"}, "changed": false}

TASK [GCS | GD2 Cluster | Add devices | Add devices for ip] ********************************************************************
Thursday 25 October 2018  20:15:05 +0000 (0:00:00.115)       0:01:30.115 ****** 
fatal: [master]: FAILED! => {"msg": "u\"hostvars['ip']\" is undefined"}

Will continue tomorrow.

hongkailiu commented Oct 25 '18 20:10

Found the problem again: the playbook places strong restrictions on the node name format.

https://github.com/gluster/gcs/blob/master/deploy/tasks/add-devices-to-peer.yml#L5

This assumes that index 1 holds the node name after splitting. Will think about a solution tomorrow.
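
For illustration, the assumption seems to be roughly the following; a hypothetical reconstruction, not the exact line, but it is consistent with the kube_hostname: "ip" fact in the output above:

# something like: kube_hostname: "{{ peer.name.split('-')[1] }}"
#   gluster-kube1-0                                         -> "kube1"  (the node name, as assumed)
#   gluster-ip-172-31-47-15.us-west-2.compute.internal-0    -> "ip"     (not a valid inventory host)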

hongkailiu commented Oct 25 '18 20:10

https://github.com/gluster/gcs/blob/master/deploy/tasks/add-devices-to-peer.yml#L13 This line uses the JSON result from the endpoint and then uses the hostname as the key to look up the devices in the inventory. Maybe we need to wait for the fix/response to bz1643191.
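
For illustration, the device lookup there is presumably along these lines (hypothetical, not the exact task):

    loop: "{{ hostvars[kube_hostname].gcs_disks }}"

so the derived kube_hostname has to match an inventory hostname exactly, which the AWS-style names break.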

The workaround above only partially fixes the playbook's problem; the run still cannot get through.

hongkailiu commented Oct 25 '18 21:10