Failed to deploy on OCP 3.11
- Need SCC setup for the hostpath plugin: followed this doc to work around it (see the sketch after this list).
- The OCP cluster is on AWS: node names like ip-172-31-10-185.us-west-2.compute.internal cause a problem for the StatefulSet. Created bz1643191.
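For reference, a sketch of the kind of SCC grant that was needed, assuming the hostpath SCC from that doc exists and the GCS pods run under the default service account in the gcs namespace (both visible in the pod description further below):
# oc adm policy add-scc-to-user hostpath -z default -n gcs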
To work around the node-name issue, use a simpler name for the StatefulSet:
# git diff deploy-gcs.yml tasks/create-gd2-manifests.yml templates/gcs-manifests/gcs-gd2.yml.j2
diff --git a/deploy/deploy-gcs.yml b/deploy/deploy-gcs.yml
index 5efd085..fc0bac1 100644
--- a/deploy/deploy-gcs.yml
+++ b/deploy/deploy-gcs.yml
@@ -42,8 +42,10 @@
- name: GCS Pre | Manifests | Create GD2 manifests
include_tasks: tasks/create-gd2-manifests.yml
loop: "{{ groups['gcs-node'] }}"
loop_control:
+ index_var: index
loop_var: gcs_node
post_tasks:
diff --git a/deploy/tasks/create-gd2-manifests.yml b/deploy/tasks/create-gd2-manifests.yml
index d9a2d2d..4c015ef 100644
--- a/deploy/tasks/create-gd2-manifests.yml
+++ b/deploy/tasks/create-gd2-manifests.yml
@@ -3,6 +3,7 @@
- name: GCS Pre | Manifests | Create GD2 manifests for {{ gcs_node }} | Set fact kube_hostname
set_fact:
kube_hostname: "{{ gcs_node }}"
+ gcs_node_index: "{{ index }}"
- name: GCS Pre | Manifests | Create GD2 manifests for {{ gcs_node }} | Create gcs-gd2-{{ gcs_node }}.yml
template:
diff --git a/deploy/templates/gcs-manifests/gcs-gd2.yml.j2 b/deploy/templates/gcs-manifests/gcs-gd2.yml.j2
index fe48b35..3376b11 100644
--- a/deploy/templates/gcs-manifests/gcs-gd2.yml.j2
+++ b/deploy/templates/gcs-manifests/gcs-gd2.yml.j2
@@ -2,7 +2,7 @@
kind: StatefulSet
apiVersion: apps/v1
metadata:
- name: gluster-{{ kube_hostname }}
+ name: gluster-{{ gcs_node_index }}
namespace: {{ gcs_namespace }}
labels:
app.kubernetes.io/part-of: gcs
- Then it failed on "Wait for glusterd2-cluster to become ready":
TASK [GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready] **********************************************************
Thursday 25 October 2018 18:39:50 +0000 (0:00:00.083) 0:00:54.274 ******
FAILED - RETRYING: GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready (50 retries left).
# oc get pod
NAME READY STATUS RESTARTS AGE
etcd-6jmbmv6sw7 1/1 Running 0 23m
etcd-mvwq6c2w6f 1/1 Running 0 23m
etcd-n92rtb9wfr 1/1 Running 0 23m
etcd-operator-54bbdfc55d-mdvd9 1/1 Running 0 24m
gluster-0-0 1/1 Running 7 23m
gluster-1-0 1/1 Running 7 23m
gluster-2-0 1/1 Running 7 23m
# oc describe pod gluster-1-0
Name: gluster-1-0
Namespace: gcs
Priority: 0
PriorityClassName: <none>
Node: ip-172-31-59-125.us-west-2.compute.internal/172.31.59.125
Start Time: Thu, 25 Oct 2018 18:39:49 +0000
Labels: app.kubernetes.io/component=glusterfs
app.kubernetes.io/name=glusterd2
app.kubernetes.io/part-of=gcs
controller-revision-hash=gluster-1-598d756667
statefulset.kubernetes.io/pod-name=gluster-1-0
Annotations: openshift.io/scc=hostpath
Status: Running
IP: 172.21.0.15
Controlled By: StatefulSet/gluster-1
Containers:
glusterd2:
Container ID: docker://0433446ecbd7a25d5aa9f51f0bd5c3226090850b18d0e63d58d07e47c6fdd039
Image: docker.io/gluster/glusterd2-nightly:20180920
Image ID: docker-pullable://docker.io/gluster/glusterd2-nightly@sha256:7013c3de3ed2c8b9c380c58b7c331dfc70df39fe13faea653b25034545971072
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 25 Oct 2018 19:00:48 +0000
Finished: Thu, 25 Oct 2018 19:03:48 +0000
Ready: False
Restart Count: 7
Liveness: http-get http://:24007/ping delay=10s timeout=1s period=60s #success=1 #failure=3
Environment:
GD2_ETCDENDPOINTS: http://etcd-client.gcs:2379
GD2_CLUSTER_ID: dd68cd6b-b828-4c13-86a4-35c492b5d4c2
GD2_CLIENTADDRESS: gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24007
GD2_PEERADDRESS: gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24008
GD2_RESTAUTH: false
Mounts:
/dev from gluster-dev (rw)
/run/lvm from gluster-lvm (rw)
/sys/fs/cgroup from gluster-cgroup (ro)
/usr/lib/modules from gluster-kmods (ro)
/var/lib/glusterd2 from glusterd2-statedir (rw)
/var/log/glusterd2 from glusterd2-logdir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hvj7w (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
gluster-dev:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
gluster-cgroup:
Type: HostPath (bare host directory volume)
Path: /sys/fs/cgroup
HostPathType:
gluster-lvm:
Type: HostPath (bare host directory volume)
Path: /run/lvm
HostPathType:
gluster-kmods:
Type: HostPath (bare host directory volume)
Path: /usr/lib/modules
HostPathType:
glusterd2-statedir:
Type: HostPath (bare host directory volume)
Path: /var/lib/glusterd2
HostPathType: DirectoryOrCreate
glusterd2-logdir:
Type: HostPath (bare host directory volume)
Path: /var/log/glusterd2
HostPathType: DirectoryOrCreate
default-token-hvj7w:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-hvj7w
Optional: false
QoS Class: BestEffort
Node-Selectors: node-role.kubernetes.io/compute=true
Tolerations: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25m default-scheduler Successfully assigned gcs/gluster-1-0 to ip-172-31-59-125.us-west-2.compute.internal
Normal Pulling 25m kubelet, ip-172-31-59-125.us-west-2.compute.internal pulling image "docker.io/gluster/glusterd2-nightly:20180920"
Normal Pulled 24m kubelet, ip-172-31-59-125.us-west-2.compute.internal Successfully pulled image "docker.io/gluster/glusterd2-nightly:20180920"
Normal Created 16m (x4 over 24m) kubelet, ip-172-31-59-125.us-west-2.compute.internal Created container
Normal Started 16m (x4 over 24m) kubelet, ip-172-31-59-125.us-west-2.compute.internal Started container
Normal Killing 16m (x3 over 22m) kubelet, ip-172-31-59-125.us-west-2.compute.internal Killing container with id docker://glusterd2:Container failed liveness probe.. Container will be killed and recreated.
Normal Pulled 16m (x3 over 22m) kubelet, ip-172-31-59-125.us-west-2.compute.internal Container image "docker.io/gluster/glusterd2-nightly:20180920" already present on machine
Warning Unhealthy 4m (x21 over 24m) kubelet, ip-172-31-59-125.us-west-2.compute.internal Liveness probe failed: Get http://172.21.0.15:24007/ping: dial tcp 172.21.0.15:24007: connect: connection refused
Forgot to mention: I needed to override the kubectl path.
# ansible-playbook -i ~/aaa/gcs.yml deploy-gcs.yml --extra-vars "kubectl=/usr/bin/kubectl" -v
Nice. Thanks for trying it out. :+1:
Since you've changed the SS name, I think you'll also need to change GD2_CLIENTADDRESS and GD2_PEERADDRESS. They should end up w/ the name of the pod that gets spawned by the SS that you renamed... i.e., <the_ss_name>-0. I think this will fix the problem of gd2 being unhealthy.
@JohnStrunk
I changed the name using the new variable; the old variable kube_hostname was left intact.
- name: gluster-{{ kube_hostname }}
+ name: gluster-{{ gcs_node_index }}
# oc get sts gluster-ip-172-31-47-15-us-west-2-compute-internal -o yaml | grep ESS -A1 | grep -v Set
creationTimestamp: 2018-10-25T19:31:43Z
--
- name: GD2_CLIENTADDRESS
value: gluster-ip-172-31-47-15.us-west-2.compute.internal-0.glusterd2.gcs:24007
- name: GD2_PEERADDRESS
value: gluster-ip-172-31-47-15.us-west-2.compute.internal-0.glusterd2.gcs:24008
So that should not be the cause. Did I miss something?
From above:
Environment:
GD2_ETCDENDPOINTS: http://etcd-client.gcs:2379
GD2_CLUSTER_ID: dd68cd6b-b828-4c13-86a4-35c492b5d4c2
GD2_CLIENTADDRESS: gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24007
GD2_PEERADDRESS: gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24008
GD2_RESTAUTH: false
Client and peer addresses need to aim at the pod's address. I think you need to update those ENV vars in the template in addition to changing the name field.
The pod's name is now gluster-1-0, but client & peer still point to the old name.
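For example, something along these lines in gcs-gd2.yml.j2 (a sketch only, reusing the gcs_node_index variable from the earlier diff; the exact surrounding template context is assumed):
        - name: GD2_CLIENTADDRESS
          value: "gluster-{{ gcs_node_index }}-0.glusterd2.{{ gcs_namespace }}:24007"
        - name: GD2_PEERADDRESS
          value: "gluster-{{ gcs_node_index }}-0.glusterd2.{{ gcs_namespace }}:24008"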
Yes. Now I understand.
New issue:
TASK [GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready] **********************************************************
Thursday 25 October 2018 19:51:00 +0000 (0:00:00.097) 0:00:54.215 ******
FAILED - RETRYING: GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready (50 retries left).
fatal: [master]: FAILED! => {"msg": "The conditional check 'result.status is defined and (result.status == 200 and result.json|length == groups['kube-node']|length)' failed. The error was: error while evaluating conditional (result.status is defined and (result.status == 200 and result.json|length == groups['kube-node']|length)): 'dict object' has no attribute 'kube-node'"}
My inventory file:
# cat ~/aaa/gcs.yml
master ansible_host=ip-172-31-43-164.us-west-2.compute.internal
ip-172-31-47-15.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'
ip-172-31-59-125.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'
ip-172-31-60-208.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'
[kube-master]
master
[gcs-node]
ip-172-31-47-15.us-west-2.compute.internal
ip-172-31-59-125.us-west-2.compute.internal
ip-172-31-60-208.us-west-2.compute.internal
I think I know the problem: it should be groups['gcs-node'] instead of groups['kube-node']. My inventory has no kube-node group, which is why the conditional fails with "'dict object' has no attribute 'kube-node'".
Will know soon.
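If so, only the group name in the wait task's conditional needs to change, roughly this (assuming the quoted expression is the task's until condition):
  until: result.status is defined and (result.status == 200 and result.json|length == groups['gcs-node']|length)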
Now a new issue:
TASK [GCS | GD2 Cluster | Add devices | Set facts] *****************************************************************************
Thursday 25 October 2018 20:15:04 +0000 (0:00:00.041) 0:01:30.000 ******
ok: [master] => {"ansible_facts": {"kube_hostname": "ip"}, "changed": false}
TASK [GCS | GD2 Cluster | Add devices | Add devices for ip] ********************************************************************
Thursday 25 October 2018 20:15:05 +0000 (0:00:00.115) 0:01:30.115 ******
fatal: [master]: FAILED! => {"msg": "u\"hostvars['ip']\" is undefined"}
Will continue tomorrow.
Found the next problem: the playbook places strong restrictions on the node name.
https://github.com/gluster/gcs/blob/master/deploy/tasks/add-devices-to-peer.yml#L5
This line assumes the node name is at index 1 after splitting. Will think about a solution tomorrow.
https://github.com/gluster/gcs/blob/master/deploy/tasks/add-devices-to-peer.yml#L13 This line uses the JSON result from the endpoint and then uses the hostname as the key to look up the devices in the inventory. Maybe we need to wait for the fix/response on bz1643191.
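As a purely hypothetical illustration of the first restriction (not the playbook's actual code): if the name being split still carries the AWS-style hostname, a '-' split leaves just "ip" at index 1, which matches the kube_hostname: "ip" fact above and would explain why hostvars['ip'] is undefined.
- set_fact:
    kube_hostname: "{{ 'gluster-ip-172-31-47-15.us-west-2.compute.internal-0'.split('-')[1] }}"
  # renders to "ip", so the later hostvars['ip'] inventory lookup fails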
The workaround above only partially fixes the playbook problem; the deployment still cannot go through.