containers [bitnami/etcd] Fix unbound host variable when disaster recovery is enabled

Description of the change

When you start a pod with disaster recovery enabled on an image newer than 3.5.4-debian-11-r14 (at least), the init script fails because the variable host is unbound. This change adds the default port as a variable and uses hostname -f to get the host of the current container/pod.
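As a rough illustration of the approach described above (not the actual patch), the sketch below assigns both the host and the port before they are used, so that a script running with nounset does not abort. The names DEFAULT_PEER_PORT, current_host and peer_url are hypothetical and do not come from libetcd.sh; 2380 is the peer port that appears in the reproduction commands in this thread.

```shell
#!/usr/bin/env bash
set -o nounset  # abort on any reference to an unset variable

# Hypothetical default (assumption): 2380 is the peer port used in this thread.
DEFAULT_PEER_PORT="${DEFAULT_PEER_PORT:-2380}"

# Hypothetical helper: resolve the name of the current container/pod.
# Falls back to "uname -n" when "hostname -f" is unavailable or fails.
current_host() {
    hostname -f 2>/dev/null || uname -n
}

# Hypothetical helper: build a peer URL from values that are always
# assigned before use. An optional first argument overrides the port.
peer_url() {
    local host port
    host="$(current_host)"
    port="${1:-$DEFAULT_PEER_PORT}"
    printf 'http://%s:%s\n' "$host" "$port"
}

peer_url
```

Because host is assigned inside the function and the port falls back to a default, neither expansion can trip nounset.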

Benefits

The server can start again

Possible drawbacks

None that I am aware of

Applicable issues

Additional information

Sep 07 '22 12:09 jaysonsantos

Hi @jaysonsantos,

Could you rebase this branch onto main?

Sep 08 '22 11:09 Mauraza

@Mauraza done

Sep 08 '22 12:09 jaysonsantos

Hi @jaysonsantos,

Could you add to this thread a way to reproduce the issue and the logs related to when the pod fails?

Sep 08 '22 14:09 Mauraza

Hi @Mauraza, that was tricky to simulate, but you can see the behavior below. It mimics a state where the local data is broken and the current member has to restore the data from a snapshot.

mkdir -p etcd/{snapshots,data} && echo does not matter | tee  etcd/{data/member_id,snapshots/.disaster_recovery} \
&& docker run -u $(id -u) --name etcd -e ETCD_DISABLE_PRESTOP=yes \
-e ETCD_ACTIVE_ENDPOINTS=does-not-matter \
-e ETCD_INITIAL_CLUSTER=http://localhost:2380,http://fake-down-server:2380 \
-e BITNAMI_DEBUG=yes -e ETCD_DISASTER_RECOVERY=yes \
-e ALLOW_NONE_AUTHENTICATION=yes --rm -it \
-v $PWD/etcd/snapshots:/snapshots \
-v $PWD/etcd/data:/bitnami/etcd/data \
bitnami/etcd:3.5.4-debian-11-r33

with the output:

etcd 18:41:47.60
etcd 18:41:47.62 Welcome to the Bitnami etcd container
etcd 18:41:47.65 Subscribe to project updates by watching https://github.com/bitnami/containers
etcd 18:41:47.67 Submit issues and feature requests at https://github.com/bitnami/containers/issues
etcd 18:41:47.69
etcd 18:41:47.72 INFO  ==> ** Starting etcd setup **
etcd 18:41:47.84 INFO  ==> Validating settings in ETCD_* env vars..
etcd 18:41:47.88 WARN  ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 18:41:47.92 INFO  ==> Initializing etcd
etcd 18:41:47.95 INFO  ==> Generating etcd config file using env variables
etcd 18:41:48.17 INFO  ==> Detected data from previous deployments
etcd 18:41:48.22 INFO  ==> The member will try to join the cluster by it's own
/opt/bitnami/scripts/libetcd.sh: line 448: host: unbound variable
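The final log line is the message bash prints when a script running with nounset expands a variable that was never assigned. A minimal sketch of that class of failure, outside the container and independent of libetcd.sh (the URL format and the value localhost are only illustrative):

```shell
#!/usr/bin/env bash

# Run the failing pattern in a child shell so the error does not abort this script.
# "unset host" guards against a value inherited from the environment.
output="$(bash -c 'unset host; set -o nounset; echo "http://${host}:2380"' 2>&1)"
status=$?
echo "exit status: ${status}"
echo "message: ${output}"

# Assigning the variable before it is expanded avoids the error.
fixed="$(bash -c 'set -o nounset; host="localhost"; echo "http://${host}:2380"' 2>&1)"
echo "fixed: ${fixed}"
```

The first child shell exits with a non-zero status and a message ending in "host: unbound variable"; the second prints the URL.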

The same input with my fix produces the following:

docker run -u $(id -u) --name etcd -e ETCD_DISABLE_PRESTOP=yes \
-e ETCD_ACTIVE_ENDPOINTS=does-not-matter \
-e ETCD_INITIAL_CLUSTER=http://localhost:2380,http://fake-down-server:2380 \
-e BITNAMI_DEBUG=yes -e ETCD_DISASTER_RECOVERY=yes \
-e ALLOW_NONE_AUTHENTICATION=yes --rm -it \
-v $PWD/etcd/snapshots:/snapshots \
-v $PWD/etcd/data:/bitnami/etcd/data \
etcd-fix
etcd 18:44:19.23
etcd 18:44:19.26 Welcome to the Bitnami etcd container
etcd 18:44:19.28 Subscribe to project updates by watching https://github.com/bitnami/containers
etcd 18:44:19.31 Submit issues and feature requests at https://github.com/bitnami/containers/issues
etcd 18:44:19.33
etcd 18:44:19.35 INFO  ==> ** Starting etcd setup **
etcd 18:44:19.48 INFO  ==> Validating settings in ETCD_* env vars..
etcd 18:44:19.51 WARN  ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 18:44:19.56 INFO  ==> Initializing etcd
etcd 18:44:19.59 INFO  ==> Generating etcd config file using env variables
etcd 18:44:19.82 INFO  ==> Detected data from previous deployments
etcd 18:44:19.86 INFO  ==> The member will try to join the cluster by it's own
etcd 18:44:20.21 DEBUG ==> Last member to recover from the disaster!
etcd 18:44:20.25 WARN  ==> Cluster not responding!
etcd 18:44:20.31 ERROR ==> There was no snapshot to restore!

If there were a valid snapshot, it would just keep going and restore from it.


Sep 08 '22 18:09 jaysonsantos

Hi @jaysonsantos,

The environment variable ETCD_INITIAL_CLUSTER https://etcd.io/docs/v3.1/op-guide/configuration/#--initial-cluster only supports one URL. Could you try with ETCD_ADVERTISE_CLIENT_URLS instead?

mkdir -p etcd/{snapshots,data} && echo does not matter | tee etcd/{data/member_id,snapshots/.disaster_recovery} \
&& docker run -u $(id -u) --name etcd \
-e ETCD_DISABLE_PRESTOP=yes \
-e ETCD_ACTIVE_ENDPOINTS=does-not-matter \
-e ETCD_ADVERTISE_CLIENT_URLS=http://localhost:2380,http://fake-down-server:2380 \
-e BITNAMI_DEBUG=yes \
-e ETCD_DISASTER_RECOVERY=yes \
-e ALLOW_NONE_AUTHENTICATION=yes \
--rm -it \
-v $PWD/etcd/snapshots:/snapshots \
-v $PWD/etcd/data:/bitnami/etcd/data \
bitnami/etcd:3.5.4-debian-11-r33

Sep 14 '22 09:09 Mauraza

Hi there @Mauraza, I got that config from a running container that was created by the Helm chart. Maybe it should always be set as a single value?

Sep 14 '22 11:09 jaysonsantos

This is the place where the chart sets more than one URL: https://github.com/bitnami/charts/blob/d36311748078c08e2ad5a8cc64b2c02007304636/bitnami/etcd/templates/statefulset.yaml#L194-L199

Sep 14 '22 11:09 jaysonsantos

Hi @jaysonsantos,

That is right. In that case you need to set the environment variable as ETCD_INITIAL_CLUSTER=one=http://localhost:2380,fake=http://fake-down-server:2380. You can check this docker-compose file as an example: https://github.com/bitnami/containers/blob/013e48a91036db911a706a0ed4aa133de35ba772/bitnami/etcd/docker-compose-cluster.yml#L14
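To make the name=URL format concrete, here is a small sketch that splits such a value into its member entries. The helper list_members is hypothetical and only illustrates the format; it is not how the container scripts parse the variable.

```shell
#!/usr/bin/env bash
set -o nounset

# Value in the name=URL format, taken from the comment above.
initial_cluster="one=http://localhost:2380,fake=http://fake-down-server:2380"

# Hypothetical helper: print one "name url" pair per line.
list_members() {
    local entry
    local -a entries
    # Split the comma-separated list into an array.
    IFS=',' read -r -a entries <<< "$1"
    for entry in "${entries[@]}"; do
        # Everything before the first "=" is the member name; the rest is the peer URL.
        printf '%s %s\n' "${entry%%=*}" "${entry#*=}"
    done
}

list_members "$initial_cluster"
```

Each entry pairs a member name with its peer URL, which is why a bare list of URLs without names does not match the expected format.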

Sep 14 '22 12:09 Mauraza

Hi, yes, but I came up with those variables to mimic the state in which the Helm chart renders the containers, and the fix is meant to keep that failure from happening. What happens there is that, when disaster recovery is enabled and the server actually has to perform a recovery, it breaks after that r11 version. In the end, the script was just a means of showcasing the error and the fix.

Sep 14 '22 12:09 jaysonsantos

Hi @jaysonsantos,

Could you share the logs of the error and the values used for the chart? I will try to reproduce it.

Sep 19 '22 07:09 Mauraza

Sure, I will try to deploy another instance of it and reproduce the error.

Sep 19 '22 07:09 jaysonsantos

This Pull Request has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thank you for your contribution.

Oct 07 '22 02:10 github-actions[bot]

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Pull Request. Do not hesitate to reopen it later if necessary.

Oct 13 '22 01:10 github-actions[bot]