Chaos tests: Implement first batch of tests based on continuous delivery scenario

Open RealAnna opened this issue 4 years ago • 0 comments

The goal of this ticket is to create tests that verify Keptn's high availability (KEP48). Using the Litmus dashboard, create scheduled/randomly repeated tests for the following continuous delivery scenarios:

Network Failure

Introduce delays of 10s, 30s, and 60s.

[ ] Inject network communication latency towards shipyard-controller. Expected behavior:
- I would expect shipyard-controller to fail the sequence (not certain), but the sequence should terminate either way.
[ ] Inject network communication latency towards the Gitea upstream. Expected behavior:
- I would expect shipyard-controller to fail the sequences, but each sequence should terminate either way.
[ ] Kill NATS. Expected behavior:
- I would expect shipyard-controller not to receive events during NATS downtime, but to have no issues after the restart. A new sequence should terminate once NATS comes back up.
[ ] Kill api-gateway. Expected behavior:
- I would expect shipyard-controller to fail the sequences, but each sequence should terminate either way.
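
The three latency steps above could be expressed as Litmus `ChaosEngine` manifests. The following is a minimal sketch, not a tested setup: it assumes the `pod-network-latency` experiment is installed, that Keptn runs in a `keptn` namespace, that the target deployment carries the label `app.kubernetes.io/name=shipyard-controller`, and that a `litmus-admin` service account exists. The function name, the 120s chaos duration, and the engine names are placeholders. Note that this experiment adds latency on the target pod's own network interface, which only approximates "latency towards" the service.

```python
import json

# Delays from the issue (10s, 30s, 60s), in milliseconds as NETWORK_LATENCY expects.
DELAYS_MS = [10_000, 30_000, 60_000]


def network_latency_engine(target_label, delay_ms, namespace="keptn", duration_s=120):
    """Build a ChaosEngine manifest (as a dict) for pod-network-latency.

    namespace, service account, and duration are assumptions, not values
    taken from the issue.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {
            "name": f"network-latency-{delay_ms // 1000}s",
            "namespace": namespace,
        },
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": namespace,
                "applabel": target_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": "litmus-admin",  # assumed service account
            "experiments": [
                {
                    "name": "pod-network-latency",
                    "spec": {
                        "components": {
                            "env": [
                                {"name": "NETWORK_LATENCY", "value": str(delay_ms)},
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                            ]
                        }
                    },
                }
            ],
        },
    }


# One engine per delay, targeting shipyard-controller (label is an assumption).
engines = [
    network_latency_engine("app.kubernetes.io/name=shipyard-controller", delay)
    for delay in DELAYS_MS
]

if __name__ == "__main__":
    for engine in engines:
        print(json.dumps(engine, indent=2))
```

Since JSON is valid YAML, the printed manifests could be applied with `kubectl apply -f`, or used as a starting point for a scheduled workflow in the Litmus dashboard.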

Pod Failure

[ ] Killing a random control plane pod. Expected behavior:
- If the pod is the shipyard-controller and it is the leader, a new leader election should happen, the deployment should be successful, and all pods should be up and running.
- If the pod belongs to a service involved in the deployment (e.g. lighthouse, helm), its replicas should keep working, the deployment should be successful, and all pods should be up and running.
[ ] Killing all shipyard-controller pods. Expected behavior:
- A new leader and replicas should pick up queued events, the deployment should be successful, and all pods should be up and running.
[ ] Killing all control plane pods for one of helm, lighthouse, or api. Expected behavior:
- The sequence should terminate (its status may be failed), and all pods should be up and running.
[ ] Killing all jmeter service pods. Expected behavior:
- The sequence should terminate, the sequence status must be reported as failed by JMeter, and all pods should be up and running.
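
Every scenario above ends with "all pods should be up and running", so a shared check may be useful. The sketch below assumes the input is the parsed JSON from `kubectl get pods -n keptn -o json` (the `keptn` namespace is an assumption) and treats a pod as healthy when its phase is `Running` and every container reports `ready`. The function name and the sample pod names are hypothetical.

```python
def pods_not_ready(pod_list):
    """Return the names of pods that are not Running with all containers ready.

    pod_list is a Kubernetes PodList as a dict, e.g. parsed kubectl JSON output.
    """
    failing = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        healthy = (
            status.get("phase") == "Running"
            and bool(containers)  # no container statuses yet means not ready
            and all(c.get("ready", False) for c in containers)
        )
        if not healthy:
            failing.append(pod["metadata"]["name"])
    return failing


# Hypothetical sample: one healthy pod, one pod whose container is not ready.
sample = {
    "items": [
        {
            "metadata": {"name": "shipyard-controller-abc"},
            "status": {"phase": "Running", "containerStatuses": [{"ready": True}]},
        },
        {
            "metadata": {"name": "lighthouse-service-xyz"},
            "status": {"phase": "Running", "containerStatuses": [{"ready": False}]},
        },
    ]
}

if __name__ == "__main__":
    print(pods_not_ready(sample))
```

Pods from completed jobs (phase `Succeeded`) would be reported as failing by this check, so they would need to be filtered out first if the namespace contains any.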

Container Failure

[ ] Killing distributor in control plane pods. Expected behavior:
- The service fails to respond to shipyard; the sequence may fail, but it will terminate.
[ ] Killing the service container in control plane pods. Expected behavior:
- The sequence should eventually end by succeeding, failing, or timing out.
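
The recurring expectation that a sequence "terminates" could be asserted with a polling helper like the sketch below. It makes no assumption about how the state is fetched: `get_state` is a hypothetical callable supplied by the test (for example, a wrapper around whatever API returns the sequence state). The terminal state names in `TERMINAL_STATES` are assumptions and should be checked against the states Keptn actually reports.

```python
import time

# Assumed names for terminal sequence states; verify against Keptn before use.
TERMINAL_STATES = {"finished", "timedOut", "aborted"}


def wait_for_terminal_state(get_state, timeout_s=600.0, interval_s=5.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns a terminal state or the timeout expires.

    Returns the terminal state, or raises TimeoutError if the sequence is
    still running when the deadline passes.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"sequence still in state {state!r} after {timeout_s}s"
            )
        sleep(interval_s)


if __name__ == "__main__":
    # Fake state source standing in for a real lookup.
    states = iter(["triggered", "started", "finished"])
    print(wait_for_terminal_state(lambda: next(states), sleep=lambda _: None))
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real waiting. A chaos test would fail on `TimeoutError` (the sequence hung) and otherwise accept any terminal state where the expected behavior allows the sequence to fail.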

Node Failure

[ ] Cause a node restart. Expected behavior:
- Some downtime, but eventually all of Keptn core should be up and working.

Feb 23 '22 10:02 RealAnna