keptn
Chaos tests: Implement first batch of tests based on continuous delivery scenario
The goal of this ticket is to create tests that verify Keptn's high availability (KEP48). Using the Litmus dashboard, create scheduled or randomly repeated tests for the following continuous delivery scenarios:
Network Failure
Introduce delays of 10s, 30s, and 60s.
- [ ] Inject network communication latency towards shipyard-controller. Expected behavior:
- I would expect shipyard-controller to fail the sequence (to be confirmed), but the sequence should terminate either way.
- [ ] Inject network communication latency towards the Gitea upstream. Expected behavior:
- I would expect shipyard-controller to fail the sequences, but each sequence should terminate either way.
- [ ] Kill NATS. Expected behavior:
- I would expect shipyard-controller not to receive events during the NATS downtime, but to have no issues after the restart. A new sequence should terminate once NATS comes back up.
- [ ] Kill api-gateway. Expected behavior:
- I would expect shipyard-controller to fail the sequences, but each sequence should terminate either way.
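
As a rough illustration of the three latency steps, the sketch below builds one Litmus `ChaosEngine` manifest per delay for the `pod-network-latency` experiment, which takes its delay in milliseconds through `NETWORK_LATENCY`. The namespace `keptn`, the label selector, the service account `litmus-admin`, and the 120s chaos duration are assumptions for illustration and are not taken from this ticket; adjust them to the actual installation.

```python
import json


def latency_engine(delay_seconds,
                   app_label="app.kubernetes.io/name=shipyard-controller",  # assumed label
                   namespace="keptn",                                       # assumed namespace
                   duration_seconds=120):                                   # assumed duration
    """Build a ChaosEngine manifest (as a dict) that injects network latency."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {
            "name": f"shipyard-latency-{delay_seconds}s",
            "namespace": namespace,
        },
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": "litmus-admin",  # assumed service account
            "experiments": [{
                "name": "pod-network-latency",
                "spec": {"components": {"env": [
                    # Litmus expects the latency in milliseconds.
                    {"name": "NETWORK_LATENCY", "value": str(delay_seconds * 1000)},
                    {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_seconds)},
                ]}},
            }],
        },
    }


# One engine per delay listed in this ticket: 10s, 30s, 60s.
engines = [latency_engine(seconds) for seconds in (10, 30, 60)]

if __name__ == "__main__":
    for engine in engines:
        print(json.dumps(engine, indent=2))
```

The printed JSON can be applied with `kubectl apply -f -`, or the same values can be entered in the Litmus dashboard when scheduling the workflow.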
Pod Failure
- [ ] Killing a random control plane pod. Expected behavior:
- If the pod is the shipyard-controller and it is the leader, a new leader election should happen, the deployment should be successful, and all pods should be up and running.
- If the pod is a service involved in the deployment (e.g. lighthouse, helm), the remaining replicas should keep working, the deployment should be successful, and all pods should be up and running.
- [ ] Killing all shipyard-controller pods. Expected behavior:
- The new leader and replicas should pick up queued events, the deployment should be successful, and all pods should be up and running.
- [ ] Killing all control plane pods for one of helm/lighthouse/api. Expected behavior:
- The sequence should terminate, the sequence status can be failed, and all pods should be up and running.
- [ ] Killing all jmeter service pods. Expected behavior:
- The sequence should terminate, the sequence status must be set to failed by JMeter, and all pods should be up and running.
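
The recurring check "all pods should be up and running" could be automated along these lines. This is a minimal sketch that inspects the JSON produced by `kubectl get pods -o json`; the helper name `pods_not_ready` is hypothetical, and the rule used here (phase `Running` and every container reporting `ready`) is one possible reading of "up and running".

```python
def pods_not_ready(pod_list):
    """Return the names of pods that are not Running with all containers ready.

    `pod_list` is the parsed output of `kubectl get pods -o json`.
    """
    not_ready = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        running = status.get("phase") == "Running"
        # A pod with no reported container statuses is treated as not ready.
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if not (running and all_ready):
            not_ready.append(pod["metadata"]["name"])
    return not_ready


# Small hand-written sample in the shape kubectl returns (illustrative data only).
sample = {
    "items": [
        {"metadata": {"name": "pod-a"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"ready": True}, {"ready": True}]}},
        {"metadata": {"name": "pod-b"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"ready": True}, {"ready": False}]}},
        {"metadata": {"name": "pod-c"},
         "status": {"phase": "Pending"}},
    ]
}
```

In a test run, the input would come from something like `kubectl get pods -n keptn -o json` (namespace assumed), and the test passes once the function returns an empty list.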
Container Failure
- [ ] Killing distributor in control plane pods. Expected behavior:
- The service fails to respond to shipyard-controller; the sequence may fail, but it will terminate.
- [ ] Killing the service container in control plane pods. Expected behavior:
- The sequence should eventually end, either successfully, failing, or timing out.
Node Failure
- [ ] Cause a node restart. Expected behavior:
- Some downtime, but eventually all Keptn core services should be up and working.
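
Most scenarios above share one assertion: the sequence must eventually terminate, whatever its final status. A minimal sketch of that check is a polling loop with a deadline. The `get_state` callable is a hypothetical hook that would query the sequence state from Keptn, and the state names in `TERMINAL_STATES` are assumptions that should be aligned with what the API actually reports.

```python
import time

# Assumed names for states that count as "terminated"; adjust to the real API.
TERMINAL_STATES = {"finished", "failed", "timedOut", "aborted"}


def wait_for_termination(get_state, timeout_s=600.0, interval_s=5.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll `get_state()` until it returns a terminal state or the timeout expires.

    Returns the terminal state; raises TimeoutError if the sequence is still
    running when the deadline passes.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"sequence still in state {state!r} after {timeout_s}s")
        sleep(interval_s)
```

A chaos test would call `wait_for_termination` after the fault is injected, then compare the returned state with the expectation for that scenario, for example accepting either success or failure for the container-kill cases but requiring a failed status for the jmeter case.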