Chaos tests: Implement first batch of tests based on continuous delivery scenario

Open RealAnna opened this issue 4 years ago • 0 comments

The goal of this ticket is to create tests that verify Keptn's high availability (KEP 48). Using the Litmus dashboard, create scheduled or randomly repeated tests for the following continuous delivery scenarios:


Network Failure

Introduce delays of 10s, 30s, and 60s.

  • [ ] Inject network communication latency towards shipyard-controller. Expected behavior:
    • I would expect shipyard-controller to fail the sequence (unconfirmed). Either way, the sequence should terminate.
  • [ ] Inject network communication latency towards the gitea upstream. Expected behavior:
    • I would expect shipyard-controller to fail the sequences. Either way, the sequences should terminate.
  • [ ] Kill NATS. Expected behavior:
    • I would expect shipyard-controller not to receive events during the NATS downtime, but to have no issues after the restart. A new sequence should terminate once NATS comes back up.
  • [ ] Kill api-gateway. Expected behavior:
    • I would expect shipyard-controller to fail the sequences. Either way, the sequences should terminate.
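
As a rough sketch of how the latency cases above could be enumerated before they are scheduled in Litmus, the snippet below builds one parameter set per target and delay. The target names come from this ticket. The `name`, `target`, and `latency_ms` keys are hypothetical, and the conversion to milliseconds assumes the chosen network-latency experiment expects that unit.

```python
from itertools import product

# Delays named in this ticket, in seconds.
DELAYS_S = (10, 30, 60)
# Targets named in this ticket; how they map to pod labels is left open.
TARGETS = ("shipyard-controller", "gitea")


def latency_cases(delays_s=DELAYS_S, targets=TARGETS):
    """Return one parameter dict per (target, delay) combination."""
    return [
        {
            # Hypothetical naming scheme, for illustration only.
            "name": f"latency-{target}-{delay}s",
            "target": target,
            # Assumption: the experiment takes the delay in milliseconds.
            "latency_ms": delay * 1000,
        }
        for target, delay in product(targets, delays_s)
    ]


if __name__ == "__main__":
    for case in latency_cases():
        print(case)
```

This yields six cases (two targets times three delays), which keeps the scheduled experiments in step with the checklist.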

Pod Failure

  • [ ] Killing a random control plane pod. Expected behavior:
    • If the pod is the shipyard-controller and it is the leader, a new leader election should happen, the deployment should be successful, and all pods should be up and running.
    • If the pod is a service involved in the deployment (e.g. lighthouse, helm), its replicas should keep working, the deployment should be successful, and all pods should be up and running.
  • [ ] Killing all shipyard-controller pods. Expected behavior:
    • The new leader and replicas should pick up queued events, the deployment should be successful, and all pods should be up and running.
  • [ ] Killing all control plane pods of either helm, lighthouse, or api. Expected behavior:
    • The sequence should terminate, the sequence status can be failed, and all pods should be up and running.
  • [ ] Killing all jmeter service pods. Expected behavior:
    • The sequence should terminate, the sequence status must be set to failed by JMeter, and all pods should be up and running.
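
The recurring "all pods should be up and running" expectation could be checked with something like the sketch below. It assumes the input is the parsed JSON of a Kubernetes pod list (for example the output of `kubectl get pods -o json` for the Keptn namespace) and relies only on the standard `status.phase` and `status.containerStatuses[].ready` fields. The function name `pods_not_ready` is made up for this example.

```python
def pods_not_ready(pod_list):
    """Return the names of pods that are not Running with all containers ready."""
    bad = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        # A pod with no reported containers is treated as not ready.
        ready = bool(containers) and all(c.get("ready", False) for c in containers)
        if status.get("phase") != "Running" or not ready:
            bad.append(pod["metadata"]["name"])
    return bad
```

An empty result means the expectation holds. Note that pods of completed jobs (phase `Succeeded`) would be reported as not ready by this check and may need to be filtered out first.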

Container Failure

  • [ ] Killing the distributor container in control plane pods. Expected behavior:
    • The service fails to respond to shipyard-controller; the sequence may fail, but it will terminate.
  • [ ] Killing the service container in control plane pods. Expected behavior:
    • The sequence should eventually end, either successfully, failing, or timing out.
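
Most scenarios in this ticket share one assertion: the sequence must reach some final state, whatever that state is. A minimal sketch of such a check is shown below. The `get_state` callable and the state names in `TERMINAL_STATES` are placeholders, not Keptn's actual API or state values, and would need to be replaced with whatever the test harness uses to read the sequence status.

```python
import time

# Hypothetical state names; the real sequence states may differ.
TERMINAL_STATES = {"finished", "failed", "timedOut", "aborted"}


def wait_for_termination(get_state, timeout_s=600.0, poll_s=5.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns a terminal state.

    Raises TimeoutError if no terminal state is seen within timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"sequence still in state {state!r} after {timeout_s}s"
            )
        sleep(poll_s)
```

The test passes on any terminal state, which matches the "may fail, but it will terminate" wording. Scenarios that require a specific outcome (for example a successful deployment) can additionally compare the returned state.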

Node Failure

  • [ ] Cause a node restart. Expected behavior:
    • Some downtime is expected, but eventually all of Keptn core should be up and working.

RealAnna, Feb 23 '22 10:02