cloud_controller_ng
Rolling deployment: cancel+restart can lead to broken app
Issue
If a failing rolling deployment is canceled and a new rolling deployment is triggered immediately afterwards (see example below), the wrong application web processes are terminated, which leaves the app broken/unavailable even though it was healthy before.
Rolling deployments should not leave the application unavailable if the new app version can't be started.
Context
cf-deployment v21.11.0 capi-release
D047883@VR9NDXJ7JP testapp % cf version
cf version 8.5.0+73aa161.2022-09-12
This is probably also the root cause of cli #2257.
Steps to Reproduce
# current dir contains an index.html file
# step 1
cf push testapp -m 128M -b binary_buildpack -c "python3 -m http.server 8080"
# app is running healthy
# break the app by setting a too-large env var
cf set-env testapp too-large-env $(printf '%*s' 200000 | tr ' ' x)
# rolling restart (won't succeed)
# step 2
cf restart testapp --strategy rolling
# cancel deployment while step 2 is running and immediately restart
# step 3
cf cancel-deployment testapp && cf restart testapp --strategy rolling
# result: app is unavailable
Current result
App is unavailable because the healthy web process from step 1 was stopped.
From analysing the CC logs (of the api and scheduler VMs) and reading the CC code:
- step 1 (initial push) creates the app with process1 running 1 healthy instance (which uses the app guid as process guid for historic reasons)
- step 2 (rolling restart after assigning bad env var) creates deployment1 which in turn creates process2 (no instance yet)
- scheduler:cc_deployment_updater > scale_deployment for deployment1 tries to create one instance for process2, which fails; this repeats
- step 3 (cf cancel-deployment testapp && cf restart testapp --strategy rolling):
  - cancel-deployment marks deployment1 as CANCELING
  - restart creates deployment2 and process3 (no instance yet)
  - scheduler:cc_deployment_updater > scale_deployment for deployment2 fails when trying to get the app lock
    - not yet fully understood why, maybe due to a parallel scale or cancel for deployment1
    - anyway, it doesn't matter why scale_deployment for deployment2 fails: process3 won't start and cc_deployment_updater retries until it is canceled
  - scheduler:cc_deployment_updater > cancel_deployment for deployment1
    - finds the wrong prior_web_process (process3) when it should have found process1
    - per the related CC code, process3 is picked because process3 != deployment1.deploying_web_process (= process2) and it is newer than process1 (see the sketch after this list)
    - cancel_deployment stops process1 (the healthy one) and process2 (which it should stop) and keeps process3 (which was never started and will never run) -> app is down
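The observed selection can be illustrated with a minimal Ruby sketch (simplified stand-ins, not the actual cloud_controller_ng code): cancel_deployment effectively keeps the newest web process that is not the canceled deployment's own deploying_web_process, which here is process3 instead of process1.

WebProcess = Struct.new(:name, :guid, :created_at)

# Stand-in for the selection the observed behaviour implies: drop the canceled
# deployment's own new web process, then keep the newest of the remaining ones.
def process_to_keep_on_cancel(web_processes, deploying_web_process)
  web_processes
    .reject { |p| p.guid == deploying_web_process.guid }
    .max_by(&:created_at)
end

process1 = WebProcess.new('process1', 'app-guid',      Time.now - 300) # healthy, from step 1
process2 = WebProcess.new('process2', 'process2-guid', Time.now - 60)  # from deployment1, never started
process3 = WebProcess.new('process3', 'process3-guid', Time.now - 10)  # from deployment2, never started

puts process_to_keep_on_cancel([process1, process2, process3], process2).name
# => "process3" -- process1 and process2 are stopped, so the app goes down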
Expected result
App remains available even though all rolling deployment attempts fail.
- cancel_deployment for deployment1 should not stop process1 but only process2 (the new process of deployment1, which never became healthy) - see the sketch after this list
- (also an acceptable result IMHO): the rolling restart in step 3 could have failed without creating deployment2
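A minimal sketch of that expected behaviour (again simplified stand-ins, not the real cloud_controller_ng code; prior_web_process is a hypothetical name for the web process that served traffic before the deployment was created):

Deployment = Struct.new(:deploying_web_process, :prior_web_process)

# Expected: only the canceled deployment's own (never-healthy) web process is stopped ...
def processes_to_stop_on_cancel(deployment)
  [deployment.deploying_web_process]   # for deployment1 this is just process2
end

# ... while the process that served traffic before the deployment stays up.
def process_to_keep_on_cancel(deployment)
  deployment.prior_web_process         # for deployment1 this is process1
end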
Possible Fix
Just ideas, need more discussion:
- ensure that there is only one active deployment at a time, i.e. the cf restart testapp --strategy rolling of step 3 should fail fast without creating deployment2, since deployment1 is still in progress (CANCELING) - see the sketch below
- a more elaborate calculation of prior_web_process that can handle multiple active deployments (e.g. the deployment model could store the prior_web_process when it gets created); it is unclear whether this can be done in a waterproof way if multiple active deployments are allowed
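A rough sketch of the first idea (fail fast), using hypothetical names and a simplified in-memory model rather than the real cloud_controller_ng API; the second idea is essentially what the sketch in the Expected result section shows (the deployment remembering its prior_web_process from creation time):

App = Struct.new(:deployments)
Deployment = Struct.new(:state)

ACTIVE_STATES = %w[DEPLOYING CANCELING].freeze

# Reject a new rolling deployment while another deployment for the same app is
# still active, so deployment2 (and process3) would never be created.
def create_rolling_deployment!(app)
  if app.deployments.any? { |d| ACTIVE_STATES.include?(d.state) }
    raise 'another deployment is still in progress for this app'
  end
  # ... otherwise create the deployment and its new web process as today ...
end

app = App.new([Deployment.new('CANCELING')])   # deployment1 right after cf cancel-deployment
begin
  create_rolling_deployment!(app)
rescue RuntimeError => e
  puts e.message                               # the second restart fails fast instead
end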