
`cf push --strategy rolling` should be zero-downtime for apps with long-running processes

Open pippinhio opened this issue 4 years ago • 3 comments

What's the user value of this feature request? Calling `cf push --strategy rolling` will be zero-downtime even if the old app is processing long-running requests at that moment (i.e. requests that take more than a few seconds to complete).

Who is the functionality for? Everybody who wants zero-downtime upgrades of an app where processes can take more than a few seconds to complete.

How often will this functionality be used by the user? Every time a user who wants zero-downtime upgrades of an app with long-running processes pushes a new version.

Who else is affected by the change? The feature won't break anything. Other users will not be affected.

Is your feature request related to a problem? Please describe. Currently, `cf push --strategy rolling` deletes the old app just a few seconds after the old app has stopped accepting new requests (i.e. after the production route has been unmapped from the old app). However, if the old app received a request just before that point and needs more than a few seconds to handle it, the old app is deleted before it can respond. Consequently, the end user of the app experiences a brief downtime.

Describe the solution you'd like The command `cf push --strategy rolling` accepts an additional parameter that controls how long the old app is kept running after the production route has been unmapped.
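The timing problem and the effect of the requested parameter can be illustrated with a small model. This is only a sketch: `request_completes` and `delete_delay` are hypothetical names invented for illustration, not part of the cf CLI, and the 3-second figure merely stands in for "a few seconds".

```python
def request_completes(arrival, duration, unmap_at, delete_delay):
    """Return True if a request finishes before the old app is deleted.

    All values are seconds on a shared clock. `delete_delay` is the
    hypothetical parameter asked for in this issue: the time between
    unmapping the route and deleting the old app.
    """
    if arrival >= unmap_at:
        # After the unmap, requests are routed to the new app instead.
        raise ValueError("request arrived after the route was unmapped")
    deleted_at = unmap_at + delete_delay
    return arrival + duration <= deleted_at


# A 10 s request that arrives 1 s before the route is unmapped:
print(request_completes(arrival=9, duration=10, unmap_at=10, delete_delay=3))   # False
print(request_completes(arrival=9, duration=10, unmap_at=10, delete_delay=15))  # True
```

In this model, a delay at least as long as the slowest expected request is enough for every in-flight request to finish.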

Describe alternatives you've considered

  • The plug-in autopilot (link) has the same issue.

  • The plug-in cf-blue-green-deploy (link) by default keeps the old app running. The user can then delete the old app after enough time has passed. However, the plug-in has other issues:

    • For instance, the plug-in creates a temporary route during deployment that's not space-specific, so the user can only deploy to one space at a time. Further, if the deployment fails, the plug-in doesn't clean up the temporary route. Deployment to other spaces is then blocked permanently because the route is taken.

    Note that the plug-in is recommended on https://docs.cloudfoundry.org/devguide/deploy-apps/blue-green.html.

  • Doing zero-downtime deployment by hand as described in https://docs.cloudfoundry.org/devguide/deploy-apps/blue-green.html (i.e. by calling `cf push`, `cf map-route`, ...) is complicated and therefore error-prone and risky.
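To give an idea of what the manual alternative involves, here is a minimal sketch that scripts the steps with an explicit drain period. It is not the procedure from the linked docs verbatim (it skips the temporary route), and it rests on assumptions: the `-new` suffix, the `drain_seconds` parameter, and the helper names are made up, and it assumes a cf CLI version where `cf push --no-route`, `cf map-route`/`cf unmap-route` with `--hostname`, `cf delete -f`, and `cf rename` are available and where the app name given on the command line overrides the name in a single-app manifest.

```python
import subprocess
import time


def blue_green_plan(app, domain, hostname, suffix="-new"):
    """Build the cf commands to run before and after the drain period."""
    new_app = app + suffix  # hypothetical naming scheme for the new version
    before_drain = [
        ["cf", "push", new_app, "--no-route"],
        ["cf", "map-route", new_app, domain, "--hostname", hostname],
        ["cf", "unmap-route", app, domain, "--hostname", hostname],
    ]
    after_drain = [
        ["cf", "delete", app, "-f"],
        ["cf", "rename", new_app, app],
    ]
    return before_drain, after_drain


def deploy(app, domain, hostname, drain_seconds,
           run=subprocess.run, sleep=time.sleep):
    """Run the plan, letting the old app finish in-flight requests."""
    before, after = blue_green_plan(app, domain, hostname)
    for cmd in before:
        run(cmd, check=True)  # abort on the first failing command
    sleep(drain_seconds)      # old app is unrouted but still running
    for cmd in after:
        run(cmd, check=True)
```

With the manifest from this issue, a call might look like `deploy("myapp", "my.domain.com", "my-test-route", drain_seconds=30)`. Even this short script leaves failure handling open (for example, a failed `cf map-route` leaves a stray `myapp-new`), which is the kind of risk the bullet above refers to.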

Additional context In order to reproduce, create the following simple Python app:

# file hello.py
from flask import Flask
import os
import time

app = Flask(__name__)
port = int(os.getenv("PORT", 9099))

@app.route('/')
def hello():
  time.sleep(10)
  return "OK"

if __name__ == '__main__':
  app.run(host='0.0.0.0', port=port)

# file manifest.yml
applications:
- name: myapp
  memory: 128MB
  disk_quota: 256MB
  routes:
  - route: my-test-route.my.domain.com
  buildpack: python_buildpack
  command: python hello.py

# file requirements.txt
Flask
gunicorn

Push with `cf push --strategy rolling` and send requests to the app while the deployment is running. A few of those requests will fail with

502 Bad Gateway: Registered endpoint failed to handle the request.
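One way to send those requests and count the failures is a small probe script like the sketch below. The helper names `fetch_status` and `probe` are made up for illustration, and the URL scheme in the usage note is an assumption; only the hostname comes from the manifest above.

```python
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=60):
    """Return the HTTP status code of a GET request, or "error" if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses such as 502 are raised as HTTPError.
        return exc.code
    except (urllib.error.URLError, OSError):
        return "error"


def probe(url, count, fetch=fetch_status):
    """Send `count` sequential requests and tally the results by status code."""
    return Counter(fetch(url) for _ in range(count))
```

Running `probe("https://my-test-route.my.domain.com/", 30)` in one terminal while pushing in another would be expected to show mostly `200` entries plus some `502` entries when the problem occurs. Each request takes about 10 seconds because of the `time.sleep(10)` in the handler, so the requests are sent one after another.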

pippinhio • Apr 23 '21 15:04

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/177889263

The labels on this github issue will be updated when the story is started.

cf-gitbot • Apr 23 '21 15:04

+1 - We've got users hitting the same problem too. It would be great to have an extra parameter for how long to keep the old instance running after the new one has started.

jimconner • Jun 14 '21 10:06

We also have the same problem. Our only salvation up to now has been the plugin bluemixgaragelondon, which is no longer maintained (it does not support CF API v3), so we are forced to find an alternative as quickly as possible. For now we have an agreement that lets us continue using CF API v2 without rate limiting until August. After that we will be de facto limited to 10 deployments an hour, and API v2 may soon be removed completely. We are looking for migration strategies, but all the options look extremely grim. @jimconner do you also have the same problem? How are you reacting to it?

DanieleStrafile • Jun 13 '23 07:06