feat: Use Database availability as indicator for /healthz response
In one of our landscapes, we recently saw that the UAA /healthz endpoint responds with ok even if the UAA database is completely down (in which case none of the other requests to the UAA work).
This PR makes the /healthz response take DB availability into account. As requests to /healthz are time-critical and should not take over a second, the actual DB check is done in a scheduled background task that tries to establish a new DB connection and execute a statement. This task runs every 10 seconds by default (the interval is configurable), and the /healthz endpoint only reads the flag indicating whether the last connection attempt was successful.
When the DB connection works, the status is kept as it was before. When there has not been a DB check (usually only during startup), the status will still be 200, but the message will change until there has been a DB connection check. A failed DB connection will result in a 503 status code along with an error message, indicating that the DB connection check failed.
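As a rough illustration of the approach described above (a minimal sketch, not the code in this PR): the class name `DbHealthSketch`, its method names, and the message strings are hypothetical. The probe is passed in as a `BooleanSupplier`; in a real setup it would open a new DB connection and execute a statement.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; names and messages are assumptions, not UAA code. */
class DbHealthSketch {
    // null = no check has run yet, TRUE = last check succeeded, FALSE = last check failed
    private volatile Boolean lastCheckSucceeded = null;
    private final BooleanSupplier probe;

    DbHealthSketch(BooleanSupplier probe) {
        this.probe = probe;
    }

    /** Runs the probe once and records the outcome; an exception counts as a failure. */
    void runCheck() {
        boolean ok;
        try {
            ok = probe.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false;
        }
        lastCheckSucceeded = ok;
    }

    /** Schedules runCheck() with a fixed delay; the caller must shut the executor down. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(this::runCheck, 0, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }

    /** Status code for /healthz: only reads the flag, never touches the database. */
    int statusCode() {
        Boolean last = lastCheckSucceeded;
        return (last == null || last) ? 200 : 503;
    }

    /** Message for /healthz; the wording here is made up for illustration. */
    String message() {
        Boolean last = lastCheckSucceeded;
        if (last == null) {
            return "pending: no database check yet";
        }
        return last ? "ok" : "database connection check failed";
    }
}
```

The point of this shape is that the request path stays fast regardless of the database state: a slow or hanging connection attempt only delays the background task, while /healthz keeps answering from the last recorded result.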
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/187009282
The labels on this github issue will be updated when the story is started.
Looking better. Still curious about your thoughts on the long delay in response at startup.
When starting the UAA locally, I see the following behavior:
- when the cargoRunLocal step begins, there is no response yet for a few seconds (<5s, I suppose this is where the container is started)
- afterwards any request will wait for a response until the container becomes healthy (e.g. flyway finished work, all beans are created, all required spring magic is done)
- after about 30 seconds the request returns with a status; in my case it already reported that the DB connection was successful (so the DB check had already finished by the time my request reached the point where the result is read)
Is this the same behavior for you or do you see any other behavior here?
From my perspective, the above is the normal behavior. It has always been like this, and it applies not only to the healthz endpoint but also, for example, to accessing the login page during startup. In other words, this PR does not change it: startup simply takes some time during which requests are accepted but not yet processed. The 33s is also about what I would still consider normal.
@tack-sap My apologies, I didn't think to test the current behavior on develop, and I do see that this delay is already how the healthz endpoint works.