Branding: improve IO resilience
Closes GH-2808
This PR aims to improve the resiliency of the Branding manager cog.
Context
The Branding manager automatically discovers events once per day at midnight -- or when instructed to. During each such discovery, it makes multiple HTTP requests to the GitHub API. First, it fetches a list of all events in the Branding repository, and then it fetches those events one-by-one.
If the bot fails to fetch an event, it skips it. This behaviour made sense originally, as we didn't want the bot to stop the discovery iteration just because one event is badly configured. In practice, however, this is a non-issue, as events are validated in the Branding repository before they reach the main branch. This behaviour also affects server errors - if the bot receives a 503 when fetching an event, the event is simply excluded from the result set. This has an unfortunate consequence - if the bot fails to fetch the currently active event, it decides that the event has ended and reverts the branding to the evergreen (fallback) season.
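To make the old behaviour concrete, here is a simplified sketch of the skip-on-failure approach; `list_events` and `fetch_event` are illustrative names, not the repository module's actual API:

```python
import logging

log = logging.getLogger(__name__)

async def fetch_events(repository) -> list[dict]:
    """Old behaviour: an event that fails to fetch is silently dropped."""
    events = []
    # First request: list all events in the Branding repository.
    for meta in await repository.list_events():
        try:
            # One request per event; a 503 here was swallowed just like a bad config.
            events.append(await repository.fetch_event(meta))
        except Exception:
            log.warning("Failed to fetch event %r, skipping it.", meta)
    return events
```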
This PR adjusts the error handling methodology. Now, if any of the requests fails, the whole discovery iteration is aborted & retried the following day (or it can be retried manually via the sync command). The current state remains in the Redis cache, so the cog will keep working just fine. Additionally, to reduce the likelihood of a discovery iteration being aborted, each request gets a retry mechanism: a request that fails with a server error will be retried up to 5 times with exponential backoff.
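Conceptually, the discovery step now lets any failure propagate; again with illustrative names rather than the module's real API:

```python
async def fetch_events(repository) -> list[dict]:
    """New behaviour: any failed fetch raises and aborts the whole iteration."""
    # No per-event try/except: the exception propagates to the caller, the iteration
    # is abandoned, and the state cached in Redis keeps the cog working until the
    # next midnight run or a manual `sync`.
    return [await repository.fetch_event(meta) for meta in await repository.list_events()]
```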
Implementation
Adjusting the error handling was easy - I basically just removed the error handling clauses from the repository module. This means that errors occurring during an event fetch now propagate to the cog. The daemon already handles exceptions, so if a fetch fails, it will simply log the error and sleep until the next midnight.
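For illustration only, the daemon ends up shaped roughly like this; `perform_discovery` is a placeholder name rather than the cog's real method:

```python
import asyncio
import logging

log = logging.getLogger(__name__)

async def daemon_loop(cog) -> None:
    """Run discovery once per day; a failed iteration is logged and retried tomorrow."""
    while True:
        try:
            await cog.perform_discovery()  # may now raise if any GitHub request fails
        except Exception:
            log.exception("Event discovery failed; keeping the cached state until the next run.")
        # Simplified: sleep roughly a day. The real daemon sleeps until the next midnight.
        await asyncio.sleep(24 * 60 * 60)
```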
Another way to trigger event discovery is via the sync command. This also already handles errors - I just removed a redundant check for when no active event was found, since that case now results in an exception.
Finally, the calendar refresh command used to automatically show the calendar after running discovery. Because discovery can now fail (well, it always could, but the error used to be hidden), it now responds with an info embed instead.
The user can then check the calendar via the calendar command. I think this makes more sense, but I'm open to feedback.
Within the repository module, I added a custom exception for GitHub server errors and revamped the response status checks. I've added the MIT-licensed backoff library to avoid implementing the retry mechanism myself, but if we'd rather avoid the dependency, a custom implementation is also possible. The backoff decorator retries the decorated function when it raises the specified exception, up to the configured number of times.
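As a rough sketch of how the pieces fit together (only the `_raise_for_status` name comes from the actual module; the exception and function names, and the signatures, are illustrative):

```python
import aiohttp
import backoff

class GitHubServerError(Exception):
    """Raised when the GitHub API responds with a 5xx status (illustrative name)."""

def _raise_for_status(response: aiohttp.ClientResponse) -> None:
    """Turn bad statuses into exceptions; only 5xx maps to the retryable error."""
    if response.status >= 500:
        raise GitHubServerError(f"GitHub responded with {response.status}")
    response.raise_for_status()  # client errors such as 401 are not retried

@backoff.on_exception(backoff.expo, GitHubServerError, max_tries=5)
async def fetch_file(session: aiohttp.ClientSession, url: str) -> bytes:
    """Each call is retried up to 5 times, with exponential backoff, on server errors."""
    async with session.get(url) as response:
        _raise_for_status(response)
        return await response.read()
```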
If you would like to test the error handling, you can manually raise an error in _raise_for_status. You can force a 401 if you use a bad access token, but forcing a 500 isn't really possible, so manually raising in the code is probably the best option. Feel free to get in touch if you need help with testing.
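For example, temporarily short-circuiting the status check is enough to simulate a server error (reusing the illustrative GitHubServerError from the sketch above):

```python
def _raise_for_status(response) -> None:
    # Temporary, local-testing-only change: pretend GitHub returned a 5xx.
    raise GitHubServerError("Simulated server error")
```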
Can we maybe use tenacity instead?
The library is popular, resilient, and offers the same features & even more, which may become helpful one day.
Most importantly, it's still maintained & gets updates, unlike backoff.
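For comparison, a rough tenacity equivalent of the backoff-decorated request (reusing the illustrative names from the sketch above) could look like this:

```python
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(GitHubServerError),
    stop=stop_after_attempt(5),
    wait=wait_exponential(),
    reraise=True,  # surface the original exception once the retries are exhausted
)
async def fetch_file(session, url: str) -> bytes:
    async with session.get(url) as response:
        _raise_for_status(response)
        return await response.read()
```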
@shtlrs Thanks for the suggestion, I will happily switch to tenacity. Though I have a question about the licensing.
The tenacity lib is licensed under Apache 2.0 which should permit us to use it, but I think we may have to declare it in the third party license declaration file. Could you confirm whether that's the right approach? The file does not currently mention the Apache license.
I'm not an expert on licenses, but I don't think that should be necessary. The third party license file we have is currently just used for code we've copied directly into our project, rather than for dependencies (which I think work in a slightly different way). I believe we already have some Apache-licensed dependencies, and there are other projects, like discord.py, that are MIT-licensed but use Apache-licensed dependencies without including the Apache license in their repo.
@wookie184 Makes sense, thanks for the answer. I'll work on the suggested improvements.