bot icon indicating copy to clipboard operation
bot copied to clipboard

Add Consistency Checks

Open HassanAbouelela opened this issue 4 years ago • 4 comments

Every now and then, the help channel cog will desync (mostly due to unavoidable issues from the discord API). This desync takes many forms, most prominent being:

  • Decrease in number of help channels available (#1941)
  • Desync of dynamic help message in #how-to-get-help
  • ~~Pins not being removed after a channel is closed~~ Fixed
  • Cooldown roles not being removed after a session ends (#1940)

These happen often enough that I think we should spend some effort to get them to automatically figure themselves out. I've divided this issue into three separate sub-problems, one for (mostly) each of the points above.

Pins

This one is relatively straightforward. We should get all pins when a channel is moved into dormant, and unpin them. I've checked the rate limit for that endpoint, and it appears to be 1 request / 5s / channel, so I think we're fine.

Cooldown roles

We already keep a list of which authors currently have a channel open, and I believe that does not desync in line with the cooldown roles (though it can be difficult to test). I think that means we can compare the list of people with the role, against the list of channel claimants, and remove any outliers. We should probably run this in some sort of periodic task, possibly every hour? That implementation detail can be left up to the author.

Channels & Channel Names

For the two desync patterns not mentioned yet, a simple cog reload seems to mostly do the trick. It's the go-to method for staff whenever we see a desync, and I estimate we run it bi-weekly right now. It only takes a few seconds (5 according to Akarys), so we should be fine to periodically just reload. Perhaps every hour as well.

@Akarys42 has brought up concerns about redis race conditions, we can discuss those here.

Closing Remarks

None of the mentioned fixes depend on each other, so we tackle each in separate PRs asynchronously.

I'm going to say that if no one has concerns within the next couple days, I'll be approving this issue for the cooldown role and pin fixes. I'll hold off on the last one till we figure out the race issue, but I'll update the issue once it's approved.

HassanAbouelela avatar Oct 21 '21 23:10 HassanAbouelela

I think reloading a cog is a non trivial action that shouldn't be running 48 times a day. It does make the cog unresponsive for 5s and do a bunch of API calls. I am also scared about race conditions where we would write a value to Redis, reload and read it too fast, but I don't know how possible that is.

My opinion would be to do implement the three checks mentioned above as a background task, that runs every once in a while. As for the channel placement itself, I wouldn't mind having the cog reload itself if it do be in a broken state, although we would also have to consider whenever the check is running while the cog is moving a channel. Maybe we can use an asyncio.shield around the block of code that makes a channel available to ensure we don't run the check right in the middle?

Akarys42 avatar Oct 22 '21 04:10 Akarys42

I've done some more experimenting with the cog, and it seems the only thing that takes any time is the async_init task, which makes sense. During that time, the only thing affected is the close command, which gracefully errors out (albeit without explaining anything to the user). Claiming and all that fun non-sense is queued, so it's just a small delay for the user. The only concern would be with the race conditions. I've poked around a bit, and the chances of things going wrong is fairly low, especially since we do unschedule everything when unloading, so we can very easily catch exceptions in any ongoing tasks if they are a problem.

An alternative approach we could take to this is to reload when we detect an incorrect number of help channels. That should not be any more dangerous than what we currently have, because it's literally what we do on a regular basis.

HassanAbouelela avatar Oct 24 '21 17:10 HassanAbouelela

I'm also approving the other two fixes now.

HassanAbouelela avatar Oct 24 '21 17:10 HassanAbouelela

The pins issue is solved by #1909 and #1936.

dementati avatar Nov 07 '21 06:11 dementati