
WIP: How do we implement resilient feature flags

Open neilkakkar opened this issue 3 years ago • 5 comments

The what & why: https://github.com/PostHog/meta/pull/74

The how: this issue. Going deep into exactly how we'll implement this and reduce the risk of everything blowing up. This touches a very sensitive code path, so making sure there's zero downtime is important.

This issue seeks to clarify for everyone how we'll get there (and for me to think through how to do it).

Broadly, the things we need to do are:

  1. [ ] Introduce caching on decide, for 2 things: (1) Project token to team. (2) teamID to feature flag definitions.

  2. [ ] Figure out how to update caches & when to invalidate them. Open question: how do we ensure the caches are always populated? It's going to be a problem if the cache isn't populated when Postgres goes down, since then we go down with it.

  3. [ ] Figure out the code paths: do we always default to cache first, or ~~keep the cache just as a backup~~? Depends partially on the above & the guarantees we have on the cache.

  4. [x] Figure out the semantics of 'best-effort flag calculation': given that Postgres is down, which flags can we still calculate, and how will this work?

  5. [ ] Update client libraries to use the new decide response: update only the flags sent by decide and keep the old ones as-is, unless there were no errors during computation, in which case replace all flags (see the sketch after this list).

  6. [ ] .. And don't obliterate flags on decide 500 responses.
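
As a rough illustration of (5) and (6), a client library could merge the new decide response into its stored flags along these lines. This is a minimal Python sketch; the field names `featureFlags` and `errorsWhileComputingFlags` and the helper name are assumptions about the new response shape, not the final API:

```python
from typing import Optional


def merge_flags(stored_flags: dict, response: Optional[dict], status_code: int) -> dict:
    """Merge a /decide response into the flags the SDK already has stored."""
    # (6) A 5xx or missing body must not obliterate the flags we already have.
    if response is None or status_code >= 500:
        return stored_flags

    new_flags = response.get("featureFlags", {})

    # If the server computed every flag without errors, it's safe to replace the
    # whole local set (this is also what removes deleted flags client-side).
    if not response.get("errorsWhileComputingFlags", False):
        return new_flags

    # Otherwise only update the flags the server did manage to compute, and keep
    # the previously known values for everything else.
    merged = dict(stored_flags)
    merged.update(new_flags)
    return merged
```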


To reduce risk, it makes sense to break the server changes down into discrete, independent parts:

  1. [ ] project API key -> teamID caching
  2. [ ] teamID -> flag definitions caching
  3. [ ] best-effort flag evaluation (sketched below)
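
To make the 'best-effort' semantics concrete, here's a rough sketch of evaluation from cached definitions alone: conditions gated only on a rollout percentage can be resolved with the usual deterministic hash, while conditions that need person properties or cohorts (i.e. Postgres) are skipped and reported as errors, so clients keep their previous values. The filter shape and function names here are illustrative, not the actual decide code:

```python
import hashlib

LONG_SCALE = float(0xFFFFFFFFFFFFFFF)


def _hash(key: str, distinct_id: str) -> float:
    """Deterministically map (flag key, distinct_id) onto [0, 1)."""
    digest = hashlib.sha1(f"{key}.{distinct_id}".encode()).hexdigest()
    return int(digest[:15], 16) / LONG_SCALE


def _rollout(group: dict) -> float:
    pct = group.get("rollout_percentage")
    return 100.0 if pct is None else float(pct)


def best_effort_evaluate(cached_flags: list, distinct_id: str):
    """Evaluate what we can from cached definitions alone; report whether anything was skipped."""
    flags, errors = {}, False
    for flag in cached_flags:
        groups = flag.get("filters", {}).get("groups") or [{}]
        simple = [g for g in groups if not g.get("properties")]
        needs_db = len(simple) < len(groups)
        if any(_hash(flag["key"], distinct_id) <= _rollout(g) / 100 for g in simple):
            flags[flag["key"]] = True
        elif needs_db:
            # A property/cohort condition might still match, but checking it needs
            # Postgres. Leave the flag out so clients keep their previously known value.
            errors = True
        else:
            flags[flag["key"]] = False
    return flags, errors
```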

neilkakkar avatar Jan 09 '23 13:01 neilkakkar

re: caching, I'm thinking of introducing post-save/update hooks, which ensure that when a flag is updated, the cache is updated as well.

These can sometimes fail, which leads to the cache being out of date, which isn't great. We could introduce a TTL for this, but then the new problem becomes: if the TTL is too low (say, 5 minutes), the DB going down means we go down anyway, as the cached information expires and is lost. If it's too high, chances are things will stay stale for longer 🤔.

It's probably better to do this in the update/create flow itself, guaranteeing that the request succeeds only when the cache is updated too. Yep, this seems better. We can have longer TTLs with this as well.
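
A minimal sketch of what that could look like, assuming a Django-style cache and a hypothetical key format (not the actual PostHog code):

```python
from django.core.cache import cache

FLAG_CACHE_KEY = "team_feature_flags:{team_id}"  # hypothetical key format


def update_flag_cache(team_id, flag_definitions):
    # No TTL (or a very long one): this cache is what we fall back on when
    # Postgres is unavailable, so we don't want it expiring underneath us.
    cache.set(FLAG_CACHE_KEY.format(team_id=team_id), flag_definitions, timeout=None)


def save_flag(flag):
    flag.save()
    # Refresh the cache before returning success: if this raises, the request
    # fails and the client retries, so a 2xx implies the cache is up to date.
    definitions = list(type(flag).objects.filter(team_id=flag.team_id).values())
    update_flag_cache(flag.team_id, definitions)
```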

Making some constraints explicit:

  1. We can't really go for a cache-aside strategy, where we read from the cache and fall back to the DB on a miss (well, not by default, and not all the time at least), since the point is to defend against the DB going down sporadically (too many connections, etc.).
  2. Given the above, we necessarily want to populate the cache on startup.
  3. We can possibly subvert this by relaxing the constraint above & treating this like a regular cache. But that is not a worthy trade-off imo, as it destroys the guarantee we were looking for in the first place.

Do we need TTLs at all then? Not really, since there's no big risk of things going out of sync.
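
Putting the two cached mappings from the first list together, the read path for /decide under these constraints could look roughly like this (assuming the Team and FeatureFlag models and hypothetical cache key formats; the DB fallback only fires for keys the warm-up has never seen, e.g. a just-created team):

```python
from django.core.cache import cache

from posthog.models import FeatureFlag, Team  # assuming these models

TOKEN_KEY = "team_id_for_token:{token}"       # hypothetical key formats
FLAGS_KEY = "team_feature_flags:{team_id}"


def warm_caches():
    """Populate both caches on startup (constraint 2)."""
    for team_id, api_token in Team.objects.values_list("id", "api_token"):
        cache.set(TOKEN_KEY.format(token=api_token), team_id, timeout=None)
        definitions = list(FeatureFlag.objects.filter(team_id=team_id).values())
        cache.set(FLAGS_KEY.format(team_id=team_id), definitions, timeout=None)


def flags_for_token(token):
    """Read path for /decide: cache first, Postgres only as a last resort."""
    team_id = cache.get(TOKEN_KEY.format(token=token))
    if team_id is None:
        # A miss should only happen for tokens never seen before, so hitting the
        # DB here stays the exception rather than the default (constraint 1).
        team_id = Team.objects.only("id").get(api_token=token).id
        cache.set(TOKEN_KEY.format(token=token), team_id, timeout=None)

    definitions = cache.get(FLAGS_KEY.format(team_id=team_id))
    if definitions is None:
        definitions = list(FeatureFlag.objects.filter(team_id=team_id).values())
        cache.set(FLAGS_KEY.format(team_id=team_id), definitions, timeout=None)
    return definitions
```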

neilkakkar avatar Jan 09 '23 14:01 neilkakkar

regarding size limits, the current feature flag table (which will effectively be cached) is less than 3 MB in size:

SELECT pg_size_pretty(pg_total_relation_size('posthog_featureflag'))

so, we're good here size-wise for a long time.

project token to teamID is even smaller, at O(number of teams)

neilkakkar avatar Jan 09 '23 14:01 neilkakkar

Have we considered caching outside of our main app deployment? I.e., in the case that our app is totally down (LB failure/misconfig, dodgy deploy, uncaught logic problem, or otherwise), customers can still resolve flags for the last state they were in.

There's definitely a bunch we could do within AWS that could make this incredibly resilient.

ellie avatar Jan 13 '23 13:01 ellie

Great idea! Haven't yet, but I expect switching Redis servers to be fairly plug-and-play once the basic code is in place (correct me if I'm wrong!).

At that stage, I'd love some support from infra to make this more robust.

neilkakkar avatar Jan 13 '23 13:01 neilkakkar

Ah, wait, no: if the entire app deployment is down, the /decide API endpoint is down too, so the above doesn't help 🤔.

Isn't this then effectively a second app deployment? Since we can't/don't want to cache responses, only the flag definitions.

neilkakkar avatar Jan 13 '23 13:01 neilkakkar

Do you think updating the Flutter SDK to support normal flags and the new decide response could be part of these tasks? There is an open issue about it: #12222

ayr-ton avatar Feb 24 '23 21:02 ayr-ton

A PR is already out for that, should be going in soon 👀

neilkakkar avatar Feb 24 '23 22:02 neilkakkar