Optimize Shard.list and Shard.get_by_key
Previously, list and get_by_key had to go through the GenServer to obtain the values ETS table and the replicas information. If the GenServer was busy processing an update (e.g. heartbeat, track, untrack), list and get_by_key calls were blocked until it finished. We saw this behaviour in our cluster, where simple list/get_by_key calls were sometimes taking over a few hundred milliseconds.
Storing the replicas information in an ETS table lets us skip the GenServer round trip and serve list/get_by_key immediately.
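To make the bottleneck concrete, here is a minimal sketch of the old read path. This is not the actual Phoenix.Tracker.Shard code: the module name, the :read_state message, and the {topic, presence, replica} record shape are all simplified assumptions. The point is only that every read is a synchronous call into the shard process:

```elixir
defmodule ShardBefore do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    # Simplified state: a values ETS table plus the list of down replicas.
    values = :ets.new(:values, [:bag, :protected, read_concurrency: true])
    {:ok, %{values: values, down_replicas: []}}
  end

  # Read path: every caller does a synchronous call into the shard process,
  # so the read queues behind any in-flight heartbeat/track/untrack work.
  def list(shard, topic) do
    {values_table, down_replicas} = GenServer.call(shard, :read_state)

    values_table
    |> :ets.lookup(topic)
    |> Enum.reject(fn {_topic, _presence, replica} -> replica in down_replicas end)
  end

  @impl true
  def handle_call(:read_state, _from, state) do
    {:reply, {state.values, state.down_replicas}, state}
  end
end
```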
I removed the dirty_list function, which was not public/exposed and which tried to solve the same issue. dirty_list was called dirty because it didn't check down_replicas. This solution does check down_replicas and doesn't change the API.
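And here is a minimal sketch of the direct-ETS read path, again with hypothetical table/function names and the same simplified record shape. The key point is that list reads both the values and the down-replicas information straight from ETS, so it no longer touches the shard process but still filters out entries owned by down replicas:

```elixir
defmodule ShardAfter do
  @values_table :shard_values
  @replicas_table :shard_replicas

  # Called once by the shard process on startup; the shard remains the only
  # writer and keeps :down_replicas current as replicas come and go.
  def init_tables do
    :ets.new(@values_table, [:bag, :named_table, :protected, read_concurrency: true])
    :ets.new(@replicas_table, [:set, :named_table, :protected, read_concurrency: true])
    :ets.insert(@replicas_table, {:down_replicas, []})
    :ok
  end

  # Read path: plain ETS lookups, no GenServer.call, so a busy shard
  # (heartbeat/track/untrack) no longer delays readers, and down replicas
  # are still filtered out (unlike the old dirty_list).
  def list(topic) do
    down = down_replicas()

    @values_table
    |> :ets.lookup(topic)
    |> Enum.reject(fn {_topic, _presence, replica} -> replica in down end)
  end

  defp down_replicas do
    case :ets.lookup(@replicas_table, :down_replicas) do
      [{:down_replicas, down}] -> down
      [] -> []
    end
  end
end
```

Since the shard process stays the single writer, readers can hit the tables concurrently without coordination; the only extra work on the write side is keeping the replica information in ETS whenever it changes.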
Update 2019/12/06: We've fully rolled this out to production (50K+ concurrent connections). We also saw a ~30% drop in CPU usage, which I did not expect at all, but is very welcome.
Update 2020/01/03: We've hit 70K+ concurrent connections. Everything still looking good.
Update 2021/06/13: Over 200K concurrent connections with this.
This should also resolve #124
I tried this out in production (patched v1.1.2) and forwarded ~15% of traffic to the patched instance. The endpoint in question does 1-40 get_by_key calls depending on the input. Here are the results:
The first two graphs are from the v1.1.2 instances and the last one is from the patched instance.
It looks faster, I'd say 20-50% faster. I was actually expecting a bigger improvement, but it's still better than before.
@chrismccord any chance of getting this reviewed / merged as well?
Any updates on this?