Optimize Shard.list and Shard.get_by_key
Previously, list and get_by_key had to go through the GenServer to obtain the values ETS table and the replicas information. If the GenServer was busy processing an update (e.g. heartbeat, track, untrack), list and get_by_key calls were blocked until it finished. We saw this behaviour in our cluster, where simple list/get_by_key calls were sometimes taking over a few hundred milliseconds.
Storing the replicas information in an ETS table lets us skip the GenServer round trip and serve list/get_by_key immediately.
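To make the bottleneck concrete, here is a minimal sketch of the old read path. This is not the actual Phoenix.Tracker.Shard code: the module name, the :read_state message, and the {topic, presence, replica} record shape are all simplified assumptions. The point is only that every read is a synchronous call into the shard process:

```elixir
defmodule ShardBefore do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    # Simplified state: a values ETS table plus the list of down replicas.
    values = :ets.new(:values, [:bag, :protected, read_concurrency: true])
    {:ok, %{values: values, down_replicas: []}}
  end

  # Read path: every caller does a synchronous call into the shard process,
  # so the read queues behind any in-flight heartbeat/track/untrack work.
  def list(shard, topic) do
    {values_table, down_replicas} = GenServer.call(shard, :read_state)

    values_table
    |> :ets.lookup(topic)
    |> Enum.reject(fn {_topic, _presence, replica} -> replica in down_replicas end)
  end

  @impl true
  def handle_call(:read_state, _from, state) do
    {:reply, {state.values, state.down_replicas}, state}
  end
end
```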
I removed the dirty_list function, which was not public/exposed and which tried to solve the same issue. dirty_list was called dirty because it didn't check down_replicas. This solution does check down_replicas and doesn't change the API.
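And here is a minimal sketch of the direct-ETS read path, again with hypothetical table/function names and the same simplified record shape. The key point is that list reads both the values and the down-replicas information straight from ETS, so it no longer touches the shard process but still filters out entries owned by down replicas:

```elixir
defmodule ShardAfter do
  @values_table :shard_values
  @replicas_table :shard_replicas

  # Called once by the shard process on startup; the shard remains the only
  # writer and keeps :down_replicas current as replicas come and go.
  def init_tables do
    :ets.new(@values_table, [:bag, :named_table, :protected, read_concurrency: true])
    :ets.new(@replicas_table, [:set, :named_table, :protected, read_concurrency: true])
    :ets.insert(@replicas_table, {:down_replicas, []})
    :ok
  end

  # Read path: plain ETS lookups, no GenServer.call, so a busy shard
  # (heartbeat/track/untrack) no longer delays readers, and down replicas
  # are still filtered out (unlike the old dirty_list).
  def list(topic) do
    down = down_replicas()

    @values_table
    |> :ets.lookup(topic)
    |> Enum.reject(fn {_topic, _presence, replica} -> replica in down end)
  end

  defp down_replicas do
    case :ets.lookup(@replicas_table, :down_replicas) do
      [{:down_replicas, down}] -> down
      [] -> []
    end
  end
end
```

Since the shard process stays the single writer, readers can hit the tables concurrently without coordination; the only extra work on the write side is keeping the replica information in ETS whenever it changes.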
Update 2019/12/06: We've fully rolled this out to production (50K+ concurrent connections). We also saw a ~30% drop in CPU usage, which I did not expect at all, but is very welcome.
Update 2020/01/03: We've hit 70K+ concurrent connections. Everything still looking good.
Update 2021/06/13: Over 200K concurrent connections with this.
This should also resolve #124
I tried this out in production (patched v1.1.2) and forwarded ~15% of traffic to the patched instance. The endpoint in question does 1-40 get_by_key calls depending on the input. Here are the results:
The first two graphs are from the v1.1.2 instances and the last one is from the patched instance.
It looks faster, I'd say 20-50% faster. I was actually expecting a bigger improvement, but it's still better than before.
@chrismccord any chance of getting this reviewed / merged as well?
Any updates on this?