
fix getting updated user namespaces

Open alichaddad opened this issue 3 years ago • 4 comments

Closes

What changed? When retrieving a user's namespaces, we now take a lock around the whole flow of reading the namespaces from the cache and updating it when necessary. We also only copy a cluster's namespaces into a user's namespaces when that cluster is actually present in the cluster namespaces cache.

Why was this change made? This change fixes an issue where requests for resources like Kustomizations or Helm releases would sometimes return empty results. The issue happens when one request checks the user namespaces in the cache while a previous request is still building the cache, resulting in a response with partial or empty results. This becomes more apparent when a cluster has an issue (it is unreachable, or has no namespaces for whatever reason), making empty responses a common occurrence. This is due to how the implementation was written:

  • The user namespaces are calculated from the namespaces in a cluster, filtered based on user permissions, and stored in a cache with a TTL of 30s
  • When a request is made for user namespaces, it is served from the cache; if the cache is empty, which happens periodically, the request attempts to update the cache and then return the data
  • If another request arrives during that rebuild, it only checks whether the cache is empty or not, which means that even if a single key has been added it will read the current user namespaces, resulting in partial results. Subsequent requests return correct responses once the cache is properly built again
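
The flaw in the last step can be sketched as follows (a minimal illustration; the function and map shape are hypothetical, not the actual cache types):

```go
package main

import "fmt"

// cache maps cluster name -> namespaces for one user. A rebuild fills it
// one cluster at a time; the old behavior treated "len(cache) > 0" as "ready".
func namespacesReady(cache map[string][]string) bool {
	return len(cache) > 0 // flawed: true as soon as a single key is added
}

func main() {
	cache := map[string][]string{}
	fmt.Println(namespacesReady(cache)) // empty: a rebuild is triggered

	// Rebuild in progress: only the first of two clusters has been added,
	// yet a concurrent request would already see the cache as "ready"
	// and serve these partial results.
	cache["cluster-a"] = []string{"default", "flux-system"}
	fmt.Println(namespacesReady(cache))
}
```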

The proposed fix is to lock each user request by the user ID. If a request comes in from the same user while another is in flight, it will only check the status of the cache after the in-flight request has finished. Previously, if a cluster had issues retrieving namespaces, or the cluster itself was unreachable, it would not be set in the cluster namespaces cache but would still be set in the user namespaces cache with an empty namespaces list. The PR changes this behavior and only adds a cluster's namespaces to the user cache when the cluster already exists in the cluster namespaces cache.
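
A per-user lock map along these lines could look like the sketch below (names are illustrative, not the actual weave-gitops code; note that, as mentioned later, entries are never removed):

```go
package main

import (
	"fmt"
	"sync"
)

// userLocks serializes namespace rebuilds per user ID.
type userLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newUserLocks() *userLocks {
	return &userLocks{locks: map[string]*sync.Mutex{}}
}

// forUser returns the mutex for a given user ID, creating it on first use.
// The map only grows; cleanup is left as future work, matching the PR.
func (u *userLocks) forUser(id string) *sync.Mutex {
	u.mu.Lock()
	defer u.mu.Unlock()
	if _, ok := u.locks[id]; !ok {
		u.locks[id] = &sync.Mutex{}
	}
	return u.locks[id]
}

func main() {
	ul := newUserLocks()
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// All goroutines for the same user serialize on one mutex,
			// so the rebuild (modelled here as counter++) never overlaps.
			m := ul.forUser("user-1")
			m.Lock()
			counter++
			m.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(counter)
}
```

A second request from the same user blocks on `forUser(id).Lock()` until the first finishes, and only then re-checks the cache, which by that point is fully built.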

How was this change implemented? By adding a lock when retrieving the user namespaces. Since there is one lock per user ID, we store the locks in a map. Currently this map is not cleaned up, which might be something we need to handle down the line.

How did you validate the change? By adding an invalid cluster to my instance, which consistently caused certain APIs to return empty responses before my changes, and then rerunning the same scenario after my changes.

Release notes

Documentation Changes

alichaddad avatar Sep 22 '22 07:09 alichaddad

Thank you for the excellent explanation of the problem!

  • This puts the UsersNamespaces lock outside of it in the factory itself, while the ClustersNamespaces are still inside the ClustersNamespaces struct. Is there a reason to not make them be the same?
  • This - just like the UsersNamespaces ttl cache itself I've now spotted, so perhaps a separate, bigger issue - uses the user.ID as a unique key. However, there are various auth forms that leave user.ID empty, so in those cases the cache/lock will not be able to distinguish between different users.

ozamosi avatar Sep 22 '22 10:09 ozamosi

Thanks for taking a look!

  • This puts the UsersNamespaces lock outside of it in the factory itself, while the ClustersNamespaces are still inside the ClustersNamespaces struct. Is there a reason to not make them be the same?

Good point, we might have to rethink the locking methodology in this part; it seems there might be too much of it. The case of clustersNamespaces is a bit different: it gets refreshed periodically and consistently, without an external trigger, so the need for an extra lock is not the same, since the cache getting completely cleared out is rather rare. For userNamespaces this happens constantly, and the rebuild is triggered by a user request; since that entails a lot of external actions that might overlap with another request, the lock is needed. Does that make sense?

  • This - just like the UsersNamespaces ttl cache itself I've now spotted, so perhaps a separate, bigger issue - uses the user.ID as a unique key. However, there are various auth forms that leave user.ID empty, so in those cases the cache/lock will not be able to distinguish between different users.

This would indeed be a problem. Is there another unique identifier that we can use for users?

alichaddad avatar Sep 22 '22 12:09 alichaddad

Is there another unique identifier that we can use for users?

There might be a .toString() kind of thing on the Principal. But it might strip out the token, which would make it not great for this. It can also be potentially sensitive, so we don't want it in logs etc.; maybe we add a .Hash() to the Principal instead?
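
A .Hash() along those lines could be sketched like this (entirely hypothetical; the real Principal fields and any token handling would need to be decided in the actual PR):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// principalHash derives a stable, non-sensitive cache/lock key from the
// user's identity fields, instead of relying on a possibly empty user.ID
// or logging a raw token. The chosen fields here are an assumption.
func principalHash(id string, groups []string) string {
	sorted := append([]string(nil), groups...)
	sort.Strings(sorted) // group order must not change the key
	h := sha256.Sum256([]byte(id + "|" + strings.Join(sorted, ",")))
	return fmt.Sprintf("%x", h[:8]) // short hex digest, safe to log
}

func main() {
	a := principalHash("", []string{"dev", "ops"})
	b := principalHash("", []string{"ops", "dev"})
	c := principalHash("", []string{"admins"})
	fmt.Println(a == b) // same identity, same key
	fmt.Println(a == c) // different groups, different key
}
```

With an empty ID, users are still distinguished as long as some other identity field (here, groups) differs; a real implementation would have to pick fields that are unique per user.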

foot avatar Sep 22 '22 15:09 foot

@alichaddad @foot Is this still in progress?

lasomethingsomething avatar Oct 30 '23 11:10 lasomethingsomething