No caching of Azure client instances leads to constant overhead of configuring/connecting
So I'm noticing something that looks like an anti-pattern: Azure client instances are never cached, and are instead re-materialized, all the way from configuration on up, every single time one is needed.
It starts with the AzureIAuthJanitorProviderExtensions::GetAzure method. This method uses the fluent Azure SDK to construct the base IAzure instance, which involves network calls to authenticate, retrieve subscriptions, and ultimately "bind" to the target Azure subscription. It also allocates a new RestClient instance which, if you dive deep enough into it, you'll find actually allocates a new HttpClientHandler each time. Allocating a new handler per client is the root cause of the well-known HttpClient anti-pattern, which can (and under sustained load, will) result in TCP socket exhaustion.
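To make that concrete, here's a rough approximation of the work involved. The signature and names below are simplified stand-ins for illustration, not the actual extension method:

```csharp
using Microsoft.Azure.Management.Fluent;
using Microsoft.Azure.Management.ResourceManager.Fluent.Authentication;

// Rough approximation of what GetAzure does on *every* invocation
// (simplified sketch; see AzureIAuthJanitorProviderExtensions for the real code):
public static IAzure GetAzure(AzureCredentials credentials, string subscriptionId) =>
    Azure.Configure()
        .Authenticate(credentials)          // network round-trip(s) to Azure AD
        .WithSubscription(subscriptionId);  // binds to the subscription; allocates a new
                                            // RestClient, and with it a new HttpClientHandler
```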
By itself, the current design of this method is probably as good as it can be, but it puts the onus on the caller (in this case, the various IAuthJanitorProvider implementations) to know they need to cache the resulting client instance... spoiler alert, though: none of them do. 😞
Now that we understand the implementation and its costs, let's dig into why it becomes a problem. If you look at the various IAuthJanitorProvider implementations, you'll find they simply call the extension method via this.GetAzure() whenever they want to do anything against their respective Azure resources. For example, here's one such usage from the AppSettingsWebAppApplicationLifecycleProvider class that illustrates one of the bigger problems with the current design:
https://github.com/microsoft/AuthJanitor/blob/eb2c5e614487a228555a5f832509d04aae627b23/AuthJanitor.Providers.AppServices/WebApps/AppSettingsWebAppApplicationLifecycleProvider.cs#L62-L86
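To make the call chain easier to follow, here's a paraphrased, simplified sketch of the helpers involved. This is an assumed shape for illustration only (ResourceGroupName and ResourceName are placeholders); see the linked source for the real code:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Management.AppService.Fluent;

// Paraphrased, simplified sketch (an assumed shape, not the actual repo code).
// The point: every hop bottoms out in GetAzure(), so every hop re-authenticates
// and allocates a fresh RestClient + HttpClientHandler.
private async Task<IWebApp> GetWebApp() =>
    await (await this.GetAzure())              // full auth + client build, every call
        .WebApps
        .GetByResourceGroupAsync(ResourceGroupName, ResourceName);

private async Task<IDeploymentSlot> GetDeploymentSlot(string slotName) =>
    await (await GetWebApp())                  // another GetAzure() underneath
        .DeploymentSlots
        .GetByNameAsync(slotName);
```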
In the linked method you first see a call to GetDeploymentSlot; that calls GetWebApp, which calls GetAzure(). Later the method calls GetWebApp directly, which calls GetAzure() again. So within that one method call you've gone through the full client creation and configuration twice, and leaked the underlying TCP connections twice. And this is just one highly localized example of where the problem surfaces. Here's another, from WebApplicationLifecycleProvider::Test:
https://github.com/microsoft/AuthJanitor/blob/eb2c5e614487a228555a5f832509d04aae627b23/AuthJanitor.Providers.AppServices/WebApps/WebAppApplicationLifecycleProvider.cs#L13-L23
This time we see GetDeploymentSlot called three distinct times within this one method, which means three more full trips through GetAzure().
But wait, it gets worse. 😫 In AuthJanitor's rekeying workflow, every provider is retrieved and driven through the following method call sequence:
Test -> GetSecretToUseDuringRekeying -> BeforeRekeying -> Rekey -> CommitNewSecrets -> AfterRekeying -> OnConsumingApplicationSwapped
Each one of those methods, if you dig into the provider implementations, usually makes one or more calls to GetAzure(). Extrapolate out to N providers calling GetAzure() M times per workflow, every T units of time (currently 2 minutes), and you can see how this becomes problematic pretty quickly.
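To put purely illustrative numbers on that: 10 providers × 3 GetAzure() calls per workflow × 30 workflow executions per hour (one every 2 minutes) works out to 900 freshly built client stacks, and 900 abandoned HttpClientHandler instances, per hour.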
There are several ways to solve this, but I would personally want to think about it more before recommending a solution. I just wanted to surface this problem so that we know it exists and can begin discussing it.
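Purely to seed that discussion (explicitly not a recommendation), one possible direction would be memoizing the IAzure instance per credential/subscription key, something along the lines of:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Microsoft.Azure.Management.Fluent;

// One *possible* direction, sketched only to seed discussion: memoize IAzure
// instances so the expensive authenticate/bind/RestClient work happens once
// per key instead of once per call.
public static class AzureClientCache
{
    private static readonly ConcurrentDictionary<string, Task<IAzure>> _clients =
        new ConcurrentDictionary<string, Task<IAzure>>();

    // 'factory' would be the existing GetAzure logic; the right cache key shape
    // (tenant? token identity? subscription?) is an open design question.
    public static Task<IAzure> GetOrCreate(string key, Func<Task<IAzure>> factory) =>
        _clients.GetOrAdd(key, _ => factory());
}
```

A real fix would also have to deal with token expiry, faulted cached tasks, and what the cache key should actually be, which is exactly why this deserves more thought before we commit to an approach.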