[Feature request]: Automate management group scope deployment clean up
Description
The deployment quota is often exceeded at management group scope.
Management group scope deployments fail when the deployment count exceeds 800.
To overcome this, deployments that are no longer needed should be deleted from the history (ref. Management group limits). This is currently being done manually.
The failure frequency increased after enabling the telemetry settings, which doubled the number of deployments per pipeline run.
Error log example: https://github.com/Azure/ResourceModules/runs/6043758601?check_suite_focus=true#step:4:442
I guess we could debate whether that is a bug, or an enhancement. But it's in any case a pain ;)
@eriqua, to avoid running into API throttling limits, I suggest not using the default PS cmdlet (Remove-AzResourceGroupDeployment) or the default REST API call, but instead the bulk delete used by the Azure portal (with a maximum of 100 items at a time):
Request URL: https://management.azure.com/batch?api-version=2020-06-01
Request Method: POST
Header: Content-Type: application/json
Payload:
{
  "requests": [
    {
      "content": {
        "EntityIds": [
          "/providers/Microsoft.Management/managementGroups/<managementGroupId>/providers/Microsoft.Resources/deployments/pid-47ed15a6-730a-4827-bcb4-0fd963ffbd82-l6ql6gmfsjros",
          "/providers/Microsoft.Management/managementGroups/<managementGroupId>/providers/Microsoft.Resources/deployments/roleAssignments-20220419T1104096929Z",
          "/providers/Microsoft.Management/managementGroups/<managementGroupId>/providers/Microsoft.Resources/deployments/roleDefinitions-20220419T1104041183Z",
          "/providers/Microsoft.Management/managementGroups/<managementGroupId>/providers/Microsoft.Resources/deployments/pid-47ed15a6-730a-4827-bcb4-0fd963ffbd82-wpnvthqx5ddns",
          "/providers/Microsoft.Management/managementGroups/<managementGroupId>/providers/Microsoft.Resources/deployments/wpnvthqx5ddns-RoleDefinition-MG-Module",
          "/providers/Microsoft.Management/managementGroups/<managementGroupId>/providers/Microsoft.Resources/deployments/pid-47ed15a6-730a-4827-bcb4-0fd963ffbd82-iaxflwtdbhwgk"
        ],
        "Type": "Default"
      },
      "httpMethod": "POST",
      "name": "99b3bff9-bcd6-4ebc-9546-be0051a13018",
      "requestHeaderDetails": {
        "commandName": "HubsExtension.armbulkdelete"
      },
      "url": "/bulkdelete?api-version=2014-04-01-preview"
    }
  ]
}
To implement the bucket size we can use the following little snippet:
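function Split-Array {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)]
        [string[]] $InputArray,

        # A size below 1 would never advance the loop, so reject it up front.
        [Parameter(Mandatory = $false)]
        [ValidateRange(1, [int]::MaxValue)]
        [int] $SplitSize = 100
    )

    if ($SplitSize -ge $InputArray.Count) {
        # Everything fits into one chunk. The unary comma keeps the array intact
        # on output, so pipeline consumers receive one chunk rather than its items.
        return , $InputArray
    }
    else {
        $res = @()
        for ($Index = 0; $Index -lt $InputArray.Count; $Index += $SplitSize) {
            # Slice the next chunk; indices past the end of the array are ignored.
            $res += , ( $InputArray[$Index..($Index + $SplitSize - 1)] )
        }
        return $res
    }
}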
You can use it like this:
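A minimal sketch of one possible usage, assuming a hypothetical helper `New-BulkDeleteBody` (the name and parameters are illustrative and not part of the original comment). It slices the deployment IDs into chunks of at most 100, the same way `Split-Array` does, and builds one JSON body per chunk in the shape of the payload above. Sending each body to the batch endpoint, including authentication, is intentionally left out.

```powershell
# Hypothetical helper (assumption): emits one bulk-delete JSON body per chunk of IDs.
function New-BulkDeleteBody {
    [CmdletBinding()]
    param(
        # Full resource IDs of the deployments to remove.
        [Parameter(Mandatory)]
        [string[]] $DeploymentIds,

        # Maximum number of IDs per request (100 is the limit mentioned above).
        [ValidateRange(1, 100)]
        [int] $ChunkSize = 100
    )

    for ($i = 0; $i -lt $DeploymentIds.Count; $i += $ChunkSize) {
        # Clamp the upper index so the last chunk may be smaller than $ChunkSize.
        $last = [Math]::Min($i + $ChunkSize, $DeploymentIds.Count) - 1
        $chunk = @($DeploymentIds[$i..$last])

        # Same structure as the payload captured from the portal request.
        @{
            requests = @(
                @{
                    content              = @{ EntityIds = $chunk; Type = 'Default' }
                    httpMethod           = 'POST'
                    name                 = [guid]::NewGuid().ToString()
                    requestHeaderDetails = @{ commandName = 'HubsExtension.armbulkdelete' }
                    url                  = '/bulkdelete?api-version=2014-04-01-preview'
                }
            )
        } | ConvertTo-Json -Depth 10
    }
}

# Example: 250 placeholder IDs result in three request bodies (100, 100, 50).
$ids = 1..250 | ForEach-Object { "/providers/Microsoft.Management/managementGroups/<managementGroupId>/providers/Microsoft.Resources/deployments/example-$_" }
$bodies = @(New-BulkDeleteBody -DeploymentIds $ids)
```

Each element of `$bodies` is a JSON string that could be posted as-is; generating a fresh GUID for `name` per request is an assumption based on the captured payload.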

@eriqua @MrMCake I am thinking we can tackle this in a way that doesn't get embedded into our deployment / validation scripts, for example with a 'maintenance' type pipeline that runs either at specific intervals or whenever certain PRs are completed. This way it's decoupled and portable enough to be used elsewhere. Thoughts?
Is this a duplicate of https://github.com/Azure/ResourceModules/issues/1342 ?
@MariusStorhaug reading the other issue, it is something else
Hey @ahmadabdalla, I just created a PR for a utility I wrote for this still-manual deployment removal. Whatever we do (e.g. have a recurring pipeline running), we can use this script to clean up. For the time being I think it is just useful to have around.
Reopening since the issue is about automating the deletion through pipelines
I was thinking about this a bit. I think we can do this in one of 2 ways:
- Either we invoke the cleanup before/after each pipeline run (i.e. somewhere in the validate deployment template/action) OR
- We add a platform pipeline that runs e.g. every week.
Personally I'd go for option 2, with some logic like
- Only remove successful deployments
- Unless the deployment happened more than 4(?) weeks prior to the pipeline run, in which case it is removed regardless of its status.
This should be pretty easy to implement, as we already have the function available and would only need to filter the deployments accordingly.
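The filter described above could be sketched roughly as follows. This is an illustration only: `Select-RemovableDeployment` is a hypothetical name, and the input objects are assumed to expose `ProvisioningState` and `Timestamp` properties (as the objects returned by `Get-AzManagementGroupDeployment` are expected to, which should be verified). The rule is read as: remove successful deployments, plus any deployment older than the retention window.

```powershell
# Hypothetical filter (assumption): picks the deployments a cleanup run may remove.
function Select-RemovableDeployment {
    [CmdletBinding()]
    param(
        # Deployment objects; assumed to carry ProvisioningState and Timestamp.
        [Parameter(Mandatory)]
        [AllowEmptyCollection()]
        [object[]] $Deployments,

        # Retention window; 4 weeks is the tentative value from the proposal.
        [int] $MaxAgeInWeeks = 4,

        # Reference time, injectable so the filter can be tested deterministically.
        [datetime] $Now = (Get-Date).ToUniversalTime()
    )

    $cutoff = $Now.AddDays(-7 * $MaxAgeInWeeks)

    # Keep a deployment for removal if it succeeded, or if it is older than the cutoff.
    $Deployments | Where-Object {
        $_.ProvisioningState -eq 'Succeeded' -or $_.Timestamp -lt $cutoff
    }
}

# Example with stand-in objects instead of real Azure deployments.
$now = [datetime]'2022-05-01T00:00:00'
$sample = @(
    [pscustomobject]@{ DeploymentName = 'recent-ok';     ProvisioningState = 'Succeeded'; Timestamp = $now.AddDays(-1) }
    [pscustomobject]@{ DeploymentName = 'recent-failed'; ProvisioningState = 'Failed';    Timestamp = $now.AddDays(-1) }
    [pscustomobject]@{ DeploymentName = 'old-failed';    ProvisioningState = 'Failed';    Timestamp = $now.AddDays(-40) }
)
$removable = @(Select-RemovableDeployment -Deployments $sample -Now $now)
```

In this example the recent failed deployment is kept for troubleshooting, while the other two are selected. A real pipeline would probably also want to skip deployments that are still running, which this sketch does not handle.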
Agree with the proposal.
Adding a couple of further points (the second most probably being a separate issue):
- The platform pipeline should be implemented for both ADO and GH. On our side we should avoid scheduling them at the same time.
- I'd in any case suggest offering an option, with clear guidelines, to opt out of management group level operations, since many wouldn't agree to grant the service principal permissions higher than subscription scope. Users falling into this category should remove e.g. the management group module and its validation pipeline, and clean up the authorization modules that operate at multiple scopes (e.g. RBAC, policies). Of course, for those users the problem described in this issue wouldn't occur, but they also shouldn't register the cleanup platform pipeline.
My preference for keeping the cleanup separate from the validation pipelines is also due to the last point above.