Combine worker-manager provisioner and worker scanner into a single process
The worker-manager has two processes that could be combined:
- `worker-manager-provisioner` - Looks at the pending tasks in the queue and spins up workers (and maybe spins them down)
- `worker-manager-workerscanner` - Checks if provisioned workers are still running. Workers are often configured to shut themselves down if they have no tasks to claim.
According to @imbstack, this design was informed by a previous AWS-only provisioner, which was too slow and would get itself stuck. The new provisioner is designed to launch many workers in parallel, as quickly as possible, and not get stuck. However, this seems to lead to over-provisioning.
His suggestion was to combine the two into a single process that runs the three steps repeatedly:
- Scan the active workers to update state
- Check the queues
- Start an appropriate number of workers
This would ensure that scanning and provisioning workers remain in sync.
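As a rough illustration, the combined loop could look something like the sketch below; `scanWorkers`, `fetchPendingCounts`, and `startWorkers` are hypothetical helpers, not the actual worker-manager APIs:

```ts
// Minimal sketch of the combined loop, assuming hypothetical helpers
// (scanWorkers, fetchPendingCounts, startWorkers) rather than real APIs.
async function combinedLoop(
  scanWorkers: () => Promise<Map<string, number>>,        // workerPoolId -> running workers
  fetchPendingCounts: () => Promise<Map<string, number>>, // workerPoolId -> pending tasks
  startWorkers: (workerPoolId: string, count: number) => Promise<void>,
): Promise<void> {
  while (true) {
    // 1. Scan the active workers to update state
    const running = await scanWorkers();

    // 2. Check the queues
    const pending = await fetchPendingCounts();

    // 3. Start an appropriate number of workers, based on the state just scanned
    for (const [workerPoolId, pendingCount] of pending) {
      const alreadyRunning = running.get(workerPoolId) ?? 0;
      const toStart = Math.max(0, pendingCount - alreadyRunning);
      if (toStart > 0) {
        await startWorkers(workerPoolId, toStart);
      }
    }

    // Pause between iterations to avoid hammering cloud and queue APIs
    await new Promise(resolve => setTimeout(resolve, 60_000));
  }
}
```

Because step 3 always works against the counts gathered in step 1 of the same iteration, the provisioner cannot over-provision based on stale worker state.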
From FxCI logs I can see that:
- the provisioner is currently quite fast: seconds to minutes in the worst cases
- the scanner is extremely slow on FxCI: 10 to 30 minutes and above
If we combined them now, the provisioning part might get delayed, and new tasks would end up waiting too long for the scanner part to finish.
Plus, there is a huge disproportion between providers; for example, at this very moment FxCI runs:
| provider | running tasks | % |
|---|---|---|
| aws | 4250 | ~84% |
| azure | 696 | ~13% |
| gcp | 108 | ~2% |
Those values will probably change, but some providers will still take much longer to scan than others.
My suggestions:
- **Split processes** - run a separate scanner/provisioner for each cloud (providers would be isolated and would not block each other)
- **Optimize the scanner** - minimize scanning time. For example, call the list endpoints to enumerate all running instances instead of making multiple per-instance calls:
- GCP instances/list
- Azure list-all
- AWS DescribeInstances
Then find the instances that are missing from the API response but are considered running according to the database, and either query those individually or mark them as terminated.
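A sketch of that list-and-diff approach, assuming a hypothetical `listAllInstances` wrapper around the bulk endpoints above and illustrative `getNonStoppedWorkers`/`markTerminated` DB helpers (not the real schema):

```ts
// Sketch only: the cloud and db interfaces below are illustrative stand-ins.
interface CloudInstance { instanceId: string; state: string; }
interface DbWorker { workerId: string; instanceId: string; }

async function scanProvider(
  cloud: { listAllInstances: () => Promise<CloudInstance[]> },
  db: {
    getNonStoppedWorkers: () => Promise<DbWorker[]>;
    markTerminated: (workerId: string) => Promise<void>;
  },
): Promise<void> {
  // One bulk list call instead of one API call per worker
  const instances = await cloud.listAllInstances();
  const byId = new Map<string, CloudInstance>();
  for (const instance of instances) {
    byId.set(instance.instanceId, instance);
  }

  // Diff against the database: anything we think is running but the cloud
  // no longer reports (or reports as terminated) gets flagged.
  for (const worker of await db.getNonStoppedWorkers()) {
    const instance = byId.get(worker.instanceId);
    if (!instance || instance.state === 'terminated') {
      // Either query this one individually to double-check, or mark it terminated.
      await db.markTerminated(worker.workerId);
    }
  }
}
```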
I was able to dive deeper into one of those slower scanner loops. It seems it was caused by a sudden spike in some types of Azure win10 workers:
there were around 40-50 workers of that type, then suddenly +100 more,

which made the whole scan loop take 1h30min.

The query for `get_non_stopped_workers_quntil` slows down at the same time.

No conclusions for now, to be continued.
A couple of interesting observations came out when looking at one of the slow loops:
- we didn't hit rate limiting in the cloud APIs in recent weeks, which means we are not being slowed down by the cloud providers
- Some VMs take 2h+ to deprovision (spread across multiple scan loops)
- Azure API calls for the 2h+ slow loop:

| type | count |
|---|---|
| deprovision VM | 542 |
| deprovision NIC | 404 |
| deprovision disks | 344 |
| deprovision IP | 288 |
| - | - |
| provisioning IP | 1265 |
| provisioning NIC | 750 |
| provisioning VM | 384 |
| - | - |
| query resource | 1927 |

296 VMs took 3+ loops just to provision an IP; 88 VMs waited 4 loops.
- Azure API calls for the 1h+ slow loop:

| type | count |
|---|---|
| deprovision VM | 189 |
| deprovision NIC | 160 |
| deprovision disks | 172 |
| deprovision IP | 128 |
| - | - |
| provisioning IP | 349 |
| provisioning NIC | 150 |
| provisioning VM | 36 |
| - | - |
| query resource | 499 |

36 VMs took 3 loops just to provision an IP.
This corresponds to the order of operations:
- Deprovision: kill VM, kill NIC, kill Disks, kill IP
- Provision: request IP, request NIC, request VM
Based on those values, I can guess that the main bottleneck in provisioning new instances is getting an IP. And because it happens across several scan loops, it leads to errors like #4999, where a worker registers itself well past its 30min deadline.
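To make the multi-loop story concrete, here is a simplified model of that progression (a simplification, not the real Azure provider code): each worker advances roughly one resource per scan loop, so a new VM needs at least three loops before it can boot and register.

```ts
// Simplified model: one Azure resource advances per scan loop.
type AzureStep = 'ip' | 'nic' | 'vm' | 'running';

function advance(current: AzureStep): AzureStep {
  switch (current) {
    case 'ip':      return 'nic';     // loop N:   public IP requested
    case 'nic':     return 'vm';      // loop N+1: NIC created and attached to the IP
    case 'vm':      return 'running'; // loop N+2: VM created and attached to the NIC
    case 'running': return 'running';
  }
}
```

With slow loops taking 1-2 hours each, three or more loops easily push a worker past the 30-minute registration deadline mentioned above.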
Unfortunately, the Azure API doesn't allow a "one-click checkout" to create an instance with a single call (unlike GCP and AWS).
Some ideas for optimization:
- check whether all of those instances require a public IP in the first place (if they use generic-worker and the proxy tunnel, they probably don't)
- investigate whether it is possible to re-use IP/NIC resources for different VMs: keep a pool of "reserved" IPs/NICs that are not deprovisioned instantly but kept alive for a couple of hours to days(?), so they can be used again for newly created instances (see the sketch after this list)
- split the Azure loop into several concurrent jobs/threads/loops to allow parallel operations
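For the re-use idea, a rough sketch of the pool bookkeeping (the class, method names, and the 6-hour retention are illustrative; the real Azure create/delete calls would happen elsewhere):

```ts
// Illustrative pool of "parked" public IPs that are reused instead of being
// deprovisioned and re-provisioned on every worker lifecycle.
interface ReservedIp { ipId: string; releasedAt: number; }

class IpPool {
  private pool: ReservedIp[] = [];
  constructor(private maxAgeMs = 6 * 60 * 60 * 1000) {} // keep parked IPs ~6h (tunable)

  // Called where we would otherwise deprovision the IP: park it instead.
  release(ipId: string): void {
    this.pool.push({ ipId, releasedAt: Date.now() });
  }

  // Called before provisioning a new IP: reuse a parked one if available.
  acquire(): string | undefined {
    return this.pool.shift()?.ipId;
  }

  // IPs parked longer than maxAgeMs should still be deprovisioned for real.
  reapExpired(now = Date.now()): string[] {
    const expired = this.pool.filter(r => now - r.releasedAt >= this.maxAgeMs);
    this.pool = this.pool.filter(r => now - r.releasedAt < this.maxAgeMs);
    return expired.map(r => r.ipId);
  }
}
```

Whether attaching an existing public IP/NIC to a new VM is actually faster than creating fresh ones would need to be measured before committing to this.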
Public IPs should only be needed by docker-worker (for integration with https://github.com/taskcluster/stateless-dns-server) and I don't think we deploy docker-worker in azure, so I think public IPs aren't needed at all in our azure worker pools.
That's great news @petemoore. I checked the provider configs on FxCI, and the Azure configs do not contain anything related to docker-worker config, only generic-worker configs. We could make this check at the provider config level: if a config doesn't contain a docker config, we can skip IP/NIC creation. I would need to verify first whether it works at all without an IP/NIC.
It might require a different approach though: https://docs.microsoft.com/en-us/azure/virtual-network/nat-gateway/nat-overview
Upd: also confirmed by :mcornmesser.
From previous conversations, we have no need for public IPs in Azure. The VMs are provisioned with them because of an old requirement for livelog that is no longer current.
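A rough sketch of the config-level check proposed above (the config shape here is illustrative, not the exact worker-manager launch config schema):

```ts
// Illustrative worker pool config shape, not the real schema.
interface WorkerPoolLaunchConfig {
  workerConfig?: {
    genericWorker?: object;
    dockerWorker?: object;
  };
}

// Only provision a public IP when the pool actually needs one: docker-worker
// relies on a publicly reachable address (stateless-dns-server / livelog),
// while generic-worker pools behind the proxy tunnel do not.
function needsPublicIp(config: WorkerPoolLaunchConfig): boolean {
  return Boolean(config.workerConfig?.dockerWorker);
}

// During Azure provisioning (pseudocode):
//   if (needsPublicIp(launchConfig)) { create public IP, then NIC with that IP }
//   else { create NIC without a public IP }
```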
Before merging scanner and provisioner into a single process, I would like to experiment with splitting the Azure scanner from the rest.
The idea would be to have the existing workerScanner doing loops for everything except Azure workers, while a second one only checks Azure workers.
Provisional steps towards improving Azure scanner:
- step 1: generic workers on Azure do not need a public IP (unless someone is using an RDP session), so provisioning of those is optional (44.9.0-2)
- step 2: split the Azure worker scanner from the rest of the providers. Isolation will help unblock faster cloud providers that should not have to wait 2 hours until Azure is done provisioning its own workers (see the sketch after this list)
- step 3: measure the performance of the isolated Azure scanner and determine strategies to improve the "1 resource per loop" provisioning story.
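A sketch of what the split in step 2 could look like: two scanner processes, each with its own provider list and iteration budget (`runScannerLoop`, the provider names, and the timings are assumptions, not the actual worker-manager configuration):

```ts
// Sketch only: scanProvider() stands in for whatever actually scans one
// provider's workers; the provider lists and timings are assumptions.
async function runScannerLoop(
  providers: string[],
  scanProvider: (provider: string) => Promise<void>,
  maxIterationTimeMs: number,
): Promise<void> {
  while (true) {
    const started = Date.now();
    for (const provider of providers) {
      await scanProvider(provider);
    }
    const elapsed = Date.now() - started;
    if (elapsed > maxIterationTimeMs) {
      console.warn(`scan loop for [${providers.join(', ')}] took ${elapsed}ms, over its ${maxIterationTimeMs}ms budget`);
    }
  }
}

// Process 1: everything except Azure, expected to iterate quickly.
// runScannerLoop(['aws', 'google'], scanProvider, 30 * 60 * 1000);

// Process 2: Azure only, allowed a much larger iteration budget.
// runScannerLoop(['azure'], scanProvider, 2 * 60 * 60 * 1000);
```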
Upd:
This bugzilla issue gave more insight into the problem: there were almost 10 hours of worker-scanner misery, caused by failing Azure API calls. During that time there were 336 cloud-api-paused events, because of calls taking minutes or failing with 50x errors.
In my opinion, we should decouple Azure and give it increased max iteration time values.