Combine worker-manager provisioner and worker scanner into a single process
The worker-manager has two processes that could be combined:
- `worker-manager-provisioner` - Looks at the pending tasks in the queue and spins up workers (and maybe spins them down)
- `worker-manager-workerscanner` - Checks if provisioned workers are still running. Workers are often configured to shut themselves down if they have no tasks to claim.
According to @imbstack, this design was informed by a previous AWS-only provisioner, which was too slow and would get itself stuck. The new provisioner is designed to launch many workers in parallel, as quickly as possible, and not get stuck. However, this seems to lead to over-provisioning.
His suggestion was to combine the two into a single process that runs the three steps repeatedly:
- Scan the active workers to update state
- Check the queues
- Start an appropriate number of workers
This would ensure that scanning and provisioning workers remain in sync.
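As a rough illustration, the combined loop could look something like the sketch below; `scanWorkers`, `fetchPendingCounts`, and `startWorkers` are hypothetical helpers, not the actual worker-manager APIs:

```ts
// Minimal sketch of the combined loop, assuming hypothetical helpers
// (scanWorkers, fetchPendingCounts, startWorkers) rather than real APIs.
async function combinedLoop(
  scanWorkers: () => Promise<Map<string, number>>,        // workerPoolId -> running workers
  fetchPendingCounts: () => Promise<Map<string, number>>, // workerPoolId -> pending tasks
  startWorkers: (workerPoolId: string, count: number) => Promise<void>,
): Promise<void> {
  while (true) {
    // 1. Scan the active workers to update state
    const running = await scanWorkers();

    // 2. Check the queues
    const pending = await fetchPendingCounts();

    // 3. Start an appropriate number of workers, based on the state just scanned
    for (const [workerPoolId, pendingCount] of pending) {
      const alreadyRunning = running.get(workerPoolId) ?? 0;
      const toStart = Math.max(0, pendingCount - alreadyRunning);
      if (toStart > 0) {
        await startWorkers(workerPoolId, toStart);
      }
    }

    // Pause between iterations to avoid hammering cloud and queue APIs
    await new Promise(resolve => setTimeout(resolve, 60_000));
  }
}
```

Because step 3 always works against the counts gathered in step 1 of the same iteration, the provisioner cannot over-provision based on stale worker state.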
From FxCI logs I can see that:
- the provisioner is currently quite fast: seconds to minutes in the worst cases
- the scanner is extremely slow on FxCI: 10 to 30 minutes and above
If we combined them now, the provisioning part might get delayed, and new tasks would end up waiting too long for the scanner part to finish.
Plus, there is a huge disproportion between providers; for example, at this very moment FxCI runs:
| provider | running tasks | % |
|---|---|---|
| aws | 4250 | ~84% |
| azure | 696 | ~13% |
| gcp | 108 | ~2% |
Those values will probably change, but some providers will still take much longer to scan than others.
My suggestions:
- **Split processes** - run a separate scanner/provisioner for each cloud (providers would be isolated and would not block each other)
- **Optimize the scanner** - minimize scanning time. For example, call the list endpoints to enumerate all running instances instead of making multiple per-instance calls:
- GCP instances/list
- Azure list-all
- AWS DescribeInstances
Then find the instances that are missing from the API response but are considered running according to the database, and either query those individually or mark them as terminated.
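A sketch of that list-and-diff approach, assuming a hypothetical `listAllInstances` wrapper around the bulk endpoints above and illustrative `getNonStoppedWorkers`/`markTerminated` DB helpers (not the real schema):

```ts
// Sketch only: the cloud and db interfaces below are illustrative stand-ins.
interface CloudInstance { instanceId: string; state: string; }
interface DbWorker { workerId: string; instanceId: string; }

async function scanProvider(
  cloud: { listAllInstances: () => Promise<CloudInstance[]> },
  db: {
    getNonStoppedWorkers: () => Promise<DbWorker[]>;
    markTerminated: (workerId: string) => Promise<void>;
  },
): Promise<void> {
  // One bulk list call instead of one API call per worker
  const instances = await cloud.listAllInstances();
  const byId = new Map<string, CloudInstance>();
  for (const instance of instances) {
    byId.set(instance.instanceId, instance);
  }

  // Diff against the database: anything we think is running but the cloud
  // no longer reports (or reports as terminated) gets flagged.
  for (const worker of await db.getNonStoppedWorkers()) {
    const instance = byId.get(worker.instanceId);
    if (!instance || instance.state === 'terminated') {
      // Either query this one individually to double-check, or mark it terminated.
      await db.markTerminated(worker.workerId);
    }
  }
}
```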
I was able to dive deeper into one of those slower scanner loops. It seems it was caused by a sudden spike in some types of Azure win10 workers:
there were around 40-50 workers of that type, then suddenly +100 more,

which made the whole scan loop take 1h30min.

The query for `get_non_stopped_workers_quntil` slows down at the same time.

No conclusions for now, to be continued.
A couple of interesting observations came out when looking at one of the slow loops:
- we didn't hit rate limiting in the cloud APIs in recent weeks, which means we are not being slowed down by the cloud providers
- Some VMs take 2h+ to deprovision (spread across multiple scan loops)
- Azure API calls for the 2h+ slow loop:

| type | count |
|---|---|
| deprovision VM | 542 |
| deprovision NIC | 404 |
| deprovision disks | 344 |
| deprovision IP | 288 |
| - | - |
| provisioning IP | 1265 |
| provisioning NIC | 750 |
| provisioning VM | 384 |
| - | - |
| query resource | 1927 |

296 VMs took 3+ loops just to provision an IP; 88 VMs waited 4 loops.
- Azure API calls for the 1h+ slow loop:

| type | count |
|---|---|
| deprovision VM | 189 |
| deprovision NIC | 160 |
| deprovision disks | 172 |
| deprovision IP | 128 |
| - | - |
| provisioning IP | 349 |
| provisioning NIC | 150 |
| provisioning VM | 36 |
| - | - |
| query resource | 499 |

36 VMs took 3 loops just to provision an IP.
This corresponds to the order of operations:
- Deprovision: kill VM, kill NIC, kill Disks, kill IP
- Provision: request IP, request NIC, request VM
Based on those values, I can guess that the main bottleneck in provisioning new instances is getting an IP. And because it happens across several scan loops, it leads to errors like #4999, where a worker registers itself well past its 30min deadline.
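To make the multi-loop story concrete, here is a simplified model of that progression (a simplification, not the real Azure provider code): each worker advances roughly one resource per scan loop, so a new VM needs at least three loops before it can boot and register.

```ts
// Simplified model: one Azure resource advances per scan loop.
type AzureStep = 'ip' | 'nic' | 'vm' | 'running';

function advance(current: AzureStep): AzureStep {
  switch (current) {
    case 'ip':      return 'nic';     // loop N:   public IP requested
    case 'nic':     return 'vm';      // loop N+1: NIC created and attached to the IP
    case 'vm':      return 'running'; // loop N+2: VM created and attached to the NIC
    case 'running': return 'running';
  }
}
```

With slow loops taking 1-2 hours each, three or more loops easily push a worker past the 30-minute registration deadline mentioned above.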
Unfortunately, the Azure API doesn't allow a "one-click checkout" to create an instance with a single call (unlike GCP and AWS).
Some ideas for optimization:
- check whether all of those instances require a public IP in the first place (if they use generic-worker and the proxy tunnel, they probably don't)
- investigate whether it is possible to re-use IP/NIC resources for different VMs: keep a pool of "reserved" IPs/NICs that are not deprovisioned instantly but kept alive for a couple of hours to days(?), so they can be used again for newly created instances (see the sketch after this list)
- split the Azure loop into several concurrent jobs/threads/loops to allow parallel operations
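For the re-use idea, a rough sketch of the pool bookkeeping (the class, method names, and the 6-hour retention are illustrative; the real Azure create/delete calls would happen elsewhere):

```ts
// Illustrative pool of "parked" public IPs that are reused instead of being
// deprovisioned and re-provisioned on every worker lifecycle.
interface ReservedIp { ipId: string; releasedAt: number; }

class IpPool {
  private pool: ReservedIp[] = [];
  constructor(private maxAgeMs = 6 * 60 * 60 * 1000) {} // keep parked IPs ~6h (tunable)

  // Called where we would otherwise deprovision the IP: park it instead.
  release(ipId: string): void {
    this.pool.push({ ipId, releasedAt: Date.now() });
  }

  // Called before provisioning a new IP: reuse a parked one if available.
  acquire(): string | undefined {
    return this.pool.shift()?.ipId;
  }

  // IPs parked longer than maxAgeMs should still be deprovisioned for real.
  reapExpired(now = Date.now()): string[] {
    const expired = this.pool.filter(r => now - r.releasedAt >= this.maxAgeMs);
    this.pool = this.pool.filter(r => now - r.releasedAt < this.maxAgeMs);
    return expired.map(r => r.ipId);
  }
}
```

Whether attaching an existing public IP/NIC to a new VM is actually faster than creating fresh ones would need to be measured before committing to this.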
Public IPs should only be needed by docker-worker (for integration with https://github.com/taskcluster/stateless-dns-server) and I don't think we deploy docker-worker in azure, so I think public IPs aren't needed at all in our azure worker pools.
That's great news @petemoore. I checked the provider configs on FxCI, and the Azure configs do not contain anything related to docker-worker config, only generic-worker configs. We could make this check at the provider config level: if a config doesn't contain a docker config, we can skip IP/NIC creation. I would need to verify first whether it works at all without an IP/NIC.
It might require a different approach though: https://docs.microsoft.com/en-us/azure/virtual-network/nat-gateway/nat-overview
Upd: also confirmed by :mcornmesser.
From previous conversations, we have no need for public IPs in Azure. The VMs are provisioned with them because of an old requirement for livelog that is no longer current.
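A rough sketch of the config-level check proposed above (the config shape here is illustrative, not the exact worker-manager launch config schema):

```ts
// Illustrative worker pool config shape, not the real schema.
interface WorkerPoolLaunchConfig {
  workerConfig?: {
    genericWorker?: object;
    dockerWorker?: object;
  };
}

// Only provision a public IP when the pool actually needs one: docker-worker
// relies on a publicly reachable address (stateless-dns-server / livelog),
// while generic-worker pools behind the proxy tunnel do not.
function needsPublicIp(config: WorkerPoolLaunchConfig): boolean {
  return Boolean(config.workerConfig?.dockerWorker);
}

// During Azure provisioning (pseudocode):
//   if (needsPublicIp(launchConfig)) { create public IP, then NIC with that IP }
//   else { create NIC without a public IP }
```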
Before merging scanner and provisioner into a single process, I would like to experiment with splitting the Azure scanner from the rest.
The idea would be to have the existing workerScanner doing loops for everything except Azure workers, while a second one only checks Azure workers.
Provisional steps towards improving Azure scanner:
- step 1: generic workers on Azure do not need a public IP (unless someone is using an RDP session), so provisioning of those is optional (44.9.0-2)
- step 2: split the Azure worker scanner from the rest of the providers. Isolation will help unblock faster cloud providers that should not have to wait 2 hours until Azure is done provisioning its own workers (see the sketch after this list)
- step 3: measure the performance of the isolated Azure scanner and determine strategies to improve the "1 resource per loop" provisioning story.
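A sketch of what the split in step 2 could look like: two scanner processes, each with its own provider list and iteration budget (`runScannerLoop`, the provider names, and the timings are assumptions, not the actual worker-manager configuration):

```ts
// Sketch only: scanProvider() stands in for whatever actually scans one
// provider's workers; the provider lists and timings are assumptions.
async function runScannerLoop(
  providers: string[],
  scanProvider: (provider: string) => Promise<void>,
  maxIterationTimeMs: number,
): Promise<void> {
  while (true) {
    const started = Date.now();
    for (const provider of providers) {
      await scanProvider(provider);
    }
    const elapsed = Date.now() - started;
    if (elapsed > maxIterationTimeMs) {
      console.warn(`scan loop for [${providers.join(', ')}] took ${elapsed}ms, over its ${maxIterationTimeMs}ms budget`);
    }
  }
}

// Process 1: everything except Azure, expected to iterate quickly.
// runScannerLoop(['aws', 'google'], scanProvider, 30 * 60 * 1000);

// Process 2: Azure only, allowed a much larger iteration budget.
// runScannerLoop(['azure'], scanProvider, 2 * 60 * 60 * 1000);
```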
Upd:
This bugzilla issue gave more insight into the problem: there were almost 10 hours of worker-scanner misery, caused by failing Azure API calls. During that time there were 336 cloud-api-paused events, because of calls taking minutes or failing with 50x errors.
In my opinion, we should decouple Azure and give it increased max iteration time values.