akash tx update response & behavior when provider is out of resources
When a user updates their deployment, they may see the following confusing message. In this example the user was updating the deployment through Akashlytics:
```
web: undefined [Warning] [FailedScheduling] [Pod] 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient cpu.
web: undefined [Normal] [SuccessfulCreate] [ReplicaSet] Created pod: web-6db9665ccb-92p4v
web: undefined [Normal] [ScalingReplicaSet] [Deployment] Scaled up replica set web-6db9665ccb to 1
```
This happens because K8s won't destroy the old pod until it has ensured the new one is up (the default rolling-update behavior). Since no node has enough free resources to schedule the new pod, the old pod stays in "Running" state while the new one is stuck in "Pending". Things will move on as soon as one of the nodes has enough free CPU, RAM & disk to satisfy the deployment's requests.
This is how K8s prevents a service outage; however, the user should get a clearer message. Alternatively, the user could be offered an option such as `--force` which would destroy the previously running pod first, i.e. something similar to a destroy & recreate.
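For context, the destroy-then-recreate behavior a `--force` flag would provide already exists in Kubernetes as the `Recreate` deployment strategy (as opposed to the default `RollingUpdate`). A minimal sketch of a Deployment manifest using it — the names, labels, and image below are illustrative, not taken from this issue:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                    # illustrative name
spec:
  replicas: 1
  strategy:
    type: Recreate             # kill the old pod before scheduling the new one
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example/web:latest   # placeholder image
          resources:
            requests:
              cpu: "10"               # matches the cpu request shown in this issue
```

The trade-off is exactly the one a `--force` option would accept: with `Recreate` the service is down between old-pod termination and new-pod startup, rather than being protected by the rolling update.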
```
root@foxtrot:~# kubectl -n $NS get pods
NAME                   READY   STATUS    RESTARTS   AGE
web-69989588c7-2w5c4   1/1     Running   0          17h
web-6db9665ccb-92p4v   0/1     Pending   0          18m
root@foxtrot:~# kubectl -n $NS describe pods | grep -Ew "^Name:|cpu:"
Name:         web-69989588c7-2w5c4
      cpu:  10
      cpu:  10
Name:         web-6db9665ccb-92p4v
      cpu:  10
      cpu:  10
```
```
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.444] inventory fetched module=provider-cluster cmp=service cmp=inventory-service nodes=7
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.445] node resources module=provider-cluster cmp=service cmp=inventory-service node-id=foxtrot.provider available-cpu="units:<val:\"6875\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"16263143424\" > " available-storage="quantity:<val:\"225335708095\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.445] node resources module=provider-cluster cmp=service cmp=inventory-service node-id=golf.provider available-cpu="units:<val:\"125\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"30669094912\" > " available-storage="quantity:<val:\"880184186644\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.445] node resources module=provider-cluster cmp=service cmp=inventory-service node-id= available-cpu="units:<val:\"0\" > attributes:<key:\"arch\" > " available-memory="quantity:<val:\"0\" > " available-storage="quantity:<val:\"0\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.445] node resources module=provider-cluster cmp=service cmp=inventory-service node-id=alpha.ingress available-cpu="units:<val:\"3625\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"13253083136\" > " available-storage="quantity:<val:\"849673462802\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.446] node resources module=provider-cluster cmp=service cmp=inventory-service node-id=bravo.ingress available-cpu="units:<val:\"8025\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"22012821504\" > " available-storage="quantity:<val:\"313339421714\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.446] node resources module=provider-cluster cmp=service cmp=inventory-service node-id=charley.ingress available-cpu="units:<val:\"5625\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"27698855936\" > " available-storage="quantity:<val:\"96585534905\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.446] node resources module=provider-cluster cmp=service cmp=inventory-service node-id=delta.ingress available-cpu="units:<val:\"3625\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"31188185088\" > " available-storage="quantity:<val:\"880234408520\" > "
```
cc @boz @dmikey
I am that user -
Filled up a provider's machines with some xmrig deployments to over 90% fill rate, had 2 of them crash, went to re-deploy a new image using the update button in Akashlytics, and was unable to:
```
web: undefined [Warning] [FailedScheduling] [Pod] 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient cpu.
web: undefined [Normal] [SuccessfulCreate] [ReplicaSet] Created pod: web-6db9665ccb-92p4v
web: undefined [Normal] [ScalingReplicaSet] [Deployment] Scaled up replica set web-6db9665ccb to 1
```
As a provider, I have the same situation: the same user has 2 deployments and is unable to update them. A new replica set is created, but because of the lack of resources the new replica set never comes online. Note that there is no disruption of service, as the old replica set stays up.
Thanks all for the report. Interesting case. A few thoughts:
- Some kind of optional `--force` with a clear message around it is a good suggestion.
- Inventory "overcommit" can be reduced.
- Inventory can always reserve double the largest deployed resources.

All of them have drawbacks, of course. In the meantime, for mining and other stateless workloads, I suggest closing the deployment and creating a new one if you hit the described scenario.
I think having `--force` as the default option is a good solution; the majority of deployments can be forced. For the rare case where high availability is needed, a special option could be offered, with the understanding that updates may then be more difficult. Currently, as a provider, when I see a deployment stuck because of this (generally a miner), I just force-delete the old replica set so the deployment works again.
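For reference, the manual provider-side workaround described above can be done with kubectl against the tenant's namespace. A sketch, assuming the stuck deployment lives in namespace `$NS`; the replica set name below is illustrative and must be read off the actual `kubectl get rs` output:

```
# List replica sets in the tenant namespace; the old one is fully scaled,
# the new one shows DESIRED=1 but READY=0
kubectl -n "$NS" get rs

# Delete the old replica set; the Deployment's new replica set can then
# schedule its pod on the freed resources (replica set name is illustrative)
kubectl -n "$NS" delete rs web-69989588c7
```

Note this causes a brief outage for the service, which is acceptable for stateless workloads like miners but not for high-availability deployments.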
I've hit this today, bug still in place.