The service cannot be started after installation.
This may be due to the docker-cache service not starting after running /bin/bash quick-start-service.sh.
However, the job cannot be started. Below is the kubelet log; it keeps reporting "Pull complete" for the same image layer. The openpai system itself did not output any logs.
csip@csip-090:~$ sudo systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet Server
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-04-27 13:19:30 CST; 4min 39s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Main PID: 173276 (kubelet)
Tasks: 0
Memory: 32.5M
CPU: 1.107s
CGroup: /system.slice/kubelet.service
‣ 173276 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=172.17.175.90 --hostname-override=csip-090 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf
Apr 27 13:23:31 csip-090 kubelet[173276]: I0427 13:23:31.097675 173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:23:31 csip-090 kubelet[173276]: I0427 13:23:31.776293 173276 setters.go:73] Using node IP: "172.17.175.90"
Apr 27 13:23:41 csip-090 kubelet[173276]: I0427 13:23:41.097618 173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:23:41 csip-090 kubelet[173276]: I0427 13:23:41.788918 173276 setters.go:73] Using node IP: "172.17.175.90"
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.097632 173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.382029 173276 kubelet_getters.go:177] status for pod nginx-proxy-csip-090 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-04-27 11:48:40 +0800 CST } {Ready True 0001-01-01 00:00:00 +000
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.662060 173276 endpoint.go:111] State pushed for device plugin github.com/fuse
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.800339 173276 setters.go:73] Using node IP: "172.17.175.90"
Apr 27 13:24:01 csip-090 kubelet[173276]: I0427 13:24:01.097620 173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:24:01 csip-090 kubelet[173276]: I0427 13:24:01.813667 173276 setters.go:73] Using node IP: "172.17.175.90"
Originally posted by @siaimes in https://github.com/microsoft/pai/issues/5445#issuecomment-827327246
The reason I was able to start the docker-cache service manually in https://github.com/microsoft/pai/issues/5445#issuecomment-827337790 is that I had upgraded from v1.3.0 to v1.6.0, so the images for openpai v1.6.0 were already present on my node.
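However, for users with a fresh installation, running /bin/bash quick-start-service.sh will fail, because the following three conditions hold at the same time:
- No node has the images for the corresponding version of openpai;
- All nodes have modified /etc/docker/daemon.json to point the image pull address to the master node;
- The docker-cache service on the master node is not started.
These three conditions conflict with one another: every image pull is directed to the docker-cache service on the master node, but that service apparently cannot start because its own image is not yet available locally.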
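To check whether a node is in the state described by the conditions above, something like the following could be used. This is only a sketch: it assumes the pull address is redirected through the "registry-mirrors" key in /etc/docker/daemon.json (the report does not say which key is used), and the function names mirror_configured and image_present are made up for illustration.

```shell
#!/bin/bash
# Sketch only. Assumption: the redirect to the master node is configured
# via the "registry-mirrors" key; adjust the pattern if another key is used.

# Succeeds if the given daemon.json (default: /etc/docker/daemon.json)
# contains a "registry-mirrors" entry.
mirror_configured() {
  local file="${1:-/etc/docker/daemon.json}"
  grep -q '"registry-mirrors"' "$file" 2>/dev/null
}

# Succeeds if the named image is already in the local Docker image store.
image_present() {
  docker image inspect "$1" >/dev/null 2>&1
}
```

If mirror_configured succeeds and image_present fails for the docker-cache image of your deployment, the node can only obtain that image through the mirror, which is the situation described above.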
Possible solutions:
- Pull the docker-cache image to the master node before restarting docker.
- When running /bin/bash quick-start-service.sh, start the docker-cache service first.
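The ordering in the first suggestion might look roughly like the sketch below. It is not taken from the openpai installer: the helper names run and prepull_then_restart_docker are hypothetical, the image name must be supplied by the caller because this report does not give it, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Sketch only; helper names are hypothetical and not part of openpai.

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Pull the image while Docker still uses its previous registry settings,
# and only then restart Docker so the modified daemon.json takes effect.
prepull_then_restart_docker() {
  local image="$1"   # docker-cache image name, supplied by the caller
  run docker pull "$image"
  run sudo systemctl restart docker
}
```

Calling prepull_then_restart_docker with the docker-cache image name on the master node prints the two commands in the intended order; setting DRY_RUN=0 would actually run them.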