The service cannot be started after installation.

Open siaimes opened this issue 4 years ago • 1 comments

Maybe this is due to the docker-cache service not starting after running /bin/bash quick-start-service.sh.

But the job cannot be started. This is the kubelet log: it keeps repeating the same "Pull complete" message for the image. The OpenPAI system itself did not output any logs.

csip@csip-090:~$ sudo systemctl status kubelet.service 
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-04-27 13:19:30 CST; 4min 39s ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 173276 (kubelet)
    Tasks: 0
   Memory: 32.5M
      CPU: 1.107s
   CGroup: /system.slice/kubelet.service
           ‣ 173276 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=172.17.175.90 --hostname-override=csip-090 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf 

Apr 27 13:23:31 csip-090 kubelet[173276]: I0427 13:23:31.097675  173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:23:31 csip-090 kubelet[173276]: I0427 13:23:31.776293  173276 setters.go:73] Using node IP: "172.17.175.90"
Apr 27 13:23:41 csip-090 kubelet[173276]: I0427 13:23:41.097618  173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:23:41 csip-090 kubelet[173276]: I0427 13:23:41.788918  173276 setters.go:73] Using node IP: "172.17.175.90"
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.097632  173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.382029  173276 kubelet_getters.go:177] status for pod nginx-proxy-csip-090 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-04-27 11:48:40 +0800 CST  } {Ready True 0001-01-01 00:00:00 +000
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.662060  173276 endpoint.go:111] State pushed for device plugin github.com/fuse
Apr 27 13:23:51 csip-090 kubelet[173276]: I0427 13:23:51.800339  173276 setters.go:73] Using node IP: "172.17.175.90"
Apr 27 13:24:01 csip-090 kubelet[173276]: I0427 13:24:01.097620  173276 kube_docker_client.go:342] Pulling image "openpai/standard:python_3.6-pytorch_1.2.0-gpu": "7b872974e97c: Pull complete "
Apr 27 13:24:01 csip-090 kubelet[173276]: I0427 13:24:01.813667  173276 setters.go:73] Using node IP: "172.17.175.90"

Originally posted by @siaimes in https://github.com/microsoft/pai/issues/5445#issuecomment-827327246

siaimes, Jul 27 '21 13:07

The reason I could manually start the docker-cache service in https://github.com/microsoft/pai/issues/5445#issuecomment-827337790 is that I had upgraded from v1.3.0 to v1.6.0, so my node already had the images for OpenPAI v1.6.0.

However, for users with a fresh installation, running /bin/bash quick-start-service.sh will fail, because all three of the following hold:

  1. No node has the images for the corresponding version of OpenPAI;
  2. Every node has a modified /etc/docker/daemon.json that points the image pull address to the master node;
  3. The docker-cache service on the master node has not been started.

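These three conditions conflict: image pulls are redirected to the master node's docker-cache service, but that service is not running, and on a fresh installation its own image is not available locally either.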
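
To check condition 2 on a node, something like the following could list the mirrors configured in daemon.json. This is only a rough sketch: it assumes the installer sets Docker's standard `registry-mirrors` key (the thread does not say which key is changed), and `list_registry_mirrors` is a hypothetical helper name, not part of OpenPAI.

```shell
#!/usr/bin/env bash
# Sketch: print the registry mirror URLs found in a Docker daemon.json.
# Assumption: the image pull address is set through the "registry-mirrors" key.

list_registry_mirrors() {
  local file="${1:-/etc/docker/daemon.json}"
  # Nothing to report if the file is missing or unreadable.
  [ -r "$file" ] || return 0
  # Join the file into one line, isolate the "registry-mirrors" array,
  # then print each quoted URL inside it without the quotes.
  tr -d '\n' < "$file" \
    | grep -o '"registry-mirrors"[[:space:]]*:[[:space:]]*\[[^]]*]' \
    | grep -o '"[a-z]*://[^"]*"' \
    | tr -d '"'
}

# Example call (prints one URL per line, or nothing if no mirror is set):
#   list_registry_mirrors /etc/docker/daemon.json
```

If this prints the master node's address while the docker-cache service is down, pulls on that node have nowhere to go until the service is started.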

Possible solutions: Pull the docker-cache image to the master node before restarting docker. When running /bin/bash quick-start-service.sh, first start the docker-cache service.
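
The ordering I have in mind could be sketched roughly as below. This is not the actual installer code: `DOCKER_CACHE_IMAGE` is a hypothetical variable (the image name is not given in this thread), `run` and `bootstrap_master` are made-up helper names, and the script only prints the steps unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch of the proposed ordering on the master node.
# By default (DRY_RUN=1) the commands are only printed, not executed.

# Hypothetical placeholder: set this to the real docker-cache image before a real run.
DOCKER_CACHE_IMAGE="${DOCKER_CACHE_IMAGE:-DOCKER_CACHE_IMAGE_NOT_SET}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

bootstrap_master() {
  # 1. Pull the cache image before Docker is restarted with the modified daemon.json.
  run docker pull "$DOCKER_CACHE_IMAGE"
  # 2. Restart Docker so the modified /etc/docker/daemon.json takes effect.
  run sudo systemctl restart docker
  # 3. Start the services; under the proposal, this script would start docker-cache first.
  run /bin/bash quick-start-service.sh
}

bootstrap_master
```

The point is only the order: the docker-cache image has to be on the master node before Docker is restarted with the new configuration, and docker-cache has to be up before any other service tries to pull.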

siaimes, Jul 27 '21 14:07