How to restart all services of k8s and openpai?

Open poetryben88 opened this issue 4 years ago • 10 comments

My dev machine and master worker restarted, and afterwards k8s and openpai did not start. What command should I run on the master worker to start the services (k8s and openpai)?

Thanks! God bless you.

poetryben88 avatar Oct 12 '21 10:10 poetryben88

All services will start automatically. One possible situation is that you are using a swap partition. In this case, the cluster will not start automatically because k8s does not support swap.
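
To check whether swap is what is blocking startup, one option is to look at /proc/swaps, which lists the active swap areas below a single header line. The following is a minimal sketch, assuming a Linux node; the function name swap_is_active and its optional file argument are illustrative and not part of k8s or openpai.

```shell
#!/bin/sh
# swap_is_active [FILE]
# Succeeds (exit 0) if FILE, in /proc/swaps format, lists at least one
# swap area after the header line. FILE defaults to /proc/swaps.
swap_is_active() {
    swaps_file="${1:-/proc/swaps}"
    # Line 1 is the header; any non-empty line after it is an active swap area.
    awk 'NR > 1 && NF > 0 { found = 1 } END { exit !found }' "$swaps_file"
}

# Example: report the state of the current node.
if [ -r /proc/swaps ]; then
    if swap_is_active; then
        echo "swap is enabled; kubelet may refuse to start"
    else
        echo "swap is disabled"
    fi
fi
```

If the function reports that swap is enabled, that matches the situation described above.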

siaimes avatar Oct 12 '21 12:10 siaimes

Thanks! If I have to restart the k8s and pai services, how should I do it?

poetryben88 avatar Oct 13 '21 08:10 poetryben88

Just disable swap at the system level.

siaimes avatar Oct 13 '21 09:10 siaimes

Bro, I'm asking whether there is a way to restart the whole set of k8s services, not how to turn swap on or off.

poetryben88 avatar Oct 13 '21 09:10 poetryben88

sudo systemctl restart kubelet.service

siaimes avatar Oct 13 '21 10:10 siaimes

But you cannot start kubelet without disabling swap.

siaimes avatar Oct 13 '21 10:10 siaimes

You can't. k8s is meant to run forever, and there is no 'one-click' restart. If other pods on k8s besides pai, or even the kube-apiserver(s), also failed to start, then perhaps your cluster is corrupted. You have to check the logs to see what happened.
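
For the log check, on a systemd-managed node the kubelet's messages can usually be read with journalctl. Here is a small sketch, assuming kubelet runs as the systemd unit kubelet.service (the unit name used in the restart command earlier in this thread); filter_swap_errors is a hypothetical helper that simply keeps log lines mentioning swap.

```shell
#!/bin/sh
# filter_swap_errors: read log lines on stdin and print only those that
# mention swap (case-insensitive). Exit status is non-zero if none match.
filter_swap_errors() {
    grep -i 'swap'
}

# Hypothetical usage on a node (last 200 kubelet log lines):
#   sudo journalctl -u kubelet.service --no-pager -n 200 | filter_swap_errors
```

If nothing matches, the failure probably has another cause and the full kubelet log is worth reading.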

luxius-luminus avatar Oct 13 '21 11:10 luxius-luminus

thanks, I understand.

poetryben88 avatar Oct 14 '21 02:10 poetryben88

I rebooted one of the nodes in our cluster, but the services on this node didn't resume. I checked the containers and got this:

sudo docker ps -a

CONTAINER ID   IMAGE                                    COMMAND                  CREATED       STATUS                           PORTS                    NAMES
5098da9556d9   openpai/storage-manager                  "/usr/bin/entrypoint…"   8 hours ago   Exited (137) About an hour ago                            k8s_storage-manager_storage-manager-ds-5z7vp_default_fd8f91fc-df77-4969-a1a1-a47eb3fef555_0
41e1dfecb009   mirrorgooglecontainers/pause-amd64:3.1   "/pause"                 8 hours ago   Exited (0) About an hour ago                              k8s_POD_storage-manager-ds-5z7vp_default_fd8f91fc-df77-4969-a1a1-a47eb3fef555_0
894baf6e53d4   openpai/node-exporter                    "/bin/node_exporter …"   8 hours ago   Exited (2) About an hour ago                              k8s_node-exporter_node-exporter-h92vc_default_bb11a9f2-15a1-4cb3-bf1f-6cbf69c1806e_0
9b6bfe70aa34   mirrorgooglecontainers/pause-amd64:3.1   "/pause"                 8 hours ago   Exited (0) About an hour ago                              k8s_POD_node-exporter-h92vc_default_bb11a9f2-15a1-4cb3-bf1f-6cbf69c1806e_0
8432a3d1fffd   openpai/log-manager-nginx                "/usr/local/openrest…"   8 hours ago   Exited (0) About an hour ago                              k8s_log-manager-nginx_log-manager-ds-5q2gk_default_a371ef21-b09b-4707-907c-0b1035a3ae4e_0
4cea45e035ef   openpai/log-manager-cleaner              "/sbin/tini -- /usr/…"   8 hours ago   Exited (143) About an hour ago                            k8s_log-cleaner_log-manager-ds-5q2gk_default_a371ef21-b09b-4707-907c-0b1035a3ae4e_0
3b9990b8b051   mirrorgooglecontainers/pause-amd64:3.1   "/pause"               
...

I tried to restart the cluster with ./paictl service stop && ./paictl service start. However, the situation remained the same.

When I tried to start the docker containers manually, I got an error like this:

sudo docker start  k8s_calico-node_calico-node-fpw4r_kube-system_d8279838-5f79-4a0a-8045-b97b79176bf2_7   

Error response from daemon: cannot join network of a non running container: 618600369a1b0a08048ba229a4a3aa266a911ebbec452ee28bf3d03a5ea1e8db
Error: failed to start containers: k8s_calico-node_calico-node-fpw4r_kube-system_d8279838-5f79-4a0a-8045-b97b79176bf2_7

Then I noticed the following statement earlier in this thread:

All services will start automatically. One possible situation is that you are using a swap partition. In this case, the cluster will not start automatically because k8s does not support swap.

So I tried the following:

sudo swapoff -a
sudo systemctl restart kubelet.service

Luckily, the services resumed and the node came back online. I'm writing this down in case anyone else runs into the same problem ;)

zhangxydlut avatar Oct 29 '21 11:10 zhangxydlut

You need to disable swap permanently; otherwise, any node will run into this issue again after it reboots.
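
On most Linux distributions, swap that is enabled at boot is declared in /etc/fstab, so one common way to make the change permanent is to comment out those entries. This is a hedged sketch, assuming swap is configured through fstab (not, for example, through a systemd swap unit or zram); comment_out_swap is a hypothetical helper that prints the edited file to standard output so it can be reviewed before anything is replaced.

```shell
#!/bin/sh
# comment_out_swap FILE
# Print FILE (fstab format) with every active swap entry commented out.
# An entry counts as swap when its third field (filesystem type) is "swap".
# Lines that are already comments are left unchanged.
comment_out_swap() {
    awk '
        /^[ \t]*#/   { print; next }         # keep existing comments as they are
        $3 == "swap" { print "#" $0; next }  # disable swap entries
                     { print }               # everything else is unchanged
    ' "$1"
}

# Hypothetical usage (review the diff before replacing the real file):
#   comment_out_swap /etc/fstab > /tmp/fstab.new
#   diff /etc/fstab /tmp/fstab.new
#   sudo cp /etc/fstab /etc/fstab.bak && sudo cp /tmp/fstab.new /etc/fstab
#   sudo swapoff -a
```

Writing to standard output rather than editing in place keeps a mistake in the pattern from damaging the real fstab.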

siaimes avatar Oct 29 '21 13:10 siaimes