[install error] katalyst-agent CrashLoopBackOff

Open googs1025 opened this issue 2 years ago • 1 comments

What happened?

root@VM-0-15-ubuntu:/home/ubuntu# kubectl get pods -nkatalyst-system
NAME                                   READY   STATUS             RESTARTS         AGE
katalyst-agent-4qx2t                   0/1     CrashLoopBackOff   10 (31s ago)     26m
katalyst-agent-jdl97                   0/1     CrashLoopBackOff   10 (22s ago)     26m
katalyst-agent-pwm7l                   0/1     Error              10 (5m11s ago)   26m
katalyst-controller-845ccf946b-ftxgx   1/1     Running            0                26m
katalyst-controller-845ccf946b-lm9bm   1/1     Running            0                26m
katalyst-metric-765c44bbb5-48ws6       1/1     Running            0                26m
katalyst-scheduler-5746f9bd4c-swgc4    1/1     Running            0                26m
katalyst-scheduler-5746f9bd4c-x2vct    1/1     Running            0                26m
katalyst-webhook-68fcf99cd8-26c8g      1/1     Running            0                26m
katalyst-webhook-68fcf99cd8-7fs78      1/1     Running            0                26m
root@VM-0-15-ubuntu:/home/ubuntu# kubectl logs katalyst-agent-4qx2t -nkatalyst-system
W0502 08:03:20.626350       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024/05/02 08:03:20 <nil>
I0502 08:03:20.626831       1 otel_prom_metrics_mux.go:94] [katalyst-core/pkg/metrics/metrics-pool.(*openTelemetryPrometheusMetricsEmitterPool).GetMetricsEmitter] add path /metrics to metric emitter
W0502 08:03:20.636464       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
I0502 08:03:20.636778       1 network_linux.go:80] [katalyst-core/pkg/util/machine.GetExtraNetworkInfo] namespace list: []
W0502 08:03:20.637199       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: eth0 with devPath: /sys/devices/virtual/net/eth0 which isn't pci device
W0502 08:03:20.637248       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: kube-ipvs0 with devPath: /sys/devices/virtual/net/kube-ipvs0 which isn't pci device
W0502 08:03:20.637281       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: lo with devPath: /sys/devices/virtual/net/lo which isn't pci device
W0502 08:03:20.637311       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth064d18ee with devPath: /sys/devices/virtual/net/veth064d18ee which isn't pci device
W0502 08:03:20.637339       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth06d57915 with devPath: /sys/devices/virtual/net/veth06d57915 which isn't pci device
W0502 08:03:20.637365       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth5290716c with devPath: /sys/devices/virtual/net/veth5290716c which isn't pci device
W0502 08:03:20.637396       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth6f37d282 with devPath: /sys/devices/virtual/net/veth6f37d282 which isn't pci device
W0502 08:03:20.637428       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth87922afb with devPath: /sys/devices/virtual/net/veth87922afb which isn't pci device
W0502 08:03:20.637457       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth8dccdf2e with devPath: /sys/devices/virtual/net/veth8dccdf2e which isn't pci device
I0502 08:03:20.638040       1 file.go:239] [GetUniqueLock] get lock successfully
I0502 08:03:20.638069       1 agent.go:85] initializing "katalyst-agent-reporter"
W0502 08:03:20.638121       1 manager.go:400] failed to retrieve checkpoint for "reporter_manager_checkpoint": checkpoint is not found
I0502 08:03:20.638136       1 manager.go:258] registered plugin name system-reporter-plugin
I0502 08:03:20.638153       1 manager.go:239] plugin system-reporter-plugin run success
I0502 08:03:20.638171       1 manager.go:258] registered plugin name kubelet-reporter-plugin
I0502 08:03:20.638210       1 util_unix.go:104] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/var/lib/kubelet/pod-resources/kubelet.sock" URL="unix:///var/lib/kubelet/pod-resources/kubelet.sock"
F0502 08:03:20.638341       1 kubeletplugin.go:110] run topology status adapter failed

What did you expect to happen?

All pods start normally

How can we reproduce it (as minimally and precisely as possible)?

None

Software version

Environment:

Kubernetes version (use kubectl version): 1.28
OS version: Ubuntu 22.04
Kernel version:
Cgroup driver: cgroupfs/systemd

googs1025, May 02 '24 08:05

/kind bug

googs1025, May 02 '24 08:05

An error probably occurs when running the topology status adapter. The fatal log did not say what it was, so we have added the error message to the fatal log in https://github.com/kubewharf/katalyst-core/pull/573

luomingmeng, May 07 '24 06:05

It has been solved now. If there are still problems, I will reopen it.

googs1025, May 08 '24 15:05