[BUG] GRAPE engine freezes when loading some GraphAr graphs
Describe the bug
There is a problem when loading some GraphAr graphs with GAE: the GRAPE engine spins at up to 100% CPU indefinitely and the loading process never completes. The issue is always reproducible with the same GraphAr graphs.
To Reproduce
It is assumed that k8s is already installed (with kubeadm, a single untainted master node in our case) and that GIE and GAE are installed with Helm as described in the official documentation:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
gs-system coordinator-gs-gae-6579fc47db-q5t2b 1/1 Running 0 4m8s
gs-system gs-engine-gs-gae-0 3/3 Running 0 4m4s
gs-system gs-engine-gs-gae-1 3/3 Running 0 3m58s
gs-system gs-engine-gs-gae-gs-gae-vineyard-etcd-0 1/1 Running 0 4m4s
gs-system gs-etcd0 1/1 Running 0 4m36s
gs-system gs-gie-gie-standalone-frontend-0 1/1 Running 0 4m29s
gs-system gs-gie-gie-standalone-store-0 1/1 Running 0 4m29s
gs-system gs-interactive-frontend-gs-gae-5b468dbd9b-4jnwb 1/1 Running 0 4m4s
kube-flannel kube-flannel-ds-lvfgg 1/1 Running 2 (2d18h ago) 2d19h
kube-system coredns-75568d8fdb-pzf5g 1/1 Running 0 2d19h
kube-system coredns-75568d8fdb-s66zj 1/1 Running 0 2d19h
kube-system etcd-master 1/1 Running 2 (2d18h ago) 2d19h
kube-system kube-apiserver-master 1/1 Running 2 (2d18h ago) 2d19h
kube-system kube-controller-manager-master 1/1 Running 1 (2d18h ago) 2d19h
kube-system kube-proxy-8c5ng 1/1 Running 0 2d19h
kube-system kube-scheduler-master 1/1 Running 1 (2d18h ago) 2d19h
kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready control-plane 2d19h v1.28.15
svc_rnd-grapher@master:~$ kubectl get pods gs-engine-gs-gae-0 -n gs-system -o yaml | grep volumeMounts -A 5
...
volumeMounts:
- mountPath: /dev/shm
name: host-shm
- mountPath: /tmp/data
name: data
- mountPath: /tmp/vineyard_workspace
...
Steps to reproduce the behavior:
- Place the GraphAr graph two-one.tar.gz into the data volume mount and extract it into the /tmp/data/two-one directory
- Extract test.py.tar.gz and adjust the host and port retrieval if needed
- Run test.py, for example python test.py (a minimal sketch of the connection step is shown after the ps output below)
- See "Trying to connect to <ip>:<port>", where ip and port may differ depending on your build. Nothing else happens after that; the script keeps running indefinitely
- If you run ps aux, you can also see grape_engine utilizing the CPU:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1001 2313831 99.4 0.2 1350408 67908 ? Sl 06:34 3:21 /opt/graphscope/bin/grape_engine --host 0.0.0.0 --port 56892 -v 1 --vineyard_socket /tmp/vineyard_workspace/vineyard.sock
1001 2313832 99.8 0.2 698068 57896 ? Sl 06:34 3:21 /opt/graphscope/bin/grape_engine --host 0.0.0.0 --port 56892 -v 1 --vineyard_socket /tmp/vineyard_workspace/vineyard.sock
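For reference, a minimal sketch of the connection step in test.py (the address below is a placeholder; the attached test.py is authoritative and derives the coordinator host and port from the Kubernetes services):

import graphscope

# Connect to the coordinator of the Helm-deployed cluster; the address is a
# placeholder, test.py resolves the actual host and port at runtime.
sess = graphscope.session(addr="<coordinator-ip>:<coordinator-port>")

# test.py then loads the GraphAr graph extracted to /tmp/data/two-one on the
# engine pods; with the two-one graph that loading call never returns and both
# grape_engine processes spin at ~100% CPU.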
Expected behavior
The graph is imported successfully without any freezing.
Environment:
- GraphScope version: v0.30.0
- OS: Ubuntu
- Version 24.04
- Kubernetes Version v1.28.15
- Python version: 3.11.11 (with the following dependencies: graphscope==0.29.0, graphscope-client==0.29.0, pandas==2.0.3, aiohttp, async_timeout)
Additional context
We have tried different GraphAr graphs: some of them were imported fine, some caused the error described in our previous issue, and some caused the freeze described in this issue.
The GraphAr graph two-one.tar.gz was made via graphar import -c /opt/graphar-in/import.two-one.yml (sources also attached).
We also collected thread backtraces from both gs-engine-gs-gae pods and a core dump from one of them (in our case gs-engine-gs-gae-0):
gs-engine-gs-gae-0:
kubectl exec -ti -n gs-system gs-engine-gs-gae-0 -- bash
graphscope@gs-engine-gs-gae-0:~$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
graphsc+ 1 0.0 0.0 4364 3200 ? Ss 12:59 0:00 bash -c while true; do if [ -e /tmp/grape_engine.INFO ]; then tail -f /tmp/grape_engine.INFO; fi; sleep 1; do
graphsc+ 313 0.0 0.0 2896 1792 ? Ss 13:00 0:00 /bin/sh -c cat /tmp/hosts_of_nodes | sudo tee -a /etc/hosts && ( test ! -r ./.profile || . ./.profile;
graphsc+ 322 0.0 0.1 219580 26496 ? Sl 13:00 0:00 /usr/bin/orted -mca ess env -mca ess_base_jobid 3547725824 -mca ess_base_vpid 1 -mca ess_base_num_procs 3 -mc
graphsc+ 332 99.0 0.3 1356552 78056 ? Sl 13:00 14:02 /opt/graphscope/bin/grape_engine --host 0.0.0.0 --port 56004 -v 1 --vineyard_socket /tmp/vineyard_workspace/v
graphsc+ 382 0.0 0.0 2828 1536 ? S 13:00 0:00 tail -f /tmp/grape_engine.INFO
graphsc+ 600 0.0 0.0 4628 3712 pts/1 Ss 13:02 0:00 bash
graphsc+ 10843 0.0 0.0 7068 3072 pts/1 R+ 13:16 0:00 ps aux
gdb -p 332
(gdb) thread apply all bt # see bt0.out.tar.gz
gcore 332 # see core.332.tar.gz
gs-engine-gs-gae-1:
graphscope@gs-engine-gs-gae-1:~$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
graphsc+ 1 0.0 0.0 4364 3328 ? Ss 12:59 0:00 bash -c while true; do if [ -e /tmp/grape_engine.INFO ]; then tail -f /tmp/grape_engine.INFO; fi; sleep 1; do
graphsc+ 288 0.0 0.0 2896 1664 ? Ss 13:00 0:00 /bin/sh -c cat /tmp/hosts_of_nodes | sudo tee -a /etc/hosts && ( test ! -r ./.profile || . ./.profile;
graphsc+ 297 0.0 0.1 219348 25856 ? Sl 13:00 0:00 /usr/bin/orted -mca ess env -mca ess_base_jobid 3547725824 -mca ess_base_vpid 2 -mca ess_base_num_procs 3 -mc
graphsc+ 300 99.9 0.2 698068 58384 ? Sl 13:00 24:46 /opt/graphscope/bin/grape_engine --host 0.0.0.0 --port 56004 -v 1 --vineyard_socket /tmp/vineyard_workspace/v
graphsc+ 327 0.0 0.0 2828 1536 ? S 13:00 0:00 tail -f /tmp/grape_engine.INFO
graphsc+ 4857 0.0 0.0 4628 3840 pts/1 Ss 13:22 0:00 bash
graphsc+ 12672 0.0 0.0 7068 3072 pts/1 R+ 13:25 0:00 ps aux
gdb -p 300
(gdb) thread apply all bt # see bt1.out.tar.gz
Attachments: import.two-one.yml.tar.gz, test.py.tar.gz, bt1.out.tar.gz, bt0.out.tar.gz, e.csv, v.csv, core.332.tar.gz, two-one.tar.gz
/cc @yecol @sighingnow, this issue/PR has had no activity for a long time, please help to review the status and assign people to work on it.