[BUG] GRAPE engine freezes when loading some GraphAr graphs
Describe the bug
There is a problem when loading some GraphAr graphs with GAE: the GRAPE engine spins at up to 100% CPU indefinitely and the loading process never completes. The issue is always reproducible with the same GraphAr graphs.
To Reproduce
It is assumed that k8s is already installed (with kubeadm, a single untainted master node in our case) and that GIE and GAE are installed with Helm as described in the official documentation:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
gs-system coordinator-gs-gae-6579fc47db-q5t2b 1/1 Running 0 4m8s
gs-system gs-engine-gs-gae-0 3/3 Running 0 4m4s
gs-system gs-engine-gs-gae-1 3/3 Running 0 3m58s
gs-system gs-engine-gs-gae-gs-gae-vineyard-etcd-0 1/1 Running 0 4m4s
gs-system gs-etcd0 1/1 Running 0 4m36s
gs-system gs-gie-gie-standalone-frontend-0 1/1 Running 0 4m29s
gs-system gs-gie-gie-standalone-store-0 1/1 Running 0 4m29s
gs-system gs-interactive-frontend-gs-gae-5b468dbd9b-4jnwb 1/1 Running 0 4m4s
kube-flannel kube-flannel-ds-lvfgg 1/1 Running 2 (2d18h ago) 2d19h
kube-system coredns-75568d8fdb-pzf5g 1/1 Running 0 2d19h
kube-system coredns-75568d8fdb-s66zj 1/1 Running 0 2d19h
kube-system etcd-master 1/1 Running 2 (2d18h ago) 2d19h
kube-system kube-apiserver-master 1/1 Running 2 (2d18h ago) 2d19h
kube-system kube-controller-manager-master 1/1 Running 1 (2d18h ago) 2d19h
kube-system kube-proxy-8c5ng 1/1 Running 0 2d19h
kube-system kube-scheduler-master 1/1 Running 1 (2d18h ago) 2d19h
kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready control-plane 2d19h v1.28.15
svc_rnd-grapher@master:~$ kubectl get pods gs-engine-gs-gae-0 -n gs-system -o yaml | grep volumeMounts -A 5
...
volumeMounts:
- mountPath: /dev/shm
name: host-shm
- mountPath: /tmp/data
name: data
- mountPath: /tmp/vineyard_workspace
...
Steps to reproduce the behavior:
- Place the GraphAr graph two-one.tar.gz into the data volume mount and extract it into the /tmp/data/two-one directory
- Extract test.py.tar.gz and adjust the host and port retrieval if needed
- Run test.py, for example python test.py (a minimal sketch of the connection step is shown after the ps output below)
- See "Trying to connect to <ip>:<port>", where ip and port may differ depending on your build. Nothing else happens after that; the script keeps running indefinitely
- If you run ps aux, you can also see grape_engine utilizing the CPU:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1001 2313831 99.4 0.2 1350408 67908 ? Sl 06:34 3:21 /opt/graphscope/bin/grape_engine --host 0.0.0.0 --port 56892 -v 1 --vineyard_socket /tmp/vineyard_workspace/vineyard.sock
1001 2313832 99.8 0.2 698068 57896 ? Sl 06:34 3:21 /opt/graphscope/bin/grape_engine --host 0.0.0.0 --port 56892 -v 1 --vineyard_socket /tmp/vineyard_workspace/vineyard.sock
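For reference, a minimal sketch of the connection step in test.py (the address below is a placeholder; the attached test.py is authoritative and derives the coordinator host and port from the Kubernetes services):

import graphscope

# Connect to the coordinator of the Helm-deployed cluster; the address is a
# placeholder, test.py resolves the actual host and port at runtime.
sess = graphscope.session(addr="<coordinator-ip>:<coordinator-port>")

# test.py then loads the GraphAr graph extracted to /tmp/data/two-one on the
# engine pods; with the two-one graph that loading call never returns and both
# grape_engine processes spin at ~100% CPU.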
Expected behavior
The graph is imported successfully without any freezing.
Environment:
- GraphScope version: v0.30.0
- OS: Ubuntu
- Version 24.04
- Kubernetes Version v1.28.15
- Python version: 3.11.11 (with the following dependencies: graphscope==0.29.0, graphscope-client==0.29.0, pandas==2.0.3, aiohttp, async_timeout)
Additional context
We have tried different GraphAr graphs: some of them were imported fine, some caused the error described in our previous issue, and some caused the freeze described in this issue.
The GraphAr graph two-one.tar.gz was made via graphar import -c /opt/graphar-in/import.two-one.yml (sources also attached).
We also collected thread backtraces from both gs-engine-gs-gae pods and a core dump from one of them (in our case gs-engine-gs-gae-0):
gs-engine-gs-gae-0:
kubectl exec -ti -n gs-system gs-engine-gs-gae-0 -- bash
graphscope@gs-engine-gs-gae-0:~$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
graphsc+ 1 0.0 0.0 4364 3200 ? Ss 12:59 0:00 bash -c while true; do if [ -e /tmp/grape_engine.INFO ]; then tail -f /tmp/grape_engine.INFO; fi; sleep 1; do
graphsc+ 313 0.0 0.0 2896 1792 ? Ss 13:00 0:00 /bin/sh -c cat /tmp/hosts_of_nodes | sudo tee -a /etc/hosts && ( test ! -r ./.profile || . ./.profile;
graphsc+ 322 0.0 0.1 219580 26496 ? Sl 13:00 0:00 /usr/bin/orted -mca ess env -mca ess_base_jobid 3547725824 -mca ess_base_vpid 1 -mca ess_base_num_procs 3 -mc
graphsc+ 332 99.0 0.3 1356552 78056 ? Sl 13:00 14:02 /opt/graphscope/bin/grape_engine --host 0.0.0.0 --port 56004 -v 1 --vineyard_socket /tmp/vineyard_workspace/v
graphsc+ 382 0.0 0.0 2828 1536 ? S 13:00 0:00 tail -f /tmp/grape_engine.INFO
graphsc+ 600 0.0 0.0 4628 3712 pts/1 Ss 13:02 0:00 bash
graphsc+ 10843 0.0 0.0 7068 3072 pts/1 R+ 13:16 0:00 ps aux
gdb -p 332
(gdb) thread apply all bt # see bt0.out.tar.gz
gcore 332 # see core.332.tar.gz
gs-engine-gs-gae-1:
graphscope@gs-engine-gs-gae-1:~$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
graphsc+ 1 0.0 0.0 4364 3328 ? Ss 12:59 0:00 bash -c while true; do if [ -e /tmp/grape_engine.INFO ]; then tail -f /tmp/grape_engine.INFO; fi; sleep 1; do
graphsc+ 288 0.0 0.0 2896 1664 ? Ss 13:00 0:00 /bin/sh -c cat /tmp/hosts_of_nodes | sudo tee -a /etc/hosts && ( test ! -r ./.profile || . ./.profile;
graphsc+ 297 0.0 0.1 219348 25856 ? Sl 13:00 0:00 /usr/bin/orted -mca ess env -mca ess_base_jobid 3547725824 -mca ess_base_vpid 2 -mca ess_base_num_procs 3 -mc
graphsc+ 300 99.9 0.2 698068 58384 ? Sl 13:00 24:46 /opt/graphscope/bin/grape_engine --host 0.0.0.0 --port 56004 -v 1 --vineyard_socket /tmp/vineyard_workspace/v
graphsc+ 327 0.0 0.0 2828 1536 ? S 13:00 0:00 tail -f /tmp/grape_engine.INFO
graphsc+ 4857 0.0 0.0 4628 3840 pts/1 Ss 13:22 0:00 bash
graphsc+ 12672 0.0 0.0 7068 3072 pts/1 R+ 13:25 0:00 ps aux
gdb -p 300
(gdb) thread apply all bt # see bt1.out.tar.gz
Attachments: import.two-one.yml.tar.gz, test.py.tar.gz, bt1.out.tar.gz, bt0.out.tar.gz, e.csv, v.csv, core.332.tar.gz, two-one.tar.gz
/cc @yecol @sighingnow, this issue/PR has had no activity for a long time, please help to review the status and assign people to work on it.