[BUG] v0.13.0 K8S GraphScope cluster becomes unavailable after running analytical_engine computations many times

Open dzhiwei opened this issue 3 years ago • 6 comments

Describe the bug The v0.13.0 K8S GraphScope cluster becomes unavailable after running analytical_engine computations many times (not tested on 0.14.0 yet). The edge data is 3.2G and the vertex data is 140M. The analytical engine needs about 9G of memory, and the total available memory is around 22G.

  1. I ran the computation repeatedly to test stability. The analytical engine does not release memory after the computation finishes, even if I delete the graph (to try to trigger the unloadGraph step) and close the session. After two successful runs of the algorithm, memory usage rose to 18G. The third run failed because of insufficient memory, and the session could no longer be connected. Worse, the system became unavailable after the failed run.
  2. Please log a clear error when memory is insufficient. Currently, session.load_from still returns a graph object when memory is insufficient. analytical_engine does not release memory after the computation ends and unload_graph is invoked.
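The figures above already explain why the third round fails if no memory is ever released. A minimal sketch of that arithmetic, using only the numbers reported in this issue (the function name is hypothetical and not part of GraphScope):

```python
def rounds_before_exhaustion(available_gb: float, per_run_gb: float) -> int:
    """Number of complete runs that fit if no run ever releases its memory."""
    if per_run_gb <= 0:
        raise ValueError("per_run_gb must be positive")
    return int(available_gb // per_run_gb)


# Reported in this issue: about 22G available, about 9G needed per run.
# Two runs occupy about 18G; a third would need about 27G in total.
print(rounds_before_exhaustion(22, 9))  # 2
```

This matches the observed behavior: two successful runs, then a failure on the third.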

To Reproduce Steps to reproduce the behavior: code:

import graphscope
from graphscope.framework.loader import Loader
from graphscope.framework import dag_utils
import os
from graphscope.client.session import get_default_session


def analyze(count: int = 0):
    graphscope.set_option(show_log=True)
    graphscope.set_option(log_level="DEBUG")
    #sess = graphscope.session(addr="192.168.0.xxx:xxx", cluster_type="k8s", k8s_service_type="LoadBalancer")
    sess = get_default_session()
    g = sess.g()
    prefix = "/tmp/testingdata"
    # /tmp/testingdata/0001/social_network
    try:
        vertices = {
            "person": (
                Loader(
                    os.path.join(prefix, "person_0_0.csv"), header_row=True, delimiter="|"
                ),
                [
                    "firstName",
                    "lastName",
                    "gender",
                    "birthday",
                    "creationDate",
                    "locationIP",
                    "browserUsed",
                ],
                "id",
            )
        }
        edges = {
            "knows": [
                (
                    Loader(
                        os.path.join(prefix, "person_knows_person_0_0_0000"),
                        header_row=True,
                        delimiter="|",
                    ),
                    ["creationDate"],
                    ("Person.id", "person"),
                    ("Person.id2", "person"),
                )
            ]
        }
        print(vertices, edges)
        g = sess.load_from(edges, vertices, True, generate_eid=True)

        # Project the property graph to a simple graph, since some algorithms (such as pagerank) require a simple graph.
        pg = g.project(vertices={"person": []}, edges={"knows": []})

        ret = graphscope.bfs(pg, 6)
        r1 = ret.to_numpy("r")
        print(r1)
        r3 = ret.to_vineyard_tensor("r")
        print(r3)
        r4 = ret.to_vineyard_dataframe({'id': 'v.id', 'distance': 'r'})
        print(r4)

        ret2 = ret.to_dataframe({'id': 'v.id', 'distance': 'r'}).sort_values(by="distance")
        print(ret2)

        ret.output_to_client("/tmp/testingdata/result/ldbc-bfs-result" + str(count) + ".csv",
                             selector={'id': 'v.id', 'distance': 'r'})

        print('this is the end of the program.. ')
    finally:
        print("unload graph")
        del g
        sess.close()
        print('close graphscope session. ')


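if __name__ == "__main__":
    count: int = 0
    # Run the analysis 6 times; pass the round number so each result file gets a distinct name.
    while count < 6:
        analyze(count)
        count += 1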

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • GraphScope version: [v0.13.0]
  • OS: [MacOS]
  • Version [MacOS12.3]
  • Kubernetes Version [1.22.8]

Additional context

dzhiwei avatar Jun 20 '22 04:06 dzhiwei

Thanks for opening your first issue here! Be sure to follow the issue template! And a maintainer will get back to you shortly! Please feel free to contact us on DingTalk, WeChat account(graphscope) or Slack. We are happy to answer your questions responsively.

welcome[bot] avatar Jun 20 '22 04:06 welcome[bot]

It seems the memory allocated for data loaded by Loader is not managed well. The loaded data is not released after the graph is unloaded and the session is closed.

dzhiwei avatar Jun 22 '22 01:06 dzhiwei

It seems the memory allocated for data loaded by Loader is not managed well. The loaded data is not released after the graph is unloaded and the session is closed.

We will try to reproduce it.

sighingnow avatar Jun 22 '22 01:06 sighingnow

Cannot reproduce.

Hi @dzhiwei, could you please check whether any vineyard processes are still alive after the OOM? You can find them with

ps aux | grep vineyard
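For repeated checks between rounds, the same lookup can be scripted. The following is a rough sketch that parses `ps aux` output and reports the PID and resident memory of matching processes. The helper names are hypothetical, and it assumes the usual `ps aux` column layout (PID in the 2nd column, RSS in KiB in the 6th, the command from the 11th onward), which may differ on some systems.

```python
import subprocess
from typing import List, NamedTuple


class ProcInfo(NamedTuple):
    pid: int
    rss_kib: int
    command: str


def parse_ps_aux(output: str, keyword: str) -> List[ProcInfo]:
    """Return processes whose command line contains `keyword`."""
    matches = []
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)     # the command may contain spaces
        if len(fields) < 11:
            continue
        command = fields[10]
        if keyword in command:
            matches.append(ProcInfo(int(fields[1]), int(fields[5]), command))
    return matches


def find_processes(keyword: str = "vineyard") -> List[ProcInfo]:
    """Run `ps aux` and filter its output, like `ps aux | grep vineyard`."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return parse_ps_aux(result.stdout, keyword)
```

Calling `find_processes()` after each round and comparing the `rss_kib` values would show whether a leftover process keeps holding memory.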

sighingnow avatar Jun 22 '22 08:06 sighingnow

@sighingnow No OOM error has been found yet.

  • Local macOS: data loading fails with an error indicating that no more memory can be allocated, and sess.load_from returns a weird graph object without a project attribute, so the project call then fails. I checked and the vineyard process is still there. The engine's memory usage is high, while vineyard's memory usage looks fine.
  • K8S v0.13.0: data loading fails without a root-cause error (and no OOM error either), and sess.load_from returns a weird graph object without a project attribute, so the project call then fails. I will check the vineyard container status in the pod and report back.
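Until the loader reports the failure itself, a script can at least fail fast at the loading step instead of at the later project call. A minimal sketch of such a guard; `require_loaded_graph` is a hypothetical helper, not a GraphScope API, and it only checks that the expected methods exist on whatever object was returned:

```python
def require_loaded_graph(graph, required=("project",)):
    """Raise early if the object returned by the loader lacks expected methods."""
    missing = [
        name for name in required
        if not callable(getattr(graph, name, None))
    ]
    if missing:
        raise RuntimeError(
            "graph loading appears to have failed: returned object of type "
            f"{type(graph).__name__} has no callable {', '.join(missing)}"
        )
    return graph
```

In the reproduction script this would wrap the load call, for example `g = require_loaded_graph(sess.load_from(...))`. It does not address the underlying memory problem; it only makes the failure visible where it happens.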

My guess is that the analytical engine invokes the vineyard IO adapter and, after the data is loaded into v6d, the shared_ptr to the data table may not be destructed. I am not very familiar with C++.

dzhiwei avatar Jun 24 '22 10:06 dzhiwei

I cannot reproduce the error on MacOS.

Could you please try running the script that has the memory issue (the same script that you ran on MacOS) inside our container environment registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0?

sighingnow avatar Jun 27 '22 05:06 sighingnow

@sighingnow Could you add me on DingTalk (my DingTalk ID: 15y_jt6jpg20d8)? I will share the video, data, and script to help reproduce the issue and to help check whether it is a bug in my code. Thank you.

dzhiwei avatar Aug 22 '22 01:08 dzhiwei

15y_jt6jpg20d8

You could find me via 13240327026.

Thanks!

sighingnow avatar Aug 22 '22 01:08 sighingnow