[Unitrace] Tool always aborts with an assertion error in 'UniTracer::Create' when profiling Python scripts

Open xunsongh opened this issue 1 year ago • 5 comments

I built the unitrace tool on a PVC machine with driver agama-ci-devel-hotfix-821.36, using the default configuration without MPI support, and then tried to run it on a simple Python script, but it always aborts with an assertion error in UniTracer::Create.

Here are the commands I used to run the successfully built unitrace tool:

./unitrace -h python ./simple.py
./unitrace --chrome-kernel-logging --chrome-dnn-logging --chrome-ccl-logging python ./simple.py

I also tried other options, but all of them failed with the same assertion error:

python: /home/gta/pti-gpu/tools/unitrace/src/tracer.h:50: static UniTracer* UniTracer::Create(const TraceOptions&): Assertion `status == ZE_RESULT_SUCCESS' failed.
Aborted (core dumped)

My test case is as simple as it could be:

if __name__ == '__main__':
    a = 1

Could you please help check why the unitrace tool crashes on such a simple case, which is not even related to SYCL or L0?

xunsongh avatar May 20 '24 10:05 xunsongh

Are you able to run L0 applications successfully? Most probably UniTracer::Create is failing due to an L0 call failure. The failure happens during the tool's initialization, where it interacts with L0, so it does not really matter what your app is doing :). A few things I would suggest trying:

  1. Check whether any SYCL or L0 app that exercises L0 APIs runs fine in the same environment.
  2. Try to build Unitrace in the environment where you want to run it. In the past I have seen people build the tool in one environment and then run it in a different one, which caused the tool to fail.
  3. Try to find out which L0 API is failing at the assert and collect the error number.

BTW, any chance you could try a different machine to verify the behavior?
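
One way to approach suggestions 1 and 3 from inside Python is to call the Level Zero loader directly and print the result code. The sketch below is only an illustration: it assumes the loader is discoverable as `ze_loader`, and it assumes `zeInit` is a relevant call to probe. The thread does not say which L0 call sits behind the assertion at tracer.h:50, and this does not reproduce whatever environment unitrace sets up before launching the target. The helper name `try_ze_init` is made up for this example.

```python
import ctypes
import ctypes.util


def try_ze_init(flags=0):
    """Call zeInit from the current Python process.

    Returns the integer ze_result_t value (0 is ZE_RESULT_SUCCESS),
    or None if the Level Zero loader cannot be found or loaded.
    """
    # Assumption: the loader library is installed under the name "ze_loader".
    name = ctypes.util.find_library("ze_loader")
    if name is None:
        return None
    try:
        loader = ctypes.CDLL(name)
    except OSError:
        return None
    ze_init = getattr(loader, "zeInit", None)
    if ze_init is None:
        return None
    # ze_init_flags_t is a 32-bit value; read the result as unsigned so
    # error codes such as 0x78000001 print in their usual hex form.
    ze_init.argtypes = [ctypes.c_uint32]
    ze_init.restype = ctypes.c_uint32
    return int(ze_init(flags))


if __name__ == "__main__":
    result = try_ze_init()
    if result is None:
        print("Level Zero loader not found or not loadable")
    else:
        print(f"zeInit returned 0x{result:08x}")
```

Running this with the same `python` from the same conda environment, and comparing against a C++ L0 app on the same machine, might show whether L0 initialization itself behaves differently inside the Python process.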

Sarbojit2019 avatar Jun 05 '24 08:06 Sarbojit2019

Thank you for your guidance. Here are my replies to your suggestions:

  1. I can use the unitrace tool to trace all those C++ executables; it only fails on such a simple Python case.
  2. Of course. I built, ran, and tested many cases within a clean environment set up by conda.
  3. Sorry, I don't have the knowledge to track down the failing L0 API. In gdb's backtrace the top frames showed as '??' without any useful information.

I had only one PVC machine available, on which I found this issue, and unfortunately that machine broke several days ago.

xunsongh avatar Jun 07 '24 03:06 xunsongh

Regarding your response to "Item 1": I doubt this is related to the Python app. Judging by the failure point, it appears to happen at the very beginning. Let's connect internally to look at the setup and the failure.

Sarbojit2019 avatar Jun 07 '24 03:06 Sarbojit2019

@xunsongh Please check the version of libstdc++.so in your conda env. If it is lower than 6.0.30, you need to upgrade it to at least 6.0.30.
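
A rough way to do this check from the conda env's own Python is sketched below. It assumes the usual conda layout, where the library lives under `<env prefix>/lib` and carries its full version in the file name (for example `libstdc++.so.6.0.30`). The helper names are made up for this example.

```python
import glob
import os
import re
import sys


def parse_version(path):
    """Return (6, 0, 30) for a name like 'libstdc++.so.6.0.30', else None."""
    name = os.path.basename(path)
    match = re.search(r"libstdc\+\+\.so\.(\d+)\.(\d+)\.(\d+)$", name)
    return tuple(int(part) for part in match.groups()) if match else None


def find_libstdcxx_versions(prefix=sys.prefix):
    """Map each fully versioned libstdc++ file under <prefix>/lib to its version."""
    # Assumption: the env keeps its libraries in <prefix>/lib.
    pattern = os.path.join(prefix, "lib", "libstdc++.so.*")
    versions = {}
    for path in glob.glob(pattern):
        version = parse_version(path)
        if version is not None:
            versions[path] = version
    return versions


def is_new_enough(version, minimum=(6, 0, 30)):
    """Compare version tuples; 6.0.30 is the minimum named in this thread."""
    return version >= minimum


if __name__ == "__main__":
    found = find_libstdcxx_versions()
    if not found:
        print(f"No versioned libstdc++.so found under {sys.prefix}/lib")
    for path, version in sorted(found.items()):
        status = "OK" if is_new_enough(version) else "too old"
        print(f"{path}: {'.'.join(map(str, version))} ({status})")
```

If nothing is found under the env prefix, the process is probably picking up the system libstdc++ instead, which would need to be checked separately.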

zma2 avatar Jun 24 '24 15:06 zma2