STAT for deadlock detection in ML/AI stack
Hello @lee218llnl,
I have been a happy user of STAT for a long time and have used it for deadlock detection at scale on many systems. I am now working with the AI/ML stack and wonder whether STAT would also be useful for distributed training workloads.
In particular, I am wondering about frameworks like PyTorch, where NCCL is the default backend. Do you have any experience or suggestions here?
Also, since I don't see many updates in the STAT repo, I was wondering whether there are other ongoing efforts or alternative tools being developed at LLNL (or elsewhere).
Hi @pramodk, good to hear from you again; I recognize your user name from previous queries. There is nothing explicit in STAT for AI/ML applications, but STAT just gathers stack traces from application processes, so in that sense it will work on AI/ML tasks. I will note, though, that since many of those applications are built on Python, the default STAT usage gathers native stack traces, so you end up seeing a lot of Python interpreter internals in the traces rather than the application user's Python-script perspective. That said, there are some more experimental features in STAT that can actually extract the Python traces and hide the underlying interpreter internals.

Finally, regarding NCCL, STAT can use cuda-gdb as its backend to gather stack traces, so you can actually get traces out of NVIDIA GPUs (similarly, rocgdb for AMD GPUs).
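For reference, the mechanics are the same for an AI/ML job as for any other parallel job; a session looks roughly like the sketch below. The PID and file names are placeholders, and the exact options (including the one that selects the cuda-gdb backend) vary by version, so please check `stat-cl --help` on your install.

```sh
# Attach to the running job via the PID of its launcher (srun, mpirun, etc.);
# 12345 is a placeholder PID, not a real example.
stat-cl 12345

# STAT writes the merged call-prefix trees as .dot files (by default under a
# stat_results directory); open them in the viewer:
stat-view stat_results/<run_dir>/*.dot
```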
I add a few incremental things to STAT here and there, but there indeed have not been many additions or updates. I'm open to suggestions if you have ideas, but the intent has always been to keep STAT focused and lightweight.
Hello @lee218llnl!
Somehow, I missed this notification. Sorry!
> There is nothing explicit in STAT for AI/ML applications, but STAT just gathers stack traces from application processes, so in that sense it will work on AI/ML tasks.
OK, understood. I was searching for "MPI" in the codebase and was wondering whether there is any implicit dependency on, or assumption about, MPI.
> Regarding NCCL, STAT can use cuda-gdb as its backend to gather stack traces, so you can actually get traces out of NVIDIA GPUs (similarly, rocgdb for AMD GPUs)
Thanks for the reminder! I see that's documented!
> That said, there are some more experimental features in STAT that can actually extract the Python traces and hide the underlying interpreter internals.
Is that also on the develop branch? I haven't tried or focused on the Python aspects yet, but I will give that a try.
For AI/ML, another aspect is container-based execution. In my previous work I haven't explored STAT with containers. Are there any specific considerations when using STAT in containers? Any examples or relevant resources?
Similar to AI/ML, there isn't really an MPI dependence or an MPI-specific feature set. When it comes down to it, MPI tasks are just processes, so we gather stack traces as usual. The one MPI-specific feature I can think of is in the GUI: one can crop the tree below the user's call to an MPI function. Basically, this hides the actual MPI implementation, since most STAT users are debugging their application, not MPI itself.
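As a rough illustration with made-up frame names, cropping at the MPI call turns something like the first trace below into the second:

```
main > solve > exchange_halo > MPI_Waitall > opal_progress > ...   (full trace)
main > solve > exchange_halo > MPI_Waitall                         (cropped below MPI)
```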
For the Python traces, please try the pyspy branch.
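In case a point of reference is useful: the standalone py-spy tool dumps the Python-level stack of a single process, which gives a sense of the kind of view the Python-trace extraction aims for. The PID below is a placeholder, and py-spy itself is a separate tool, not part of STAT.

```sh
# Dump the Python-level stack of one running process; 12345 is a placeholder PID.
# Shown only for comparison; py-spy is a separate tool, not part of STAT.
py-spy dump --pid 12345
```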
Regarding containers, I don't have anything specific here. I'm not sure whether you are looking for an image you can leverage or asking about container-specific features, but either way, containers aren't a focus for STAT.