
clean_stale_shared_memory duplicating the master process when called in a train.py script

Open antoinedandi opened this issue 1 year ago • 2 comments

To reproduce

Call clean_stale_shared_memory() at the beginning of a train.py script that is itself launched with composer in a distributed setup.

Expected behavior

The memory is cleaned at the beginning of training, and then training proceeds normally. [image]

What I get: [image] The process is duplicated on GPU:0 and is never destroyed.

antoinedandi avatar Apr 26 '24 13:04 antoinedandi

Hmm interesting...normally, you shouldn't need to call clean_stale_shared_memory() at the start of your training script. Is this causing issues during training for you?

snarayan21 avatar May 08 '24 20:05 snarayan21

@antoinedandi clean_stale_shared_memory() removes stale open shared memory files, but if no stale files are found, it doesn't perform any action. I'm curious if the issue is truly originating from clean_stale_shared_memory(). Do you have a reproducible script we can test?

karan6181 avatar Jun 14 '24 01:06 karan6181
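
For readers following the thread, the "removes stale files, otherwise does nothing" behavior described in the last comment can be illustrated with the Python standard library alone. This is only a minimal sketch of the idea, not the library's implementation: the helper `remove_if_stale` and the block name are hypothetical, and it assumes a POSIX system where a shared memory block persists until it is unlinked.

```python
import os
from multiprocessing import shared_memory


def remove_if_stale(name: str) -> bool:
    """Unlink the named shared memory block if it exists.

    Returns True if a block was found and removed, and False if there was
    nothing to clean (in which case no action is taken).
    """
    try:
        # Attach to an existing block; this raises if no block has this name.
        shm = shared_memory.SharedMemory(name=name, create=False)
    except FileNotFoundError:
        return False  # nothing stale under this name: no-op
    shm.close()   # drop this process's mapping
    shm.unlink()  # remove the block from the system
    return True


if __name__ == "__main__":
    # Hypothetical block name, made unique per process for the demo.
    name = f"stale_demo_{os.getpid()}"

    # Simulate a leftover: create a block, close it, but never unlink it.
    leftover = shared_memory.SharedMemory(name=name, create=True, size=16)
    leftover.close()

    print(remove_if_stale(name))  # first call finds the leftover block
    print(remove_if_stale(name))  # second call finds nothing to remove
```

The second call shows the no-op path: when no block exists under the given name, the helper returns without touching anything, which matches the behavior described above for the case where no stale files are found.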