TensorFlow profiler running into OOM issue on GPU

Open rahul-fnu opened this issue 2 years ago • 3 comments

Running the TensorFlow profiler for longer than 10 seconds results in an OOM error: the inference process crashes and the profiler returns DEADLINE_EXCEEDED. Is there any way to limit the sampling rate, or otherwise reduce the amount of information being collected, to avoid crashing the process?

Here is the code that I run: tensorflow_profiler.experimental.client("grpc://localhost:3222", "profiles", 30000)

rahul-fnu avatar Aug 10 '23 04:08 rahul-fnu

Hi TensorFlow team,

Can you help us with the above? Is there a way to sample TensorFlow profiling on GPUs? This is blocking us from collecting any traces longer than 10 s.

ndeepesh avatar Aug 11 '23 18:08 ndeepesh
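
One workaround that might be worth trying while waiting for an answer is to capture several short traces instead of a single 30 s one. The sketch below is only an illustration, not a confirmed fix: `split_duration` and `trace_in_windows` are hypothetical helper names, the 5000 ms window is an arbitrary guess, and it assumes the remote capture is done through `tf.profiler.experimental.client.trace(service_addr, logdir, duration_ms)`. The windows are captured back to back, so there will be small gaps between them.

```python
from typing import Callable, List, Optional


def split_duration(total_ms: int, window_ms: int) -> List[int]:
    """Split total_ms into consecutive windows of at most window_ms each."""
    if total_ms <= 0 or window_ms <= 0:
        raise ValueError("total_ms and window_ms must be positive")
    full, rest = divmod(total_ms, window_ms)
    return [window_ms] * full + ([rest] if rest else [])


def trace_in_windows(
    service_addr: str,
    logdir: str,
    total_ms: int,
    window_ms: int = 5000,  # assumption: short enough to stay within memory
    trace_fn: Optional[Callable[[str, str, int], None]] = None,
) -> List[int]:
    """Request one short trace per window instead of one long trace."""
    if trace_fn is None:
        # Imported lazily so the helpers above can be used without TensorFlow.
        import tensorflow as tf

        trace_fn = tf.profiler.experimental.client.trace
    windows = split_duration(total_ms, window_ms)
    for duration_ms in windows:
        # Each call blocks until the profiler service returns that window.
        trace_fn(service_addr, logdir, duration_ms)
    return windows
```

With the address from the original report, a call would look like `trace_in_windows("grpc://localhost:3222", "profiles", 30000)`, which issues six 5 s requests rather than one 30 s request.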

Have you tried doing this with Keras callbacks, using something like this:

tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=fn_args.model_run_dir,
    profile_batch=(40, 80),  # profile only batches 40 through 80
    histogram_freq=1,
    write_steps_per_second=True,
    write_graph=False,
)

And then passing the callback to model.fit?

pritamdodeja avatar Aug 27 '23 10:08 pritamdodeja

@rahul-fnu To limit the sampling rate or reduce the amount of information collected by the TensorFlow profiler, you can adjust the sampling_rate parameter of the tensorflow_profiler.experimental.client function. For example: tensorflow_profiler.experimental.client("grpc://localhost:3222", "profiles", 30000, sampling_rate=0.5, events=["compute"])

Rahulraj0308 avatar Feb 07 '24 18:02 Rahulraj0308
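
For anyone landing here later: I could not find `sampling_rate` or `events` in the documented signature of `tf.profiler.experimental.client.trace`, so treat the call above with caution. What the documented API does expose is an `options` argument that takes a `tf.profiler.experimental.ProfilerOptions`, whose tracer levels control how much detail is recorded. The sketch below is a guess at how to use it to collect less data; `reduced_detail_kwargs` and `trace_with_reduced_detail` are hypothetical helper names, the accepted value ranges reflect my reading of the docs, and whether lower levels actually avoid the OOM is untested.

```python
from typing import Dict


def reduced_detail_kwargs(
    host_tracer_level: int = 1,    # 1 = critical info only (documented default is 2)
    python_tracer_level: int = 0,  # 0 = Python tracing disabled
    device_tracer_level: int = 1,  # 1 = keep GPU tracing; 0 would disable it
) -> Dict[str, int]:
    """Validate tracer levels and return them as ProfilerOptions keyword arguments."""
    if host_tracer_level not in (1, 2, 3):
        raise ValueError("host_tracer_level must be 1, 2 or 3")
    if python_tracer_level not in (0, 1):
        raise ValueError("python_tracer_level must be 0 or 1")
    if device_tracer_level not in (0, 1):
        raise ValueError("device_tracer_level must be 0 or 1")
    return {
        "host_tracer_level": host_tracer_level,
        "python_tracer_level": python_tracer_level,
        "device_tracer_level": device_tracer_level,
    }


def trace_with_reduced_detail(service_addr: str, logdir: str, duration_ms: int) -> None:
    """Request a remote trace with less verbose host and Python tracing."""
    # Imported lazily so the validation helper works without TensorFlow installed.
    import tensorflow as tf

    options = tf.profiler.experimental.ProfilerOptions(**reduced_detail_kwargs())
    tf.profiler.experimental.client.trace(
        service_addr, logdir, duration_ms, options=options
    )
```

With the values from the original report, the call would be `trace_with_reduced_detail("grpc://localhost:3222", "profiles", 30000)`.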