Profiler Idle Time

Hi everyone, I started using the TensorFlow Profiler, which I find very useful, both with the tutorial (https://github.com/tensorflow/tensorboard/blob/master/docs/tensorboard_profiling_keras.ipynb) and with a custom model. In both cases, the idle time in TensorFlow Stats is about 90%. Is this normal? And why is "Include idle time" not the default option?

Aug 04 '20 10:08 piepor

Is the idle time in TensorFlow Stats on the host or on the device? If it is on the host, it is probably okay: it means that your model doesn't use the host much. If it is on the device, it means that your accelerator is largely unused, and you probably want to increase its utilization. "Include idle time" is not the default because many users of TensorFlow Stats want to visualize the relative timing of the actual ops. If we include the idle time, the portions for the actual ops may become too small to visualize clearly.
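A minimal sketch of the arithmetic behind that last point, in plain Python. The op names and times below are made up for illustration and are not taken from any real profile; only the 90% idle figure mirrors the question above.

```python
# Hypothetical per-op self times in microseconds (illustrative values only,
# not taken from any real profile).
op_times_us = {"MatMul": 600.0, "Conv2D": 300.0, "Relu": 100.0}
idle_us = 9000.0  # assumed idle time: 90% of a 10,000 us window


def percentages(times_us, idle_us=None):
    """Return each entry's share of the total time, in percent."""
    entries = dict(times_us)
    if idle_us is not None:
        entries["IDLE"] = idle_us
    total = sum(entries.values())
    return {name: 100.0 * t / total for name, t in entries.items()}


without_idle = percentages(op_times_us)        # shares of op time only
with_idle = percentages(op_times_us, idle_us)  # shares of total wall time

for name in op_times_us:
    print(f"{name:8s} {without_idle[name]:5.1f}% of op time, "
          f"{with_idle[name]:5.1f}% of wall time")
print(f"{'IDLE':8s} {with_idle['IDLE']:5.1f}% of wall time")
```

With these assumed numbers, an op that accounts for 60% of the op time shrinks to 6% once idle time is included, which is why the relative comparison between ops is harder to read in that view.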

Aug 04 '20 16:08 ckluk-github

Thanks for the quick response. It was on both sides: the device and the host. Unfortunately I can't provide logs right now, but I will as soon as possible, hoping that you can suggest how to increase GPU utilization. I have another question: is there a guide, conference talk, course, or anything else where I can learn to interpret the logs and the trace view of the profiler? That way I can try to optimize my code while asking here as little as possible.

Aug 05 '20 14:08 piepor

We are working on the guide; hopefully it will be available soon. Thanks -ck

Aug 05 '20 17:08 ckluk

As for the guide, you can look at the current one (https://www.tensorflow.org/guide/profiler#profiler_tools). We are working on a more detailed one with examples.

For your issue, what is the step-time breakdown shown on the Overview Page?

Aug 05 '20 17:08 ckluk-github

We are working on the guide, hopefully to be available soon. Thanks -ck

Ok great!

[Image: step-time]

This is my full profile log: profile_logs.zip. I think the problem is that my model is very small, so most of the time is spent launching kernels. This is also supported by the fact that training is faster when using only the CPU. Is there a way to use the GPU effectively? Or, for this model, do I have to give up and use only the CPU? Thank you very much.
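A rough back-of-the-envelope model of this reasoning, in plain Python. It assumes each kernel pays a fixed launch cost plus its own compute time and ignores any overlap between launching and executing; the function names and all numbers are hypothetical and were not measured from the attached profile.

```python
def step_time_us(num_kernels, compute_us_per_kernel, launch_overhead_us):
    """Simplified step time: every kernel pays a fixed launch cost plus compute."""
    return num_kernels * (launch_overhead_us + compute_us_per_kernel)


def launch_fraction(num_kernels, compute_us_per_kernel, launch_overhead_us):
    """Fraction of the modelled step time spent on launch overhead."""
    total = step_time_us(num_kernels, compute_us_per_kernel, launch_overhead_us)
    return num_kernels * launch_overhead_us / total


# Hypothetical values: 10 us launch cost, 100 kernels per step.
small_model = launch_fraction(100, compute_us_per_kernel=2.0, launch_overhead_us=10.0)
large_model = launch_fraction(100, compute_us_per_kernel=190.0, launch_overhead_us=10.0)

print(f"small kernels: {small_model:.1%} of step time is launch overhead")
print(f"large kernels: {large_model:.1%} of step time is launch overhead")
```

Under these assumptions, the fixed launch cost dominates when each kernel does very little work and becomes minor when each kernel does more work (for example with a larger batch size), which is consistent with a tiny model training faster on the CPU.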

Aug 17 '20 17:08 piepor

@ckluk any more advice here? I am facing the same issue. The profiler tool is great, but it is very hard to optimise the kernel launch time. Are there any more guides in this area?

Jan 24 '21 16:01 ydennisy

@ckluk-github hello, I am training the model on the CPU and the idle time is 97.5%. Is this normal? Thanks.

Aug 05 '21 06:08 siwang2011

@ckluk-github, I am repeating the question from @siwang2011: if we are using the CPU for training, is an idle time of ~90% normal?

Jul 05 '22 23:07 nithish08