Use `.item()` to collect results and losses-as-results only at an epoch's end
Calling `.item()` to store a result inside a routine call forces the GPU to synchronize, because the lazily evaluated tensor must be materialized as a Python number. This is suboptimal: kernel scheduling (CPU work) and kernel execution (GPU work) should run as parallel pipelines as much as possible, and each premature synchronization stalls that pipeline.
On the other hand, we do need _all_ epoch results at the end of an epoch for visualization purposes.
As @obilaniu has noted elsewhere, it's better to use `.detach()` to store results within a training step, and then convert the results (and losses-as-results) to Python/NumPy values at the moment they are actually needed, i.e. at the end of the epoch.
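To make the contrast concrete, here is a minimal sketch of the pattern (the `train_epoch` helper and its arguments are illustrative, not part of the codebase): per-step losses are kept as detached tensors, and the single CPU/GPU synchronization happens once, when the epoch's values are gathered.

```python
import torch

def train_epoch(model, optimizer, loader, loss_fn):
    step_losses = []
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        # .detach() drops the autograd graph but does NOT force a sync;
        # loss.item() here would block until the queued kernels finish.
        step_losses.append(loss.detach())
    # One synchronization point per epoch: stack, move to CPU, convert.
    return torch.stack(step_losses).cpu().numpy()
```

On CPU-only tensors the timing difference vanishes, but the structure is the same: defer the tensor-to-Python conversion to the point where the numbers are actually consumed.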
I see that fetching is already done properly at the end of the epoch via the `_lib.utils.convert_to_numpy` function.
So using `.detach()` instead of `.item()` in the model plugin implementation should be sufficient, I think.
Interesting, I wasn't aware of this distinction. Is there a backend solution that can manage this, or is it up to the user when they design routines?
I have implemented a solution with a function `nested_detach`, applied to the per-routine isolated results. I will open a PR soon. The user can now provide either a plain float, a NumPy ndarray, or a torch tensor (detached or not).
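Since the PR isn't linked here, the following is only a hypothetical sketch of what a `nested_detach` helper could look like: it recurses through common containers, detaches anything tensor-like, and passes plain floats and NumPy arrays through unchanged (matching the accepted input types listed above).

```python
def nested_detach(obj):
    """Recursively detach tensors inside nested containers.

    Hypothetical sketch of the helper mentioned above. Plain floats,
    ints, and NumPy ndarrays pass through unchanged; anything exposing
    a .detach() method (e.g. a torch.Tensor, attached to the autograd
    graph or not) is detached.
    """
    if isinstance(obj, dict):
        return {k: nested_detach(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(nested_detach(v) for v in obj)
    if hasattr(obj, "detach"):  # duck-typed tensor check
        return obj.detach()
    return obj  # float, int, np.ndarray, ...
```

Duck-typing on `.detach()` keeps the helper importable without a hard torch dependency; the actual implementation in the PR may of course differ.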