
Help needed: how to debug my model

warthecatalyst opened this issue 3 years ago · 2 comments

Description

I'm currently building a YOLOv3 model in DJL, using the PyTorch engine to test the model and its loss function. During an actual run, the output of my model becomes NaN after 4 epochs. However, when I step through it line by line in debug mode, it runs fine for 30+ epochs. I've already tried the strategies below:

1. It may come from exploding or vanishing gradients, but I've added BatchNorm layers to my model.
2. It may come from log(x) where x can be <= 0, so I've added a small constant EPSILON to the argument of log so that it should not be <= 0 (see the sketch just below).
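
For reference, a minimal sketch of that epsilon guard using DJL's NDArray API (the EPSILON value and the safeLog helper are illustrative, not the actual loss code):

```java
import ai.djl.ndarray.NDArray;

/** Sketch of guarding log() against non-positive inputs to avoid NaN. */
public final class SafeLog {

    // Illustrative constant; the value used in the real loss may differ.
    private static final float EPSILON = 1e-7f;

    static NDArray safeLog(NDArray x) {
        // Adding EPSILON, as described above; note this only helps when
        // x >= 0. x.maximum(EPSILON) is an alternative that also guards
        // against negative inputs.
        return x.add(EPSILON).log();
    }
}
```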

warthecatalyst · Sep 11 '22 13:09

If NaN happens only in the middle of training, not at the beginning, it is a little more troublesome to debug. In that case, I would print the gradients of the intermediate variables (i.e., the parameters) and see when the values start to become abnormal. (It looks like you have already started this and still couldn't find the bug. In general, though, the way to monitor the training of a network is to check the intermediate parameter values, the output of each layer, and the parameters' gradients; see the sketch below.)
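
A sketch of what that monitoring could look like in DJL with the PyTorch engine (the method name and print format are assumptions; getGradient() only works for parameters that require gradients and have been through a backward pass under a GradientCollector):

```java
import ai.djl.Model;
import ai.djl.ndarray.NDArray;
import ai.djl.nn.Parameter;
import ai.djl.util.Pair;

/** Call after the backward pass: report parameters whose value or gradient went non-finite. */
static void reportBadParameters(Model model, int epoch, int batch) {
    for (Pair<String, Parameter> pair : model.getBlock().getParameters()) {
        NDArray value = pair.getValue().getArray();
        NDArray grad = value.getGradient();
        boolean badValue =
                value.isNaN().any().getBoolean() || value.isInfinite().any().getBoolean();
        boolean badGrad =
                grad.isNaN().any().getBoolean() || grad.isInfinite().any().getBoolean();
        if (badValue || badGrad) {
            System.out.printf(
                    "epoch %d, batch %d: parameter %s went non-finite (value: %b, grad: %b)%n",
                    epoch, batch, pair.getKey(), badValue, badGrad);
        }
    }
}
```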

If NaN or abnormal values already appear in the first epoch, it is easier to debug. Just simplify the network down to a version that works, then add pieces back bit by bit and see which addition causes the bug.
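
For example, one might start from a deliberately tiny stand-in network like the hypothetical one below and re-add the real YOLOv3 blocks (and loss terms) one at a time until the NaN reappears:

```java
import ai.djl.ndarray.types.Shape;
import ai.djl.nn.Activation;
import ai.djl.nn.SequentialBlock;
import ai.djl.nn.convolutional.Conv2d;
import ai.djl.nn.norm.BatchNorm;

/**
 * A minimal conv/BatchNorm/ReLU stack: if this trains without NaN,
 * grow it back toward the full model one block at a time.
 */
static SequentialBlock minimalNet() {
    return new SequentialBlock()
            .add(Conv2d.builder().setKernelShape(new Shape(3, 3)).setFilters(16).build())
            .add(BatchNorm.builder().build())
            .add(Activation.reluBlock());
}
```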

KexinFeng · Sep 18 '22 06:09

thanks for the advice :D


warthecatalyst · Sep 18 '22 08:09