about "NaNs in the encoded observations"

Open ccconquer opened this issue 1 year ago • 3 comments

Thanks for your contribution! How can I fix this problem? Can you give me some advice? [image]

ccconquer avatar Sep 30 '24 02:09 ccconquer

By the way, will you upload the validation part, or how can I validate?

ccconquer avatar Sep 30 '24 02:09 ccconquer

First, track down where the NaNs originate. Either the input observations contain some NaNs, or the weights became NaN during training due to instability or NaNs in the loss function.
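A minimal sketch of that first check, assuming the batch is a dict of tensors and the model is a regular `nn.Module`. The batch keys and the toy model below are placeholders, not SceneInformer names. Since NaNs in the raw dataset are intentional flags here, the interesting cases are NaNs that survive past the masking step or that show up in the parameters.

```python
import torch
from torch import nn


def report_nans(batch: dict, model: nn.Module) -> dict:
    """Return the names of batch tensors and model parameters that contain NaNs."""
    bad_inputs = [
        name for name, value in batch.items()
        if torch.is_tensor(value)
        and value.is_floating_point()
        and torch.isnan(value).any()
    ]
    bad_params = [
        name for name, param in model.named_parameters()
        if torch.isnan(param).any()
    ]
    return {"inputs": bad_inputs, "params": bad_params}


# Toy usage: keys and model are hypothetical stand-ins.
model = nn.Linear(4, 2)
batch = {
    "observed_trajectories": torch.tensor([[1.0, 2.0, float("nan"), 4.0]]),
    "ids": torch.tensor([7]),
}
report = report_nans(batch, model)
print(report)
```

Calling something like this on the tensors right before they enter the encoder, and on the model after each optimizer step, narrows down whether the data path or the weights went bad first. `torch.autograd.set_detect_anomaly(True)` can additionally point at the backward operation that produced a NaN, at the cost of slower training.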

I’ve used NaNs in the dataset to flag any invalid/illegal observations (e.g., far-away trajectories, occlusions) so that there is no leakage. After the mask is generated:

https://github.com/sisl/SceneInformer/blob/efce1976e939b08eb4608f4eb679a1179926e4e5/sceneinformer/model/encoder.py#L57

They should all be set to 0 to prevent any PyTorch errors: https://github.com/sisl/SceneInformer/blob/efce1976e939b08eb4608f4eb679a1179926e4e5/sceneinformer/model/encoder.py#L72

The same logic applies to polylines, so this issue shouldn't occur unless something has been commented out.
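The mask-then-zero pattern described above can be sketched as follows. This is an illustration with a made-up tensor layout `(agents, timesteps, features)`, not a copy of the code in `encoder.py`.

```python
import torch

# Hypothetical observations: NaN marks an invalid step.
obs = torch.tensor([
    [[1.0, 2.0], [float("nan"), float("nan")]],
    [[3.0, 4.0], [5.0, 6.0]],
])

# 1) Build the validity mask while the NaN flags are still present.
valid = ~torch.isnan(obs).any(dim=-1)  # shape (agents, timesteps)

# 2) Replace the NaNs with 0 so that downstream layers never see them.
obs_clean = torch.where(torch.isnan(obs), torch.zeros_like(obs), obs)

# 3) Sanity check before the tensor enters the network.
assert not torch.isnan(obs_clean).any()
```

The order matters: once the values are zeroed, the information about which steps were invalid only lives in `valid`, so the mask has to be computed first.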

A similar approach is used for the loss function (sceneinformer/model/loss.py):

https://github.com/sisl/SceneInformer/blob/efce1976e939b08eb4608f4eb679a1179926e4e5/sceneinformer/model/loss.py#L23

and decoder:

https://github.com/sisl/SceneInformer/blob/efce1976e939b08eb4608f4eb679a1179926e4e5/sceneinformer/model/decoder.py#L43

Check them as well.
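One pitfall worth checking in any masked loss: multiplying by a 0/1 mask does not remove NaNs, because `NaN * 0` is still `NaN`. The sketch below uses toy tensors and a plain squared error (an assumption for illustration, not the loss in `loss.py`) to show the difference between masking by multiplication and selecting the valid entries first.

```python
import torch

pred = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
target = torch.tensor([[1.5, 2.0], [float("nan"), float("nan")]])

valid = ~torch.isnan(target).any(dim=-1)  # one flag per row

# Multiplying by the mask is not enough: the NaN rows still poison the sum.
naive = (((pred - target) ** 2) * valid.unsqueeze(-1)).sum()

# Selecting valid rows first keeps NaNs out of the loss and the gradients.
loss = ((pred[valid] - target[valid]) ** 2).mean()
loss.backward()

print(naive.item(), loss.item(), pred.grad)
```

Here `naive` is NaN, while `loss` is finite and the gradient for the invalid row is exactly zero.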

If none of the above applies, then it is most likely caused by training instability, in which case typical solutions should be applied, such as lowering the learning rate, clipping gradients, or increasing precision.
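For reference, a generic PyTorch sketch of two of these remedies: a reduced learning rate and gradient-norm clipping. The model, optimizer choice, and numeric values are illustrative assumptions and do not come from the SceneInformer config.

```python
import torch
from torch import nn

model = nn.Linear(8, 2)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative, lowered learning rate

x, y = torch.randn(16, 8), torch.randn(16, 2)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()

# Rescale gradients so their global L2 norm is at most max_norm (value is illustrative).
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Skip the update entirely if the gradients are already non-finite.
if torch.isfinite(total_norm):
    optimizer.step()
```

`clip_grad_norm_` returns the norm measured before clipping, so logging it per step is a cheap way to see whether the gradients blow up shortly before the NaNs appear.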

I’ve included the hyperparameters in the config file for easy adjustment: https://github.com/sisl/SceneInformer/blob/efce1976e939b08eb4608f4eb679a1179926e4e5/configs/scene_informer.yaml#L113-L122

Hopefully, that's helpful.

BenQLange avatar Sep 30 '24 20:09 BenQLange

> By the way, will you upload the validation part, or how can I validate?

I'll try to find the exact script I've used.

The validation script just compares the FDE/ADE of the generated trajectories and the anchor occupancy (as in the loss function).
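Until the original script is available, ADE/FDE can be approximated along these lines. This is a sketch under stated assumptions, not the author's validation code: it handles a single predicted trajectory per agent, 2D positions, and ground truth in which invalid steps are marked with NaN; `ade_fde` is a hypothetical helper name.

```python
import torch


def ade_fde(pred: torch.Tensor, gt: torch.Tensor):
    """pred, gt: (agents, timesteps, 2). NaN ground-truth steps are ignored."""
    valid = ~torch.isnan(gt).any(dim=-1)  # (A, T)
    gt_clean = torch.where(torch.isnan(gt), torch.zeros_like(gt), gt)
    dist = torch.linalg.norm(pred - gt_clean, dim=-1)  # (A, T)

    # ADE: mean displacement over all valid steps.
    ade = dist[valid].mean()

    # FDE: displacement at each agent's last valid step.
    num_agents, num_steps = valid.shape
    last_idx = (valid * torch.arange(num_steps)).argmax(dim=1)
    has_valid = valid.any(dim=1)
    fde = dist[torch.arange(num_agents), last_idx][has_valid].mean()
    return ade.item(), fde.item()


nan = float("nan")
gt = torch.tensor([
    [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.0, 3.0], [nan, nan]],
])
pred = torch.tensor([
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
])
ade, fde = ade_fde(pred, gt)
print(ade, fde)
```

For multi-modal predictions, a common variant takes the minimum over modes per agent before averaging; whether the original script does that is not stated in this thread.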

There is also a visualization script that you can use to get a sense of whether anything reasonable is happening and to plot some random samples from the training/validation splits during training.

BenQLange avatar Sep 30 '24 20:09 BenQLange