
Found NaN, decreased lg_loss_scale

Open tiangexiang opened this issue 1 year ago • 5 comments

Great work! I am trying to reproduce your results on ShapeNet Chair and keep encountering the warning `Found NaN, decreased lg_loss_scale to ...` throughout the entire training process. Is this normal?

Pasting the latest training log here for your reference: [screenshot: Screenshot 2024-08-22 at 9 43 40 AM]

tiangexiang commented Aug 22 '24 16:08

Hi, it is normal to encounter this warning. The training will proceed normally when lg_loss_scale is greater than 0.
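For context, the warning comes from the dynamic loss scaling used in mixed-precision training: the loss is scaled by 2**lg_loss_scale before backpropagation, and whenever the resulting gradients overflow to NaN the update is skipped and the scale is lowered. The sketch below only illustrates that general mechanism and is not a verbatim copy of this repository's code.

```python
import math

# Illustrative sketch of dynamic loss scaling (placeholder names, not the
# repository's exact code). Gradients come from a loss multiplied by
# 2**lg_loss_scale; if they overflow to NaN/inf, the step is skipped and the
# scale is reduced, which is what prints the warning in question.
def apply_step(grads, apply_update, lg_loss_scale, growth=1e-3):
    if any(not math.isfinite(g) for g in grads):
        lg_loss_scale -= 1
        print(f"Found NaN, decreased lg_loss_scale to {lg_loss_scale}")
        return lg_loss_scale  # skip this update entirely
    apply_update([g / (2 ** lg_loss_scale) for g in grads])  # unscale, then step
    return lg_loss_scale + growth  # slowly grow the scale back between overflows
```

Occasional decreases are expected; training proceeds normally as long as lg_loss_scale recovers and stays above 0.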

GaussianCube commented Aug 27 '24 01:08

@GaussianCube Thanks for the reply! I have a few more questions regarding the training parameters.

  1. In the training script you provided (https://github.com/GaussianCube/GaussianCube?tab=readme-ov-file#unconditional-diffusion-training-on-shapenet-car-or-shapenet-chair), start_idx and end_idx are set such that only 100 objects are used for training. Is this intended?
  2. When training from scratch, the photometric loss should not be enabled until step 50,000. However, according to the training log you posted here (https://github.com/GaussianCube/GaussianCube/issues/6#issuecomment-2172789100), both pixel_loss and vgg_loss are present from step 0.
  3. During my training, I observed much higher pixel_loss (~0.16) and vgg_loss (~1.1) at step 100,000. Does this mean something is wrong with the training, or should the loss weights be lower?

tiangexiang commented Aug 27 '24 17:08

Hi @tiangexiang,

Thanks for your questions and observations.

  1. The start_idx and end_idx values in the script were simply set for demonstration purposes. You can certainly adjust these values to include more or even all available data in your training set.
  2. Regarding the photometric loss, we have indeed found in recent experiments that enabling it only after 50,000 steps yields more stable training. We have updated our script to reflect this finding (a minimal sketch of this gating is shown after this list).
  3. Regarding your higher pixel_loss and vgg_loss at step 100,000, it's hard to definitively say if there's an issue without more information. As an initial approach, you might consider training on a smaller dataset (such as the 100 objects mentioned earlier) for a longer period of time. This could help you determine if the photometric loss will decrease as expected over time. Additionally, you can inspect the images saved in logdir/train_images to get a better sense of how your model is performing during training.
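For point 2, the gating could look roughly like the sketch below; the constant and variable names here are placeholders rather than the repository's exact code.

```python
# Minimal sketch (placeholder names): optimize only the diffusion loss for the
# first 50,000 steps, then add the photometric terms (pixel_loss, vgg_loss).
PHOTOMETRIC_START_STEP = 50_000

def combined_loss(step, diffusion_loss, pixel_loss, vgg_loss,
                  pixel_weight=1.0, vgg_weight=1.0):
    loss = diffusion_loss
    if step >= PHOTOMETRIC_START_STEP:
        loss = loss + pixel_weight * pixel_loss + vgg_weight * vgg_loss
    return loss
```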

Please let me know if you have any other questions or if there's anything else I can assist with.

GaussianCube commented Aug 28 '24 01:08

Thanks again for your kind reply. I have trained the model on ShapeNet Chair with exactly the config you provided, but I am still getting noisy renderings even after ~300K training steps (see the two rendering examples below): [rank_00_render_000003_cam_07] [rank_00_render_000001_cam_51] Is this expected, or is something wrong during training?

tiangexiang commented Aug 30 '24 05:08

The rendering results do not look as expected. I would first check that the ground-truth GaussianCube (micro in https://github.com/GaussianCube/GaussianCube/blob/main/train.py#L258) produces a reasonable rendering, and then make sure the bound used for fitting and for diffusion is the same. I will double-check the training code as well.
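As a concrete starting point for that check, something along these lines could be run before any diffusion step; `render_fn`, `camera`, and the bound arguments are placeholders for whatever the training loop already uses, not the repository's exact API.

```python
import torchvision.utils as vutils

# Hypothetical sanity check (placeholder names, not the repository's API):
# render the ground-truth GaussianCube batch (`micro` in train.py#L258)
# directly and save the result, and confirm the bound used for fitting
# matches the one used for diffusion.
def check_ground_truth(micro, render_fn, camera, fitting_bound, diffusion_bound):
    assert fitting_bound == diffusion_bound, (
        f"bound mismatch: fitting={fitting_bound} vs diffusion={diffusion_bound}"
    )
    image = render_fn(micro, camera)  # should already look like a clean chair
    vutils.save_image(image, "gt_render_check.png")
```

If this ground-truth render already looks noisy, the problem is likely in the fitted data or the bound rather than in the diffusion training itself.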

GaussianCube commented Sep 02 '24 09:09