stablediffusion [stable-diffusion-x4-upscaler] Encoding a 512x512 image to latent space with the pretrained VAE returns NaN, even though the image is normalized to [-1, 1]

I have downloaded the pretrained stable-diffusion-x4-upscaler model from https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler

I am trying to fine-tune the upscaler model on my own data. However, when I encode a 512x512 image into the 128x128 latent space using the pretrained VAE weights, I get a tensor of shape [b, 4, 128, 128] containing NaN.

I have traced the VAE forward function. Following the computation through the layers, I find that the values quickly become very large and then overflow.
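For anyone who wants to reproduce the tracing, here is a minimal sketch of one way to find the first layer whose output stops being finite, using PyTorch forward hooks. The helper `find_first_nonfinite`, the `Scale` module, and the toy model are hypothetical names made up for this illustration; they are not part of the upscaler code or of diffusers.

```python
import torch
import torch.nn as nn


def find_first_nonfinite(model: nn.Module, run_forward):
    """Return the name of the first leaf module whose output has inf/NaN, else None.

    `run_forward` is a zero-argument callable that runs the forward pass.
    """
    first_bad = []  # filled by the hooks; only the first offender is kept
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            if first_bad or not isinstance(output, torch.Tensor):
                return
            if not torch.isfinite(output).all():
                first_bad.append(name)
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # hook leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            run_forward()
    finally:
        for handle in handles:
            handle.remove()  # always clean up the hooks
    return first_bad[0] if first_bad else None


class Scale(nn.Module):
    """Toy layer (hypothetical) that multiplies its input by a constant."""

    def __init__(self, factor: float):
        super().__init__()
        self.factor = factor

    def forward(self, x):
        return x * self.factor


# Toy stand-in for a network whose activations blow up: 1e30 * 1e30 exceeds float32 range.
toy = nn.Sequential(nn.Identity(), Scale(1e30), Scale(1e30))
culprit = find_first_nonfinite(toy, lambda: toy(torch.ones(1, 4)))
print(culprit)  # prints "2", the second Scale layer
```

With a loaded VAE, the same helper could presumably be called as `find_first_nonfinite(vae, lambda: vae.encode(images))`, where `vae` and `images` stand for the model and the normalized input batch.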

Since there is no fine-tuning script for the x4-upscaler model, I am using the Stable Diffusion fine-tuning script at the following link, modified to load my own dataset: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py
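One possibility worth ruling out, stated here as an assumption and not a confirmed diagnosis: if the script is run with fp16 mixed precision, the VAE may be executing in half precision, where the largest finite value is 65504. The sketch below uses only the Python standard library to show how an intermediate value above that limit becomes inf and how a later operation on it yields NaN. The helper `to_float16` and the sample numbers are made up for illustration.

```python
import math
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_float16(value: float) -> float:
    """Round a Python float to float16, saturating to +/-inf on overflow."""
    if math.isnan(value) or math.isinf(value):
        return value
    try:
        # The "e" format code is IEEE 754 binary16 (half precision).
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        # struct refuses values too large for float16; hardware would produce inf.
        return math.copysign(math.inf, value)


activation = to_float16(300.0)                 # representable exactly in float16
squared = to_float16(activation * activation)  # 90000 > 65504, so this is inf
difference = squared - squared                 # inf - inf is NaN

print(activation, squared, difference)
```

If this turns out to be the cause, keeping the VAE weights and its input in float32 for the encoding step would be the first thing to try.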

Is there any solution for this error?