Question about settings and results.
Thank you for your implementation!
However, I have some questions about the settings and results.
I used your pretrained encoder.pt and weight_offsets.pt on CelebA-HQ and FFHQ.
I had to use the bitsandbytes package to save GPU memory: even after reducing the batch size to 2, a 24GB RTX 3090 still ran out of memory, so I appended --gradient_checkpointing and --use_8bit_adam to the command.
I use this image as input, with train_batch_size=16 and gradient_accumulation_steps=1, i.e. the default settings.
CUDA_VISIBLE_DEVICES=1 accelerate launch tuning_e4t.py \
--pretrained_model_name_or_path="ckpt" \
--prompt_template="a photo of {placeholder_token}" \
--reg_lambda=0.1 \
--output_dir="fine_tuned_test2" \
--train_image_path="input/1439.jpg" \
--resolution=512 \
--train_batch_size=16 \
--gradient_accumulation_steps 1 \
--learning_rate=1e-6 --scale_lr \
--max_train_steps=30 \
--mixed_precision="fp16" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam
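For context on what these flags imply for the optimizer step, here is a minimal sketch of how the effective batch size and the --scale_lr adjustment are typically computed (assumption: --scale_lr multiplies the base learning rate by the effective batch size, as in the diffusers training scripts; variable names are illustrative):

```python
# Sketch of the effective batch size and scaled learning rate implied
# by the flags above (assumes the diffusers-style --scale_lr behavior).
base_lr = 1e-6                     # --learning_rate
train_batch_size = 16              # --train_batch_size
gradient_accumulation_steps = 1    # --gradient_accumulation_steps
num_processes = 1                  # single GPU via CUDA_VISIBLE_DEVICES=1

effective_batch = train_batch_size * gradient_accumulation_steps * num_processes
scaled_lr = base_lr * effective_batch

print(effective_batch, scaled_lr)  # 16 1.6e-05
```

So raising gradient_accumulation_steps to 2 doubles the effective batch size (and, with --scale_lr, the learning rate as well), which changes the training dynamics rather than just saving memory.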
Then I use the prompts "*s on the beach" and "*s in a Santa hat"; the command is shown below:
python inference.py \
--pretrained_model_name_or_path "fine_tuned_test2" \
--prompt "*s on the beach" \
--num_images_per_prompt 4 \
--scheduler_type "ddim" \
--image_path_or_url "input/1439.jpg" \
--num_inference_steps 50 \
--guidance_scale 7.5
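As a sanity check on the sampler settings, here is a small sketch of how 50 DDIM inference steps are usually spaced over the 1000 training timesteps (assumption: the common leading-stride spacing; the repo's actual scheduler configuration may differ):

```python
# Sketch: DDIM selects a strided subset of the training timesteps.
# With 1000 training steps and --num_inference_steps 50, the stride is 20.
num_train_timesteps = 1000
num_inference_steps = 50

stride = num_train_timesteps // num_inference_steps
timesteps = list(range(0, num_train_timesteps, stride))[::-1]  # descending

print(len(timesteps), timesteps[0], timesteps[-1])  # 50 980 0
```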
I get poor results; the first row corresponds to "*s on the beach" and the second row to "*s in a Santa hat":
Then I changed to gradient_accumulation_steps=2 with train_batch_size=16:
Why does the output either copy the input image or produce a messy image? It seems that the semantics of the prompt are not generated at all. Are the results in the paper actually difficult to reproduce, or is there something wrong with my settings?