Train GLIGEN in diffusers?
Thanks for your great work!
Now I know how to run GLIGEN inference with diffusers (https://github.com/gligen/diffusers/tree/gligen/examples/gligen). But how can I train GLIGEN with diffusers, like ControlNet (https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet.py)?
Thanks again.
We don't have the bandwidth to work on training scripts for GLIGEN, but we can open it up to the community.
Hi @sayakpaul, I have tried the code at the URL above:

```python
boxes = [[0.4, 0.2, 1.0, 0.8], [0.0, 1.0, 0.0, 1.0]]  # Set `[0.0, 1.0, 0.0, 1.0]` for the style
```

The bounding boxes should be in the format `[xmin, ymin, xmax, ymax]`, so I am confused about this point. I think the right box for the style may be `[0, 0, 1, 1]`.
Cc: @tuanh123789 could you help?
Ok I'll check
@Hzzone In the original GLIGEN repo, the author uses `[xmin, ymin, xmax, ymax]` (i.e. `[x0, y0, x1, y1]`). When using style, they pass `[0.0, 1.0, 0.0, 1.0]` as the reference-image location, so the GLIGEN implementation in Diffusers matches the original.
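To make the box convention concrete, here is a minimal sketch. The helper name `xywh_to_xyxy` and the 512-pixel canvas size are my own assumptions for illustration, not part of the Diffusers API; `STYLE_BOX` is just the placeholder value from the original repo discussed above.

```python
# GLIGEN expects normalized [xmin, ymin, xmax, ymax] boxes. The "style"
# entry is the special placeholder [0.0, 1.0, 0.0, 1.0] from the original repo.
STYLE_BOX = [0.0, 1.0, 0.0, 1.0]

def xywh_to_xyxy(box, image_size=512):
    """Convert a pixel-space [x, y, w, h] box to normalized [xmin, ymin, xmax, ymax]."""
    x, y, w, h = box
    return [x / image_size, y / image_size, (x + w) / image_size, (y + h) / image_size]

print(xywh_to_xyxy([21, 281, 211, 159]))
```

Note that `STYLE_BOX` is not a geometrically valid xyxy box (xmax < xmin), which is exactly why it reads as a sentinel value rather than a real region.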
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @sayakpaul I have successfully trained GLIGEN like ControlNet. Could I make a contribution with respect to this issue?
Thanks for your interest and for working on this. Out of curiosity, could we see some results you're getting with your trained model?
In any case, feel free to open a PR adding a training script to https://github.com/huggingface/diffusers/tree/main/examples/research_projects/.
I trained the model on the COCO dataset for 100k iterations with a batch size of 64, using GroundingDINO and BLIP-2 to label instances. Prompt:
```python
import numpy as np

prompt = 'A realistic image of landscape scene depicting a green car parking on the left of a blue truck, with a red air balloon and a bird in the sky'
gen_boxes = [('a green car', [21, 281, 211, 159]), ('a blue truck', [269, 283, 209, 160]), ('a red air balloon', [66, 8, 145, 135]), ('a bird', [296, 42, 143, 100])]

# prompt = 'A realistic top-down view of a wooden table with two apples on it'
# gen_boxes = [('a wooden table', [20, 148, 472, 216]), ('an apple', [150, 226, 100, 100]), ('an apple', [280, 226, 100, 100])]

# prompt = 'A realistic scene of three skiers standing in a line on the snow near a palm tree'
# gen_boxes = [('a skier', [5, 152, 139, 168]), ('a skier', [278, 192, 121, 158]), ('a skier', [148, 173, 124, 155]), ('a palm tree', [404, 105, 103, 251])]

# prompt = 'An oil painting of a pink dolphin jumping on the left of a steam boat on the sea'
# gen_boxes = [('a steam boat', [232, 225, 257, 149]), ('a jumping pink dolphin', [21, 249, 189, 123])]

# Convert pixel-space [x, y, w, h] boxes to normalized [xmin, ymin, xmax, ymax]
boxes = np.array([x[1] for x in gen_boxes])
boxes = boxes / 512
boxes[:, 2] = boxes[:, 0] + boxes[:, 2]
boxes[:, 3] = boxes[:, 1] + boxes[:, 3]
boxes = boxes.tolist()
gligen_phrases = [x[0] for x in gen_boxes]
```
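As a quick sanity check on the conversion (my own sketch, not part of the training code; the 512x512 canvas is an assumption carried over from the snippet), one can re-run the same arithmetic and assert the resulting boxes are valid normalized xyxy:

```python
import numpy as np

# Boxes from the snippet above: (phrase, [x, y, w, h]) pairs.
gen_boxes = [('a green car', [21, 281, 211, 159]),
             ('a blue truck', [269, 283, 209, 160]),
             ('a red air balloon', [66, 8, 145, 135]),
             ('a bird', [296, 42, 143, 100])]

# Same conversion as above: [x, y, w, h] -> normalized [xmin, ymin, xmax, ymax].
boxes = np.array([b for _, b in gen_boxes], dtype=np.float64) / 512
boxes[:, 2] += boxes[:, 0]
boxes[:, 3] += boxes[:, 1]

# A valid normalized xyxy box satisfies 0 <= xmin < xmax <= 1 and 0 <= ymin < ymax <= 1.
assert (boxes >= 0.0).all() and (boxes <= 1.0).all()
assert (boxes[:, 0] < boxes[:, 2]).all() and (boxes[:, 1] < boxes[:, 3]).all()
print(boxes.round(3).tolist())
```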
Here are the results:
And the results of the same prompt produced by pretrained GLIGEN model:
I have also tried the training data provided by GLIGEN and achieved similar results after 500k iterations. That model seems inferior to the one trained on COCO, though unfortunately I have not evaluated either model quantitatively.
Wow, those are very good results. Please feel free to start the contribution.