Train GLIGEN in diffusers?
Thanks for your great work!
Now I know how to run GLIGEN inference with diffusers (https://github.com/gligen/diffusers/tree/gligen/examples/gligen). But how can I train GLIGEN with diffusers, like ControlNet (https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet.py)?
Thanks again.
We don't have the bandwidth to work on training scripts for GLIGEN, but we can open it up to the community.
Hi @sayakpaul, I have tried the code at the URL above:

```python
boxes = [[0.4, 0.2, 1.0, 0.8], [0.0, 1.0, 0.0, 1.0]]  # Set `[0.0, 1.0, 0.0, 1.0]` for the style
```

The bounding boxes should be in the format `[xmin, ymin, xmax, ymax]`, so I am confused about this point. I think the right box for the style may be `[0, 0, 1, 1]`.
Cc: @tuanh123789 could you help?
Ok I'll check
@Hzzone In the original GLIGEN repo, the author uses `[xmin, ymin, xmax, ymax]` (i.e. `[x0, y0, x1, y1]`). When using style, they pass `[0.0, 1.0, 0.0, 1.0]` as the reference-image location, so the GLIGEN implementation in Diffusers matches the original.
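To make the box convention concrete, here is a minimal sketch. The helper name `xywh_to_xyxy` and the 512-pixel canvas size are my own assumptions for illustration, not part of the Diffusers API; `STYLE_BOX` is just the placeholder value from the original repo discussed above.

```python
# GLIGEN expects normalized [xmin, ymin, xmax, ymax] boxes. The "style"
# entry is the special placeholder [0.0, 1.0, 0.0, 1.0] from the original repo.
STYLE_BOX = [0.0, 1.0, 0.0, 1.0]

def xywh_to_xyxy(box, image_size=512):
    """Convert a pixel-space [x, y, w, h] box to normalized [xmin, ymin, xmax, ymax]."""
    x, y, w, h = box
    return [x / image_size, y / image_size, (x + w) / image_size, (y + h) / image_size]

print(xywh_to_xyxy([21, 281, 211, 159]))
```

Note that `STYLE_BOX` is not a geometrically valid xyxy box (xmax < xmin), which is exactly why it reads as a sentinel value rather than a real region.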
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @sayakpaul I have successfully trained GLIGEN like ControlNet. Could I make a contribution with respect to this issue?
Thanks for your interest and for working on this. Out of curiosity, could we see some results you're getting with your trained model?
In any case, feel free to open a PR adding a training script to https://github.com/huggingface/diffusers/tree/main/examples/research_projects/.
I trained the model on the COCO dataset for 100k iterations with a batch size of 64, using GroundingDINO and BLIP-2 to label instances. Prompt:
```python
import numpy as np

prompt = 'A realistic image of landscape scene depicting a green car parking on the left of a blue truck, with a red air balloon and a bird in the sky'
gen_boxes = [('a green car', [21, 281, 211, 159]), ('a blue truck', [269, 283, 209, 160]), ('a red air balloon', [66, 8, 145, 135]), ('a bird', [296, 42, 143, 100])]

# prompt = 'A realistic top-down view of a wooden table with two apples on it'
# gen_boxes = [('a wooden table', [20, 148, 472, 216]), ('an apple', [150, 226, 100, 100]), ('an apple', [280, 226, 100, 100])]

# prompt = 'A realistic scene of three skiers standing in a line on the snow near a palm tree'
# gen_boxes = [('a skier', [5, 152, 139, 168]), ('a skier', [278, 192, 121, 158]), ('a skier', [148, 173, 124, 155]), ('a palm tree', [404, 105, 103, 251])]

# prompt = 'An oil painting of a pink dolphin jumping on the left of a steam boat on the sea'
# gen_boxes = [('a steam boat', [232, 225, 257, 149]), ('a jumping pink dolphin', [21, 249, 189, 123])]

# Convert pixel-space [x, y, w, h] boxes to normalized [xmin, ymin, xmax, ymax]
boxes = np.array([x[1] for x in gen_boxes])
boxes = boxes / 512
boxes[:, 2] = boxes[:, 0] + boxes[:, 2]
boxes[:, 3] = boxes[:, 1] + boxes[:, 3]
boxes = boxes.tolist()
gligen_phrases = [x[0] for x in gen_boxes]
```
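As a quick sanity check on the conversion (my own sketch, not part of the training code; the 512x512 canvas is an assumption carried over from the snippet), one can re-run the same arithmetic and assert the resulting boxes are valid normalized xyxy:

```python
import numpy as np

# Boxes from the snippet above: (phrase, [x, y, w, h]) pairs.
gen_boxes = [('a green car', [21, 281, 211, 159]),
             ('a blue truck', [269, 283, 209, 160]),
             ('a red air balloon', [66, 8, 145, 135]),
             ('a bird', [296, 42, 143, 100])]

# Same conversion as above: [x, y, w, h] -> normalized [xmin, ymin, xmax, ymax].
boxes = np.array([b for _, b in gen_boxes], dtype=np.float64) / 512
boxes[:, 2] += boxes[:, 0]
boxes[:, 3] += boxes[:, 1]

# A valid normalized xyxy box satisfies 0 <= xmin < xmax <= 1 and 0 <= ymin < ymax <= 1.
assert (boxes >= 0.0).all() and (boxes <= 1.0).all()
assert (boxes[:, 0] < boxes[:, 2]).all() and (boxes[:, 1] < boxes[:, 3]).all()
print(boxes.round(3).tolist())
```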
Here are the results:
And the results of the same prompt produced by pretrained GLIGEN model:
I have also tried the training data provided by GLIGEN and achieved similar results after 500k iterations. That model seems inferior to the one trained on COCO, though unfortunately I have not evaluated either model quantitatively.
Wow, those are very good results. Please feel free to start the contribution.