Seungwoo, Jeong issues

Results: 3 issues by Seungwoo, Jeong

BLIP-2 input image size setting (image captioning)

In the BLIP-2 paper, "We propose Q-Former as the trainable module to bridge the gap between a frozen image encoder and a frozen LLM. It extracts a fixed number of...

Will the image-to-audio model be open?

The most surprising part of AudioLDM2 was the results of converting images to audio. Will this be part of a future release?

What are the recommended hardware specifications for inference?

How much VRAM is needed for inference? And can you recommend a minimum GPU for generating videos? Thanks for open-sourcing this!