Add VAE to txt2img Inference
Hey hey!
So I am using some models that either have a VAE baked in or require a separate VAE to be defined during inference, like this:
from diffusers import StableDiffusionPipeline, AutoencoderKL

model = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
pipe = StableDiffusionPipeline.from_pretrained(model, vae=vae)
When I either manually added the VAE or used a model with a VAE baked in as the MODEL_ID, I received the following error, for example with the model dreamlike-art/dreamlike-photoreal-2.0:
Traceback (most recent call last):
  File "/api/app.py", line 382, in inference
    images = pipeline(**model_inputs).images
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/api/diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 606, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds).sample
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/api/diffusers/src/diffusers/models/unet_2d_condition.py", line 475, in forward
    sample = self.conv_in(sample)
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same
That's line 382 in the inference function, which looks like this:
images = pipeline(**model_inputs).images
Perhaps we need to add a .half() to the input somewhere, though I'm not sure where.
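For example, one guess (just a sketch, untested): load the external VAE in the same dtype as the rest of the pipeline, so its outputs match the UNet's half-precision weights.

import torch
from diffusers import AutoencoderKL

# Load the replacement VAE in fp16 to match an fp16 pipeline;
# the model ID is the same sd-vae-ft-mse one from above.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
)

and then pass that vae into the pipeline as in the snippet above.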
Any help would be greatly appreciated!
It's the last hurdle I'm facing before I can generate images.
IDEA: It would be awesome if we could define an optional VAE when making an API call, like this:
model_inputs["callInputs"] = {
    "MODEL_ID": "runwayml/stable-diffusion-v1-5",
    "PIPELINE": "StableDiffusionPipeline",
    "SCHEDULER": self.scheduler,
    "VAE": "stabilityai/sd-vae-ft-mse"
}
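On the server side, I imagine it could be handled with something like this (purely hypothetical sketch; the "VAE" key and the load_pipeline helper are just made up for this proposal, not an existing docker-diffusers-api feature):

from diffusers import StableDiffusionPipeline, AutoencoderKL

def load_pipeline(call_inputs):
    # Hypothetical: if the proposed "VAE" callInput is present, load that
    # VAE and pass it in; otherwise fall back to the model's baked-in VAE.
    kwargs = {}
    if call_inputs.get("VAE"):
        kwargs["vae"] = AutoencoderKL.from_pretrained(call_inputs["VAE"])
    return StableDiffusionPipeline.from_pretrained(
        call_inputs["MODEL_ID"], **kwargs
    )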
Hey, @digiphd! Thanks for getting this on my radar. I'll have a chance to take a look during this coming week.
As a preliminary comment, I like the idea of being able to switch the VAE at runtime, although there will be a lot of work involved to adapt how we currently cache models.
P.S. If you're impatient, in the meantime, I think you could probably:
- Clone https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/fp16
- Replace the vae directory with the contents from https://huggingface.co/stabilityai/sd-vae-ft-mse/tree/main
- Upload that "new" model back to HuggingFace and build docker-diffusers-api with that (it's possible without uploading back to HuggingFace, but a bit more complicated). There's a sketch of the same swap done in code right after this list.
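The same swap can also be done in a few lines of diffusers, if that's easier than cloning the repo by hand (a sketch; the output directory name is just an example):

import torch
from diffusers import StableDiffusionPipeline, AutoencoderKL

# Load the fp16 base model with the replacement VAE, then save the
# combined pipeline as a single model directory that can be pushed
# back to HuggingFace.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
    vae=vae,
)
pipe.save_pretrained("sd-v1-5-ft-mse-vae")  # example local directory name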
Alternatively, with your current setup, it's possible that if you set MODEL_PRECISION="" and MODEL_REVISION="", you might get past that error by using full precision (but inference will be slower; nevertheless, maybe something useful in the interim).
Anyways, have a great weekend and we'll be in touch next week :grinning:
Hey @gadicc, great, thanks for your suggestions, I will give them a go! You're a legend!
Another thing I was wondering: does docker-diffusers-api text-to-image support negative prompts?
I did pass one as an argument, and it did seem to steer the output images away from those terms.
Yup! It's the negative_prompt modelInput, as it seems you worked out.
The modelInputs are passed directly to the relevant diffusers pipeline, so you can use whatever arguments are supported by that pipeline. I made this a little clearer in the README a few days ago with links to the common diffusers pipelines, as I admit it wasn't so obvious until then :sweat_smile:
There's also a note there now about using the lpw_stable_diffusion pipeline which supports longer prompts and prompt weights.
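For example, a request along these lines (illustrative prompt strings, same shape as the callInputs snippet above):

model_inputs = {
    "prompt": "portrait photo, golden hour lighting",     # illustrative
    "negative_prompt": "blurry, low quality, watermark",  # illustrative
}
model_inputs["callInputs"] = {
    "MODEL_ID": "runwayml/stable-diffusion-v1-5",
    "PIPELINE": "StableDiffusionPipeline",
}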
Thanks for all the kind words! :raised_hands:
Hey @digiphd, I had a quick moment to try dreamlike-art/dreamlike-photoreal-2.0 and it works out of the box for me, in both full and half precision. What version of docker-diffusers-api are you using?
These worked for me:
$ python test.py txt2img --call-arg MODEL_ID="dreamlike-art/dreamlike-photoreal-2.0" --call-arg MODEL_PRECISION=""
$ python test.py txt2img --call-arg MODEL_ID="dreamlike-art/dreamlike-photoreal-2.0" --call-arg MODEL_PRECISION="fp16"
I just tried in the default "runtime" config. If you have this issue specifically in the -build-download variant, let me know.
Related: https://github.com/kiri-art/docker-diffusers-api/issues/26