Stable Cascade Image to Image Pipeline
Are there any plans for an Image2Image pipeline for the StableCascade model?
In theory, it should be doable with StableCascadeCombinedPipeline. It accepts an `images` argument that can be a PIL image, a torch tensor, or a list of either. Unfortunately, I can't get it to accept a bfloat16 type for the image; it raises a runtime error in CLIP. I tried float32, but HF's A10G Large runs out of memory.
Although I'm an experienced coder, I don't think I can justify the time it would take me to dig deeply enough to come up with a fix. Hope someone else knows enough to concoct a solution.
Here's the error I'm seeing when I try to pass an image encoded as `torch.bfloat16`:
File "/home/user/app/app.py", line 50, in generate_image
results = pipe(
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_combined.py", line 268, in __call__
prior_outputs = self.prior_pipe(
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_prior.py", line 504, in __call__
image_embeds_pooled, uncond_image_embeds_pooled = self.encode_image(
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_prior.py", line 254, in encode_image
image = self.feature_extractor(image, return_tensors="pt").pixel_values
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/transformers/image_processing_utils.py", line 551, in __call__
return self.preprocess(images, **kwargs)
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/transformers/models/clip/image_processing_clip.py", line 306, in preprocess
images = [to_numpy_array(image) for image in images]
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/transformers/models/clip/image_processing_clip.py", line 306, in <listcomp>
images = [to_numpy_array(image) for image in images]
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/transformers/image_utils.py", line 174, in to_numpy_array
return to_numpy(img)
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/transformers/utils/generic.py", line 308, in to_numpy
return framework_to_numpy[framework](obj)
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/transformers/utils/generic.py", line 293, in <lambda>
"pt": lambda obj: obj.detach().cpu().numpy(),
TypeError: Got unsupported ScalarType BFloat16
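For what it's worth, the immediate cause seems to be that NumPy has no bfloat16 dtype, so the `.numpy()` call inside the CLIP image processor can't handle a bfloat16 tensor at all. A two-line repro:

```python
import torch

t = torch.zeros(1, dtype=torch.bfloat16)
t.numpy()  # TypeError: Got unsupported ScalarType BFloat16
```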
FWIW, here are relevant snippets from the code that produced the above error:
```python
import random

import numpy as np
import spaces
import torch
from diffusers import StableCascadeCombinedPipeline

# Define a transform to convert a PIL image to a tensor (method given by Claude 3 Sonnet)
def transform(image):
    # Convert the image to a PyTorch tensor with shape (1, C, H, W)
    input_tensor = torch.from_numpy(np.array(image)).permute(2, 0, 1).unsqueeze(0)
    # Convert the tensor to 'bfloat16' dtype
    input_tensor = input_tensor.to(torch.bfloat16)
    return input_tensor

# Ensure model and scheduler are initialized in GPU-enabled function
# (repo, the model id, is defined elsewhere in the app)
if torch.cuda.is_available():
    pipe = StableCascadeCombinedPipeline.from_pretrained(repo, torch_dtype=torch.bfloat16)
    pipe.to("cuda")

# The generate function
@spaces.GPU(enable_queue=True)
def generate_image(prompt, image):
    if image is not None:
        # Convert the PIL image to a Torch tensor and move it to the GPU
        img_tensor = transform(image)
        img_tensor = [img_tensor.to("cuda")]
    else:
        img_tensor = None
    seed = random.randint(-100000, 100000)
    results = pipe(
        prompt=prompt,
        images=img_tensor,
        height=1024,
        width=1024,
        num_inference_steps=20,
        generator=torch.Generator(device="cuda").manual_seed(seed),
    )
    return results.images[0]
```
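One sidestep for the TypeError itself (a sketch based on the traceback above, untested here): skip the tensor conversion entirely and hand the pipeline the PIL image, which `images` also accepts. The CLIP feature extractor converts the input to a NumPy array anyway, so only the pipeline weights need to be bfloat16.

```python
results = pipe(
    prompt=prompt,
    images=[image],  # pass the PIL image directly instead of a bfloat16 tensor
    height=1024,
    width=1024,
    num_inference_steps=20,
    generator=torch.Generator(device="cuda").manual_seed(seed),
)
```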
See https://github.com/huggingface/diffusers/issues/7598#issuecomment-2042897916, which contains a minimal working app.py for img2img using StableCascadeCombinedPipeline.

The issue turned out to be that `pipe.to("cuda")` does not move the prior's image encoder to cuda; an extra line is needed to do it manually.
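In code, the fix looks something like this (assuming the combined pipeline exposes the prior as `prior_pipe` with an `image_encoder` component, which is what the traceback above suggests):

```python
pipe = StableCascadeCombinedPipeline.from_pretrained(repo, torch_dtype=torch.bfloat16)
pipe.to("cuda")
# pipe.to("cuda") leaves the prior's CLIP image encoder behind,
# so move it to the GPU explicitly
pipe.prior_pipe.image_encoder.to("cuda")
```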
cc @kashif here

Does it make sense to make an img2img pipeline for Stable Cascade? From what I understand, the `image` argument in Stable Cascade has a similar role to `prompt`, so it does not work the same way as an img2img pipeline.
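To illustrate the distinction (a conceptual sketch; the function names are mine, but the calls mirror the diffusers APIs involved):

```python
def img2img_init_latents(vae, scheduler, image, noise, timestep):
    # Classic img2img (e.g. StableDiffusionImg2ImgPipeline): the input image
    # becomes the starting latents, which are partially noised and then
    # denoised toward the prompt.
    latents = vae.encode(image).latent_dist.sample()
    return scheduler.add_noise(latents, noise, timestep)

def cascade_image_conditioning(image_encoder, feature_extractor, image):
    # Stable Cascade: the input image is reduced to CLIP embeddings and used
    # as extra conditioning alongside the text prompt; it does not seed the
    # latents that get denoised.
    pixel_values = feature_extractor(image, return_tensors="pt").pixel_values
    return image_encoder(pixel_values).image_embeds
```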
> cc @kashif here: does it make sense to make an img2img pipeline for Stable Cascade? From what I understand, the `image` argument in Stable Cascade has a similar role to `prompt`, so it does not work the same way as an img2img pipeline.
I'd like to understand that in more depth. Empirically, passing an image and a prompt is doing exactly what I expect. I get a result that's clearly based on the input image and influenced by the prompt. Here are three images.
- Stable Cascade's output when prompted with "Barad Dur".
- A photo I took of the Sir Walter Scott monument in Edinburgh.
- The output of prompting with "Barad Dur" and supplying my photo as an image input.
To me, the third image, while not very exciting, is clearly derived from the photo but with the monument enlarged and restyled in a way that's consistent with Stable Cascade's concept of Barad Dur.
[Images attached: "Prompt only", "Photo", "Prompt + Photo"]
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing this issue because of inactivity. Feel free to reopen.