When I use CLIPGuidedStableDiffusion based on stable-diffusion-2, the generated results are extremely bad, all mosaic.

### Describe the bug

When I use CLIPGuidedStableDiffusion based on stable-diffusion-2, the generated results are extremely bad: every image comes out as mosaic-like noise.

### Reproduction
```python
from diffusers import DiffusionPipeline
from transformers import CLIPFeatureExtractor, CLIPModel
import torch

model_id = r"C:\Users\admin\Desktop\stable_diffusion模型合辑\stabilityai_stable-diffusion-2"
feature_extractor = CLIPFeatureExtractor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
clip_model = CLIPModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", torch_dtype=torch.float16)

guided_pipeline = DiffusionPipeline.from_pretrained(
    model_id,
    custom_pipeline="clip_guided_stable_diffusion",
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    revision="fp16",
    torch_dtype=torch.float16,
)
guided_pipeline.enable_attention_slicing()
guided_pipeline = guided_pipeline.to("cuda")

prompt = "fantasy book cover, full moon, fantasy forest landscape, golden vector elements, fantasy magic, dark light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by WLOP and Artgerm and Albert Bierstadt, masterpiece"

generator = torch.Generator(device="cuda").manual_seed(0)
images = []
for i in range(4):
    image = guided_pipeline(
        prompt,
        num_inference_steps=50,
        guidance_scale=7.5,
        clip_guidance_scale=100,
        num_cutouts=4,
        use_cutouts=False,
        generator=generator,
    ).images[0]
    images.append(image)

# save images locally
for i, img in enumerate(images):
    img.save(f"./clip_guided_sd/image_{i}.png")
```
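As an aside (this is an assumption, not something confirmed in this thread): the stable-diffusion-2 768 checkpoint ships a scheduler config with `prediction_type: "v_prediction"`, while the CLIP guided community pipeline was written for epsilon-prediction models. A quick way to check which parameterization the local copy uses:

```python
# Sketch: inspect the scheduler config of the local stable-diffusion-2 copy.
# `model_id` is the same local path as in the reproduction above.
from diffusers import DDIMScheduler

scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
print(scheduler.config.prediction_type)  # expected: "v_prediction" for the 768 model
```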
### Logs
No response
### System Info
diffusers==0.9.0.dev0
Maybe related: https://github.com/huggingface/diffusers/issues/1429#issuecomment-1328385498

I used to get "mosaic" images similar to what you describe while training a dreambooth model on SD v2, and this solved it for me.
> Maybe related: #1429 (comment)
> I used to get "mosaic" images similar to what you describe while training a dreambooth model on SD v2, and this solved it for me.

I tried this solution, but it did not fix the problem. Maybe the issue is in `cond_fn`:

```python
# Excerpt from the community pipeline clip_guided_stable_diffusion.py;
# `spherical_dist_loss` and `self.make_cutouts` are defined elsewhere in that file.
@torch.enable_grad()
def cond_fn(
    self,
    latents,
    timestep,
    index,
    text_embeddings,
    noise_pred_original,
    text_embeddings_clip,
    clip_guidance_scale,
    num_cutouts,
    use_cutouts=True,
):
    latents = latents.detach().requires_grad_()

    if isinstance(self.scheduler, (LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler)):
        sigma = self.scheduler.sigmas[index]
        # the model input needs to be scaled to match the continuous ODE formulation in K-LMS
        latent_model_input = latents / ((sigma**2 + 1) ** 0.5)
    else:
        latent_model_input = latents

    # predict the noise residual
    noise_pred = self.unet(latent_model_input, timestep, encoder_hidden_states=text_embeddings).sample

    if isinstance(self.scheduler, (PNDMScheduler, DDIMScheduler)):
        alpha_prod_t = self.scheduler.alphas_cumprod[timestep]
        beta_prod_t = 1 - alpha_prod_t
        # compute predicted original sample from predicted noise, also called
        # "predicted x_0" of formula (12) from https://arxiv.org/pdf/2010.02502.pdf
        pred_original_sample = (latents - beta_prod_t ** (0.5) * noise_pred) / alpha_prod_t ** (0.5)

        fac = torch.sqrt(beta_prod_t)
        sample = pred_original_sample * fac + latents * (1 - fac)
    elif isinstance(self.scheduler, (LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler)):
        sigma = self.scheduler.sigmas[index]
        sample = latents - sigma * noise_pred
    else:
        raise ValueError(f"scheduler type {type(self.scheduler)} not supported")

    # scale latents back to image space and decode with the VAE
    sample = 1 / 0.18215 * sample
    image = self.vae.decode(sample).sample
    image = (image / 2 + 0.5).clamp(0, 1)

    if use_cutouts:
        image = self.make_cutouts(image, num_cutouts)
    else:
        image = transforms.Resize(self.feature_extractor.size["shortest_edge"])(image)
    image = self.normalize(image).to(latents.dtype)

    # CLIP image embedding, L2-normalized for the spherical distance loss
    image_embeddings_clip = self.clip_model.get_image_features(image)
    image_embeddings_clip = image_embeddings_clip / image_embeddings_clip.norm(p=2, dim=-1, keepdim=True)

    if use_cutouts:
        dists = spherical_dist_loss(image_embeddings_clip, text_embeddings_clip)
        dists = dists.view([num_cutouts, sample.shape[0], -1])
        loss = dists.sum(2).mean(0).sum() * clip_guidance_scale
    else:
        loss = spherical_dist_loss(image_embeddings_clip, text_embeddings_clip).mean() * clip_guidance_scale

    grads = -torch.autograd.grad(loss, latents)[0]

    # note: EulerDiscreteScheduler is included here as well; otherwise it would
    # fall through to the else branch, where beta_prod_t is never defined
    if isinstance(self.scheduler, (LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler)):
        latents = latents.detach() + grads * (sigma**2)
        noise_pred = noise_pred_original
    else:
        noise_pred = noise_pred_original - torch.sqrt(beta_prod_t) * grads
    return noise_pred, latents
```
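One possible explanation, hedged since this thread never confirms it: the `pred_original_sample` computation above assumes the UNet predicts noise (epsilon). If the loaded stable-diffusion-2 checkpoint uses `prediction_type="v_prediction"`, as the 768 model does, the epsilon formula decodes a quantity that is not actually x_0, which could plausibly produce the mosaic artifacts described. A sketch of what the DDIM branch would need for v-prediction, following the standard v-parameterization formula:

```python
# Hedged sketch, not the pipeline's actual code: branch on the scheduler's
# prediction_type when computing "predicted x_0" inside cond_fn.
alpha_prod_t = self.scheduler.alphas_cumprod[timestep]
beta_prod_t = 1 - alpha_prod_t

if self.scheduler.config.prediction_type == "v_prediction":
    # v-parameterization: x_0 = sqrt(alpha_t) * x_t - sqrt(1 - alpha_t) * v
    pred_original_sample = (alpha_prod_t**0.5) * latents - (beta_prod_t**0.5) * noise_pred
else:
    # epsilon parameterization, as in the excerpt above
    pred_original_sample = (latents - (beta_prod_t**0.5) * noise_pred) / (alpha_prod_t**0.5)
```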
Hey @ScottishFold007,
I think we sadly don't currently have the time to look into this problem. We would, however, love to review a PR that makes the CLIP guided pipeline work with v2.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.