When I use CLIPGuidedStableDiffusion based on stable-diffusion-2, the generated results are extremely bad, all mosaic.

### Describe the bug

When I use CLIPGuidedStableDiffusion based on stable-diffusion-2, the generated results are extremely bad: every image comes out as mosaic-like noise.

### Reproduction
```python
from diffusers import DiffusionPipeline
from transformers import CLIPFeatureExtractor, CLIPModel
import torch

model_id = r"C:\Users\admin\Desktop\stable_diffusion模型合辑\stabilityai_stable-diffusion-2"
feature_extractor = CLIPFeatureExtractor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
clip_model = CLIPModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", torch_dtype=torch.float16)

guided_pipeline = DiffusionPipeline.from_pretrained(
    model_id,
    custom_pipeline="clip_guided_stable_diffusion",
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    revision="fp16",
    torch_dtype=torch.float16,
)
guided_pipeline.enable_attention_slicing()
guided_pipeline = guided_pipeline.to("cuda")

prompt = "fantasy book cover, full moon, fantasy forest landscape, golden vector elements, fantasy magic, dark light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by WLOP and Artgerm and Albert Bierstadt, masterpiece"

generator = torch.Generator(device="cuda").manual_seed(0)
images = []
for i in range(4):
    image = guided_pipeline(
        prompt,
        num_inference_steps=50,
        guidance_scale=7.5,
        clip_guidance_scale=100,
        num_cutouts=4,
        use_cutouts=False,
        generator=generator,
    ).images[0]
    images.append(image)

# save images locally
for i, img in enumerate(images):
    img.save(f"./clip_guided_sd/image_{i}.png")
```
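As an aside (this is an assumption, not something confirmed in this thread): the stable-diffusion-2 768 checkpoint ships a scheduler config with `prediction_type: "v_prediction"`, while the CLIP guided community pipeline was written for epsilon-prediction models. A quick way to check which parameterization the local copy uses:

```python
# Sketch: inspect the scheduler config of the local stable-diffusion-2 copy.
# `model_id` is the same local path as in the reproduction above.
from diffusers import DDIMScheduler

scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
print(scheduler.config.prediction_type)  # expected: "v_prediction" for the 768 model
```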
### Logs
No response
### System Info
diffusers==0.9.0.dev0
Maybe related: https://github.com/huggingface/diffusers/issues/1429#issuecomment-1328385498

I used to get "mosaic" images similar to what you describe while training a dreambooth model on SD v2, and this solved it for me.
> Maybe related: #1429 (comment)
> I used to get "mosaic" images similar to what you describe while training a dreambooth model on SD v2, and this solved it for me.

I tried this solution, but it did not fix the problem. Maybe the issue is in `cond_fn`:

```python
# Excerpt from the community pipeline clip_guided_stable_diffusion.py;
# `spherical_dist_loss` and `self.make_cutouts` are defined elsewhere in that file.
@torch.enable_grad()
def cond_fn(
    self,
    latents,
    timestep,
    index,
    text_embeddings,
    noise_pred_original,
    text_embeddings_clip,
    clip_guidance_scale,
    num_cutouts,
    use_cutouts=True,
):
    latents = latents.detach().requires_grad_()

    if isinstance(self.scheduler, (LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler)):
        sigma = self.scheduler.sigmas[index]
        # the model input needs to be scaled to match the continuous ODE formulation in K-LMS
        latent_model_input = latents / ((sigma**2 + 1) ** 0.5)
    else:
        latent_model_input = latents

    # predict the noise residual
    noise_pred = self.unet(latent_model_input, timestep, encoder_hidden_states=text_embeddings).sample

    if isinstance(self.scheduler, (PNDMScheduler, DDIMScheduler)):
        alpha_prod_t = self.scheduler.alphas_cumprod[timestep]
        beta_prod_t = 1 - alpha_prod_t
        # compute predicted original sample from predicted noise, also called
        # "predicted x_0" of formula (12) from https://arxiv.org/pdf/2010.02502.pdf
        pred_original_sample = (latents - beta_prod_t ** (0.5) * noise_pred) / alpha_prod_t ** (0.5)

        fac = torch.sqrt(beta_prod_t)
        sample = pred_original_sample * fac + latents * (1 - fac)
    elif isinstance(self.scheduler, (LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler)):
        sigma = self.scheduler.sigmas[index]
        sample = latents - sigma * noise_pred
    else:
        raise ValueError(f"scheduler type {type(self.scheduler)} not supported")

    # scale latents back to image space and decode with the VAE
    sample = 1 / 0.18215 * sample
    image = self.vae.decode(sample).sample
    image = (image / 2 + 0.5).clamp(0, 1)

    if use_cutouts:
        image = self.make_cutouts(image, num_cutouts)
    else:
        image = transforms.Resize(self.feature_extractor.size["shortest_edge"])(image)
    image = self.normalize(image).to(latents.dtype)

    # CLIP image embedding, L2-normalized for the spherical distance loss
    image_embeddings_clip = self.clip_model.get_image_features(image)
    image_embeddings_clip = image_embeddings_clip / image_embeddings_clip.norm(p=2, dim=-1, keepdim=True)

    if use_cutouts:
        dists = spherical_dist_loss(image_embeddings_clip, text_embeddings_clip)
        dists = dists.view([num_cutouts, sample.shape[0], -1])
        loss = dists.sum(2).mean(0).sum() * clip_guidance_scale
    else:
        loss = spherical_dist_loss(image_embeddings_clip, text_embeddings_clip).mean() * clip_guidance_scale

    grads = -torch.autograd.grad(loss, latents)[0]

    # note: EulerDiscreteScheduler is included here as well; otherwise it would
    # fall through to the else branch, where beta_prod_t is never defined
    if isinstance(self.scheduler, (LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler)):
        latents = latents.detach() + grads * (sigma**2)
        noise_pred = noise_pred_original
    else:
        noise_pred = noise_pred_original - torch.sqrt(beta_prod_t) * grads
    return noise_pred, latents
```
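One possible explanation, hedged since this thread never confirms it: the `pred_original_sample` computation above assumes the UNet predicts noise (epsilon). If the loaded stable-diffusion-2 checkpoint uses `prediction_type="v_prediction"`, as the 768 model does, the epsilon formula decodes a quantity that is not actually x_0, which could plausibly produce the mosaic artifacts described. A sketch of what the DDIM branch would need for v-prediction, following the standard v-parameterization formula:

```python
# Hedged sketch, not the pipeline's actual code: branch on the scheduler's
# prediction_type when computing "predicted x_0" inside cond_fn.
alpha_prod_t = self.scheduler.alphas_cumprod[timestep]
beta_prod_t = 1 - alpha_prod_t

if self.scheduler.config.prediction_type == "v_prediction":
    # v-parameterization: x_0 = sqrt(alpha_t) * x_t - sqrt(1 - alpha_t) * v
    pred_original_sample = (alpha_prod_t**0.5) * latents - (beta_prod_t**0.5) * noise_pred
else:
    # epsilon parameterization, as in the excerpt above
    pred_original_sample = (latents - (beta_prod_t**0.5) * noise_pred) / (alpha_prod_t**0.5)
```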
Hey @ScottishFold007,
I think we sadly don't currently have the time to look into this problem. We would, however, love to review a PR that makes the CLIP guided pipeline work with v2.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.