stable-diffusion.cpp
Black video output when using Wan 2.2 with the Vulkan backend on macOS
Here's the command and its full console output:
build/bin/sd -M vid_gen --diffusion-model ../wan2.2_ti2v_5B_fp16.safetensors --vae ../wan2.2_vae.safetensors --t5xxl ../umt5_xxl_fp16.safetensors -p "a walking cat" --cfg-scale 6.0 --sampling-method euler -v -W 480 -H 832 --offload-to-cpu --diffusion-fa --flow-shift 3.0 --clip-on-cpu -n "blurry" --vae-on-cpu --video-frames 16
Option:
n_threads: 4
mode: vid_gen
model_path:
wtype: unspecified
clip_l_path:
clip_g_path:
clip_vision_path:
t5xxl_path: ../umt5_xxl_fp16.safetensors
diffusion_model_path: ../wan2.2_ti2v_5B_fp16.safetensors
high_noise_diffusion_model_path:
vae_path: ../wan2.2_vae.safetensors
taesd_path:
esrgan_path:
control_net_path:
embedding_dir:
photo_maker_path:
pm_id_images_dir:
pm_id_embed_path:
pm_style_strength: 20.00
output_path: output.png
init_image_path:
end_image_path:
mask_image_path:
control_image_path:
ref_images_paths:
control_video_path:
increase_ref_index: false
offload_params_to_cpu: true
clip_on_cpu: true
control_net_cpu: false
vae_on_cpu: true
diffusion flash attention: true
diffusion Conv2d direct: false
vae_conv_direct: false
control_strength: 0.90
prompt: a walking cat
negative_prompt: blurry
clip_skip: -1
width: 480
height: 832
sample_params: (txt_cfg: 6.00, img_cfg: 6.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: euler, sample_steps: 20, eta: 0.00, shifted_timestep: 0)
high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: default, sample_steps: -1, eta: 0.00, shifted_timestep: 0)
moe_boundary: 0.875
flow_shift: 3.00
strength(img2img): 0.75
rng: cuda
seed: 42
batch_count: 1
vae_tiling: false
upscale_repeats: 1
chroma_use_dit_mask: true
chroma_use_t5_mask: false
chroma_t5_mask_pad: 1
video_frames: 16
vace_strength: 1.00
fps: 16
System Info:
SSE3 = 0
AVX = 0
AVX2 = 0
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 0
NEON = 1
ARM_FMA = 1
F16C = 0
FP16_VA = 1
WASM_SIMD = 0
VSX = 0
[DEBUG] stable-diffusion.cpp:152 - Using Vulkan backend
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: Found 1 Vulkan devices:
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: 0 = Apple M4 (MoltenVK) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
[INFO ] stable-diffusion.cpp:208 - loading diffusion model from '../wan2.2_ti2v_5B_fp16.safetensors'
[INFO ] model.cpp:1044 - load ../wan2.2_ti2v_5B_fp16.safetensors using safetensors format
[DEBUG] model.cpp:1151 - init from '../wan2.2_ti2v_5B_fp16.safetensors', prefix = 'model.diffusion_model.'
[INFO ] stable-diffusion.cpp:248 - loading t5xxl from '../umt5_xxl_fp16.safetensors'
[INFO ] model.cpp:1044 - load ../umt5_xxl_fp16.safetensors using safetensors format
[DEBUG] model.cpp:1151 - init from '../umt5_xxl_fp16.safetensors', prefix = 'text_encoders.t5xxl.transformer.'
[INFO ] stable-diffusion.cpp:255 - loading vae from '../wan2.2_vae.safetensors'
[INFO ] model.cpp:1044 - load ../wan2.2_vae.safetensors using safetensors format
[DEBUG] model.cpp:1151 - init from '../wan2.2_vae.safetensors', prefix = 'vae.'
[DEBUG] model.cpp:1784 - patch_embedding_channels 147456
[INFO ] stable-diffusion.cpp:267 - Version: Wan 2.2 TI2V
[INFO ] stable-diffusion.cpp:298 - Weight type: f16
[INFO ] stable-diffusion.cpp:299 - Conditioner weight type: f16
[INFO ] stable-diffusion.cpp:300 - Diffusion model weight type: f16
[INFO ] stable-diffusion.cpp:301 - VAE weight type: NONE
[DEBUG] stable-diffusion.cpp:303 - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:338 - CLIP: Using CPU backend
[INFO ] stable-diffusion.cpp:342 - Using flash attention in the diffusion model
[INFO ] wan.hpp:2131 - Wan2.2-TI2V-5B
[DEBUG] ggml_extend.hpp:1729 - t5 params backend buffer size = 10835.86 MB(RAM) (242 tensors)
[DEBUG] ggml_extend.hpp:1729 - Wan2.2-TI2V-5B params backend buffer size = 9540.93 MB(RAM) (825 tensors)
[INFO ] stable-diffusion.cpp:456 - VAE Autoencoder: Using CPU backend
[DEBUG] ggml_extend.hpp:1729 - wan_vae params backend buffer size = 1344.24 MB(RAM) (196 tensors)
[DEBUG] stable-diffusion.cpp:565 - loading weights
[DEBUG] model.cpp:1961 - using 4 threads for model loading
[DEBUG] model.cpp:2044 - loading tensors from ../wan2.2_ti2v_5B_fp16.safetensors
|================================> | 825/1263 - 44.45it/s
[DEBUG] model.cpp:2044 - loading tensors from ../umt5_xxl_fp16.safetensors
|==========================================> | 1067/1263 - 21.82it/s
[DEBUG] model.cpp:2044 - loading tensors from ../wan2.2_vae.safetensors
|==================================================| 1263/1263 - 22.83it/s
[INFO ] model.cpp:2288 - loading tensors completed, taking 55.32s (process: 0.00s, read: 54.38s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:661 - total params memory size = 21721.03MB (VRAM 9540.93MB, RAM 12180.10MB): text_encoders 10835.86MB(RAM), diffusion_model 9540.93MB(VRAM), vae 1344.24MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:701 - running in FLOW mode
[DEBUG] stable-diffusion.cpp:725 - finished loaded file
[INFO ] stable-diffusion.cpp:2493 - generate_video 480x832x13
[INFO ] stable-diffusion.cpp:874 - attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:894 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:895 - prompt after extract and remove lora: "a walking cat"
[DEBUG] conditioner.hpp:1267 - parse 'a walking cat' to [['a walking cat', 1], ]
[DEBUG] t5.hpp:402 - token length: 512
[DEBUG] ggml_extend.hpp:1553 - t5 compute buffer size: 297.00 MB(RAM)
[DEBUG] conditioner.hpp:1359 - computing condition graph completed, taking 32373 ms
[DEBUG] conditioner.hpp:1267 - parse 'blurry' to [['blurry', 1], ]
[DEBUG] t5.hpp:402 - token length: 512
[DEBUG] ggml_extend.hpp:1553 - t5 compute buffer size: 297.00 MB(RAM)
[DEBUG] conditioner.hpp:1359 - computing condition graph completed, taking 22357 ms
[INFO ] stable-diffusion.cpp:2757 - get_learned_condition completed, taking 54737 ms
[DEBUG] stable-diffusion.cpp:2819 - sample 30x52x4
[INFO ] ggml_extend.hpp:1653 - Wan2.2-TI2V-5B offload params (9540.93 MB, 825 tensors) to runtime backend (Vulkan0), taking 42.38s
[DEBUG] ggml_extend.hpp:1553 - Wan2.2-TI2V-5B compute buffer size: 156.82 MB(VRAM)
|==================================================| 20/20 - 24.16s/it
[INFO ] stable-diffusion.cpp:2846 - sampling completed, taking 519.22s
[INFO ] stable-diffusion.cpp:2867 - generating latent video completed, taking 519.50s
[DEBUG] ggml_extend.hpp:1553 - wan_vae compute buffer size: 20112.94 MB(RAM)
[DEBUG] stable-diffusion.cpp:1547 - computing vae decode graph completed, taking 790.34s
[INFO ] stable-diffusion.cpp:2870 - decode_first_stage completed, taking 790.34s
[INFO ] stable-diffusion.cpp:2890 - generate_video completed in 1364.57s
save result MJPG AVI video to 'output.avi'
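One detail in the log: `--video-frames 16` was requested, but `generate_video` reports `480x832x13` and the sampler works on a `30x52x4` latent. The sketch below reproduces those numbers. It assumes, based only on what the log shows and not on the stable-diffusion.cpp source, that the frame count is rounded down to the form 4n + 1 and that the Wan 2.2 TI2V VAE downsamples 16x in width/height and 4x in time. The helper names are hypothetical.

```python
# Hypothetical helpers reproducing the shape arithmetic visible in the log.
# Assumptions (inferred from the log, not from the stable-diffusion.cpp source):
#   - the requested frame count is rounded down to the form 4n + 1
#   - the Wan 2.2 TI2V VAE downsamples 16x spatially and 4x temporally

SPATIAL_FACTOR = 16
TEMPORAL_FACTOR = 4


def effective_frames(requested: int) -> int:
    """Round a requested frame count down to the nearest 4n + 1."""
    return (requested - 1) // TEMPORAL_FACTOR * TEMPORAL_FACTOR + 1


def latent_shape(width: int, height: int, frames: int) -> tuple[int, int, int]:
    """Return the latent (width, height, time) for a pixel-space video."""
    return (
        width // SPATIAL_FACTOR,
        height // SPATIAL_FACTOR,
        (frames - 1) // TEMPORAL_FACTOR + 1,
    )


if __name__ == "__main__":
    frames = effective_frames(16)
    print(f"generate_video 480x832x{frames}")  # matches the log line
    w, h, t = latent_shape(480, 832, frames)
    print(f"sample {w}x{h}x{t}")  # matches the sampler's latent size
```

Under these assumptions the shapes in the log are consistent, so the black output does not appear to come from a mismatched frame count.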