
[single file] Cosmos

Open · a-r-r-o-w opened this pull request 8 months ago · 1 comment

Possibly fixes #11798

We can run inference with the 7B Text-to-World model with the following code:

import torch
from diffusers import CosmosTextToWorldPipeline, CosmosTransformer3DModel
from diffusers.utils import export_to_video

model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"
transformer_single_file = "https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-7B-Text2World/blob/main/model.pt"

transformer = CosmosTransformer3DModel.from_single_file(transformer_single_file, torch_dtype=torch.bfloat16).to("cuda")
pipe = CosmosTextToWorldPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

output = pipe(prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=30)

@DN6 I'm not sure I remember how to support different versions of the same model. With the current implementation, loading the 14B model would fail with a weight shape mismatch, most likely due to config-related issues. Could you share some insights?

For the Cosmos 1.0 text-to-world and video-to-world models (7B and 14B), I'll have to make a cosmos-1.0 entry, and another entry, cosmos-2.0, for the Cosmos Predict2 models. But what's the normal process for models of the same family with different parameter sizes?
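One plausible way to organize this (purely a sketch; the names, keys, and values below are illustrative assumptions, not the actual diffusers entries) is to key the single-file config mapping by family and parameter count:

```python
# Hypothetical sketch: per-variant config entries keyed by family and size.
# All names and values here are illustrative only; the real entries would
# live in diffusers' single-file loading utilities.
COSMOS_SINGLE_FILE_CONFIGS = {
    "cosmos-1.0-7B": {"num_attention_heads": 32, "attention_head_dim": 128},
    "cosmos-1.0-14B": {"num_attention_heads": 40, "attention_head_dim": 128},
    "cosmos-2.0-2B": {"num_attention_heads": 16, "attention_head_dim": 128},
}

def get_cosmos_config(model_type):
    """Look up the (hypothetical) config dict for a detected model_type."""
    try:
        return COSMOS_SINGLE_FILE_CONFIGS[model_type]
    except KeyError:
        raise ValueError(f"unknown Cosmos variant: {model_type!r}") from None
```

Under this layout, adding a new parameter size would just mean adding one more entry plus a detection rule that maps a checkpoint to its key.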

a-r-r-o-w · Jun 24 '25 20:06

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

While I'm not an expert on the diffusers code base, as far as I can see (based on Wan, which also ships multiple parameter counts) the variants are just treated as different model types, e.g. in src/diffusers/loaders/single_file_utils.py:

        if checkpoint[target_key].shape[0] == 1536:
            model_type = "wan-t2v-1.3B"
        elif checkpoint[target_key].shape[0] == 5120 and checkpoint[target_key].shape[1] == 16:
            model_type = "wan-t2v-14B"
        else:
            model_type = "wan-i2v-14B"
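For Cosmos, the same dispatch could plausibly be sketched as below. Note that the target key name (net.x_embedder.proj.1.weight) and the hidden sizes are assumptions for illustration only, not the real checkpoint layout; the actual values would have to be read from the released 7B/14B checkpoints:

```python
# Hypothetical sketch: pick a Cosmos config entry from the shape of one
# discriminating weight, mirroring the Wan dispatch above. The key name and
# dimensions are illustrative assumptions, not the real checkpoint layout.
def infer_cosmos_model_type(checkpoint, target_key="net.x_embedder.proj.1.weight"):
    """Return a model_type string based on the hidden size of target_key."""
    if target_key not in checkpoint:
        raise KeyError(f"cannot infer model type: {target_key!r} missing")
    dim = checkpoint[target_key].shape[0]
    if dim == 4096:  # illustrative hidden size for the 7B variant
        return "cosmos-1.0-7B"
    elif dim == 5120:  # illustrative hidden size for the 14B variant
        return "cosmos-1.0-14B"
    raise ValueError(f"unrecognized hidden size {dim} for {target_key!r}")
```

The useful property of this pattern is that it only touches one tensor's shape, so it works on a lazily loaded state dict without materializing the full checkpoint.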

Vargol · Jun 26 '25 10:06

@a-r-r-o-w I think you can just run a shape check on the params to determine which config to use. That should be sufficient to differentiate?

DN6 · Jun 27 '25 09:06

@Vargol Could you verify if the latest changes work for you?

a-r-r-o-w · Jun 27 '25 20:06

The Cosmos 2B single file at https://huggingface.co/nvidia/Cosmos-Predict2-2B-Text2Image/resolve/main/model.pt loaded successfully, ran, and generated the expected image.

I tried a GGUF file for the 14B version and that didn't work. I'm not sure if that was in scope, though. If it was, the error is:

$ python cosmos_gguf_prmpts.py 
Multiple distributions found for package optimum. Picked distribution: optimum-quanto
WARNING:torchao.kernel.intmm:Warning: Detected no triton, on systems without Triton certain kernels will not work
W0627 23:30:48.574000 85696 lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
The config attributes {'input_types': ['text'], 'model_size': '14b'} were passed to CosmosTransformer3DModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Traceback (most recent call last):
  File "/Volumes/SSD2TB/AI/Diffusers/cosmos_gguf_prmpts.py", line 12, in <module>
    transformer = CosmosTransformer3DModel.from_single_file(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/diffusers/loaders/single_file_model.py", line 420, in from_single_file
    load_model_dict_into_meta(
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/diffusers/models/model_loading_utils.py", line 285, in load_model_dict_into_meta
    hf_quantizer.check_quantized_param_shape(param_name, empty_state_dict[param_name], param)
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/diffusers/quantizers/gguf/gguf_quantizer.py", line 84, in check_quantized_param_shape
    raise ValueError(
ValueError: patch_embed.proj.weight has an expected quantized shape of: (5120, 68), but received shape: torch.Size([5120, 136])
$ 

Vargol · Jun 27 '25 22:06

@Vargol Could you share a link to the GGUF checkpoint you're trying to load?

DN6 · Jul 01 '25 12:07

@DN6 Sorry if you get this multiple times; GitHub isn't showing any response when I press the Comment button, and after reloading the page there's no sign of my reply.

I got the version I tested from https://huggingface.co/city96/Cosmos-Predict2-14B-Text2Image-gguf/blob/main/cosmos-predict2-14b-text2image-Q5_K_M.gguf

Vargol · Jul 01 '25 12:07