Multi-GPU inference on 13b-dev base model?
Hi, I tried to run the 13b model on a single 48GB video card, but it failed. I was wondering, is there a project for multi-GPU inference?
You should post how you run it: what file? what command? what error? and how big is the video? Saying "failed" does not provide any context.
I can run 13B on an RTX 4090 48GB to generate 1216x704, 88 frames, without offloading to CPU.
It just OOMs while loading the 13B model (not the fp8 version). I am using an L20.
@StarsTesla I can run 13B in bf16 on an RTX 4090 48GB to generate 1216x704, 88 frames, without offloading to CPU. How can you not be able to load it? Can you share your loading code?
@eisneim
```
git log
commit 6e45836dda8d340045efb97cd7b044c6a4696542 (HEAD -> main, origin/main, origin/HEAD)
Merge: 27de2ef b3c857a
Author: Yoav HaCohen [email protected]
Date:   Wed May 7 13:03:45 2025 +0300

    Merge pull request #157 from Lightricks/feature/fix-pyproject

    Install: Fix install to include ltx_video directory only

commit b3c857a48e7a177deec840cd9b176304343b2517 (origin/feature/fix-pyproject)
Author: Yoav HaCohen [email protected]
Date:   Wed May 7 13:02:28 2025 +0300

    Install: Fix install to include ltx_video directory only

commit 27de2ef3a2bcc9549624886454059cdf013d3899
```
Why don't you try 1216x704 at 81, 73, or 65 frames? Or lower the resolution, like 1024x576 at 97 frames?
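For a sense of scale, transformer memory pressure grows with the number of latent tokens, so fewer frames or a lower resolution both help. A minimal sketch, assuming LTX-Video's reported 32x32 spatial and 8x temporal VAE compression with frame counts padded to a multiple of 8 plus 1 (which would explain suggestions like 81, 73, and 65 frames):

```python
def latent_tokens(width: int, height: int, num_frames: int) -> int:
    """Approximate latent token count, assuming the VAE compresses
    32x32 spatially and 8x temporally, with frames padded to 8k+1."""
    latent_frames = (num_frames - 1 + 7) // 8 + 1  # pad to a multiple of 8, plus 1
    return (width // 32) * (height // 32) * latent_frames

print(latent_tokens(1216, 704, 88))  # the reported working 88-frame setting
print(latent_tokens(1216, 704, 65))  # 25% fewer tokens at 65 frames
```

Since attention cost scales roughly quadratically in token count, dropping frames buys more headroom than the linear token reduction suggests.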
Because you just commented that you could run it on a 48GB 4090 under such settings.
Still OOM after setting 56 frames @eisneim
```
(ltx-video) root@test-video-diffusion-singlecard-0:/mnt/csi-data-aly/user/xingchenzhou/code/git/LTX-Video# python inference.py \
    --prompt "The video is captured from a camera mounted on a car. The camera is facing forward. \
The video is taken from the perspective of a vehicle's dashboard camera, showing a road passing through a construction zone. \
The road is partially lined with orange traffic cones and temporary barriers on both sides. \
The scene depicts dusk or early evening, with soft ambient light casting a glow on the surroundings. \
Ahead is a white car with illuminated tail lights driving on the same road. \
To the left are massive concrete pillars supporting an elevated highway or railway structure under construction. \
To the right is a temporary yellow/orange construction fence and a railway bridge visible in the distance. \
The urban landscape is visible in the background with buildings, trees, and utility poles. \
The overall scene conveys an urban development area with ongoing infrastructure construction work." \
    --conditioning_media_paths /mnt/csi-data-aly/user/xingchenzhou/code/git/neuralaug/data/deeproute_exp/YR-C01-7_20240219_103637/images/0178_10.jpg \
    --conditioning_start_frames 20 \
    --pipeline_config configs/ltxv-13b-0.9.7-dev.yaml \
    --num_frames 56

Running generation with arguments: Namespace(output_path=None, seed=171198, num_images_per_prompt=1, image_cond_noise_scale=0.15, height=704, width=1216, num_frames=56, frame_rate=30, device=None, pipeline_config='configs/ltxv-13b-0.9.7-dev.yaml', prompt="The video is captured from a camera mounted on a car. The camera is facing forward. The video is taken from the perspective of a vehicle's dashboard camera, showing a road passing through a construction zone. The road is partially lined with orange traffic cones and temporary barriers on both sides. The scene depicts dusk or early evening, with soft ambient light casting a glow on the surroundings. Ahead is a white car with illuminated tail lights driving on the same road. To the left are massive concrete pillars supporting an elevated highway or railway structure under construction. To the right is a temporary yellow/orange construction fence and a railway bridge visible in the distance. The urban landscape is visible in the background with buildings, trees, and utility poles. The overall scene conveys an urban development area with ongoing infrastructure construction work.", negative_prompt='worst quality, inconsistent motion, blurry, jittery, distorted', offload_to_cpu=False, input_media_path=None, strength=1.0, conditioning_media_paths=['/mnt/csi-data-aly/user/xingchenzhou/code/git/neuralaug/data/deeproute_exp/YR-C01-7_20240219_103637/images/0178_10.jpg'], conditioning_strengths=None, conditioning_start_frames=[20])
Padded dimensions: 704x1216x57
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 194.05it/s]
Traceback (most recent call last):
  File "/mnt/csi-data-aly/user/xingchenzhou/code/git/LTX-Video/inference.py", line 776, in <module>
    main()
  File "/mnt/csi-data-aly/user/xingchenzhou/code/git/LTX-Video/inference.py", line 298, in main
    infer(**vars(args))
  File "/mnt/csi-data-aly/user/xingchenzhou/code/git/LTX-Video/inference.py", line 535, in infer
    pipeline = create_ltx_video_pipeline(
  File "/mnt/csi-data-aly/user/xingchenzhou/code/git/LTX-Video/inference.py", line 343, in create_ltx_video_pipeline
    text_encoder = text_encoder.to(device)
  File "/mnt/csi-data-aly/user/xingchenzhou/miniconda3/envs/ltx-video/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3698, in to
    return super().to(*args, **kwargs)
  File "/mnt/csi-data-aly/user/xingchenzhou/miniconda3/envs/ltx-video/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1355, in to
    return self._apply(convert)
  File "/mnt/csi-data-aly/user/xingchenzhou/miniconda3/envs/ltx-video/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/mnt/csi-data-aly/user/xingchenzhou/miniconda3/envs/ltx-video/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/mnt/csi-data-aly/user/xingchenzhou/miniconda3/envs/ltx-video/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/mnt/csi-data-aly/user/xingchenzhou/miniconda3/envs/ltx-video/lib/python3.10/site-packages/torch/nn/modules/module.py", line 942, in _apply
    param_applied = fn(param)
  File "/mnt/csi-data-aly/user/xingchenzhou/miniconda3/envs/ltx-video/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1341, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacity of 44.43 GiB of which 8.31 MiB is free. Process 3303141 has 44.41 GiB memory in use. Of the allocated memory 44.08 GiB is allocated by PyTorch, and 55.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
And setting 1024x576 is not helping. Everything fails before the diffusion process even starts, so maybe the problem is in loading the model.
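For what it's worth, a back-of-the-envelope estimate supports the loading theory. The parameter counts and precisions below are assumptions, not measured values: a 13B transformer in bf16 plus a T5-XXL-sized text encoder moved to the GPU in fp32 would already approach the ~44 GiB the traceback reports as allocated, before any activations or CUDA context overhead:

```python
def weights_gib(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB: params * bytes_per_param / 2^30."""
    return params_billion * 1e9 * bytes_per_param / 2**30

dit = weights_gib(13.0, 2.0)  # 13B transformer in bf16
t5 = weights_gib(4.7, 4.0)    # hypothetical: T5-XXL-sized encoder left in fp32
print(f"transformer {dit:.1f} GiB + text encoder {t5:.1f} GiB = {dit + t5:.1f} GiB")
```

If that is roughly what is happening, casting the text encoder to bf16 or quantizing it is the obvious lever, since lowering resolution or frame count only reduces activation memory, not weight memory.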
I set the T5 to 4-bit; this helps the model run without OOM, but it is still very close to OOM. Settings from: https://github.com/Lightricks/LTX-Video/issues/166
@StarsTesla it's because you did not turn off prompt enhancement! You should either write a longer prompt or set the prompt enhancement threshold to 0.
@eisneim No, I gave a very long prompt, like:
The video is captured from a camera mounted on a car. The camera is facing forward. The video is taken from the perspective of a vehicle's dashboard camera, showing dusk footage at a broad crosswalk in a Chinese urban environment. The scene depicts a wide multi-lane road with a clear view of an intersection with zebra crossing markings visible in the foreground. The digital traffic signal countdown displays decreasing numbers from 14 to 7 seconds as the recording progresses. High-rise residential apartment buildings with distinctive tan and gray facades dominate both sides of the street, creating an urban canyon effect. The twilight sky casts a cool blue-gray ambience over the scene as daylight fades. Traffic moves steadily, with white SUVs and sedans visible both ahead on the road and crossing perpendicular to the camera's position. A vehicle can be seen making a left turn across the intersection. Tree-lined medians and sidewalks frame the roadway, while street lamps with dual-headed fixtures are positioned at regular intervals. In the far distance, the road stretches toward what appears to be a mountain or hillside silhouette, creating depth to the urban landscape corridor.
I am sure this passes the threshold, so enhancement should not be applied.
> You should post how you run it: what file? what command? what error? and how big is the video? Saying "failed" does not provide any context.
> I can run 13B on an RTX 4090 48GB to generate 1216x704, 88 frames, without offloading to CPU.
I met the OOM error when using the command:

```
python inference.py --prompt "Two police officers in dark blue uniforms and matching hats enter a dimly lit room through a doorway on the left side of the frame. The first officer, with short brown hair and a mustache, steps inside first, followed by his partner, who has a shaved head and a goatee. Both officers have serious expressions and maintain a steady pace as they move deeper into the room. The camera remains stationary, capturing them from a slightly low angle as they enter. The room has exposed brick walls and a corrugated metal ceiling, with a barred window visible in the background. The lighting is low-key, casting shadows on the officers' faces and emphasizing the grim atmosphere. The scene appears to be from a film or television show." --height 768 --width 1024 --num_frames 144 --seed 0 --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
```

Then I got the OOM error. I found in another issue that some people run the model successfully with less VRAM. Is there anything I neglected?
See this issue for detail: #201
@TediousBoredom Try things like clearing the cache, loading the transformer and VAE in bf16, offloading the VAE, text encoder, and latent upsampler when unused, and using a 4-bit text encoder. After doing all that, I can run 768x1024x88 using about 35-40GB of video memory on 13b-0.9.7.
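A minimal sketch of that offload pattern; the component names in the comments are assumptions about the pipeline object, not the repo's actual API:

```python
import gc
import torch

def offload_to_cpu(module: torch.nn.Module) -> None:
    """Move a module off the GPU and release its cached blocks."""
    module.to("cpu")
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks to the driver

# Hypothetical flow with assumed component names:
# 1. encode prompts, then offload_to_cpu(pipeline.text_encoder)
# 2. denoise with the bf16 transformer
# 3. offload_to_cpu(pipeline.transformer) before the VAE decodes
# 4. move the latent upsampler to GPU only for its pass, then offload it
```

The key point is that only one large component needs to be resident at a time; `empty_cache()` matters because PyTorch's caching allocator otherwise keeps the freed blocks reserved.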