
flux-demo failure on TensorRT 10.5 when running on a single L40 GPU; how to run with two L40 GPUs

Open algorithmconquer opened this issue 1 year ago • 3 comments

Currently, when I run Flux on a device with a single L40 GPU, I encounter an OutOfMemory error. I have another device with two L40 GPUs. How can I implement multi-GPU usage to run Flux?

algorithmconquer avatar Oct 17 '24 02:10 algorithmconquer

You can split your model.

lix19937 avatar Oct 18 '24 14:10 lix19937

also cc: @asfiyab-nvidia

yuanyao-nv avatar Oct 18 '24 22:10 yuanyao-nv

@lix19937 How do I split the model for this issue? Could you provide relevant code and resources?

algorithmconquer avatar Oct 21 '24 02:10 algorithmconquer

Like the following:

Assume `model = cnn_backbone + cnn_neck + transformer_with_cnn_head`. Then you can export `cnn_backbone + cnn_neck` as onnx_a and `transformer_with_cnn_head` as onnx_b, and use trtexec to build `onnx_a -> plan_a` and `onnx_b -> plan_b`; see the sketch below.

plan_a runs on device 0, plan_b runs on device 1.
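For illustration, a pair of trtexec invocations along these lines would build one engine per GPU (the file names are placeholders; `--device` selects the GPU the engine is built and profiled on):

```shell
trtexec --onnx=onnx_a.onnx --saveEngine=plan_a.plan --device=0
trtexec --onnx=onnx_b.onnx --saveEngine=plan_b.plan --device=1
```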

In more detail:
Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling execute() or enqueue(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary.
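As a rough sketch of the loading side (my own illustration, not from the flux demo; it assumes the two plan files built above and the cuda-python bindings for `cudaSetDevice`):

```python
import tensorrt as trt
from cuda import cudart  # cuda-python bindings

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(plan_path, device):
    # The engine is bound to whichever GPU is current at deserialization,
    # so select the device first.
    cudart.cudaSetDevice(device)
    with open(plan_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine_a = load_engine("plan_a.plan", 0)
engine_b = load_engine("plan_b.plan", 1)

# Each execution context inherits its engine's GPU.
cudart.cudaSetDevice(0)
ctx_a = engine_a.create_execution_context()
cudart.cudaSetDevice(1)
ctx_b = engine_b.create_execution_context()

# At inference time, make the matching device current in this thread
# before enqueueing, and move the intermediate tensor across GPUs
# (e.g. with cudaMemcpyPeer) between the two stages:
cudart.cudaSetDevice(0)
# ... bind GPU-0 buffers, then ctx_a.execute_v2(bindings_a)
cudart.cudaSetDevice(1)
# ... copy stage-A output to GPU 1, then ctx_b.execute_v2(bindings_b)
```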

lix19937 avatar Oct 22 '24 09:10 lix19937