
TensorRT has no output

Open · hexdrx opened this issue 10 months ago · 1 comment

I built the TensorRT variant of the base model, but after starting the server it produces no output.

Here is my terminal output:

root@e8c247ca52a0:/app# bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples base 
Requirement already satisfied: tensorrt_llm==0.15.0.dev2024111200 in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 1)) (0.15.0.dev2024111200)
Requirement already satisfied: tiktoken in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 2)) (0.9.0)
Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 3)) (3.5.1)
Requirement already satisfied: kaldialign in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 4)) (0.9.1)
Requirement already satisfied: openai-whisper in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 5)) (20240930)
Requirement already satisfied: librosa in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 6)) (0.11.0)
Requirement already satisfied: soundfile in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 7)) (0.13.1)
Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 8)) (0.5.3)
Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 9)) (4.45.1)
Requirement already satisfied: janus in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 10)) (2.0.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Downloading base...
--2025-05-03 10:29:35--  https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt
Resolving openaipublic.azureedge.net (openaipublic.azureedge.net)... 13.107.246.53, 2620:1ec:bdf::53
Connecting to openaipublic.azureedge.net (openaipublic.azureedge.net)|13.107.246.53|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 145262807 (139M) [application/octet-stream]
Saving to: 'assets/base.pt'

base.pt                          100%[==========================================================>] 138.53M  8.79MB/s    in 15s     

2025-05-03 10:29:50 (9.55 MB/s) - 'assets/base.pt' saved [145262807/145262807]

Download completed: base.pt
whisper_base_float16
Converting model weights for base...
[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024111200
0.15.0.dev2024111200
/app/TensorRT-LLM-examples/whisper/convert_checkpoint.py:394: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  model = torch.load(model_path, map_location='cpu')
Loaded model from assets/base.pt
Converting encoder checkpoints...
Converting decoder checkpoints...
Total time of converting checkpoints: 00:00:00
Building encoder for base...
[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024111200
[05/03/2025-10:30:02] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set gemm_plugin to None.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set nccl_plugin to auto.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set lora_plugin to None.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set moe_plugin to None.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set context_fmha to True.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set remove_input_padding to True.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set reduce_fusion to False.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set enable_xqa to False.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set tokens_per_block to 64.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set multiple_profiles to False.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set paged_state to True.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set streamingllm to False.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set use_fused_mlp to True.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[05/03/2025-10:30:02] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_position_embedding = True
[05/03/2025-10:30:02] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 80
[05/03/2025-10:30:02] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 99
[05/03/2025-10:30:02] [TRT-LLM] [I] Compute capability: (8, 6)
[05/03/2025-10:30:02] [TRT-LLM] [I] SM count: 82
[05/03/2025-10:30:02] [TRT-LLM] [I] SM clock: 2100 MHz
[05/03/2025-10:30:02] [TRT-LLM] [I] int4 TFLOPS: 705
[05/03/2025-10:30:02] [TRT-LLM] [I] int8 TFLOPS: 352
[05/03/2025-10:30:02] [TRT-LLM] [I] fp8 TFLOPS: 0
[05/03/2025-10:30:02] [TRT-LLM] [I] float16 TFLOPS: 176
[05/03/2025-10:30:02] [TRT-LLM] [I] bfloat16 TFLOPS: 176
[05/03/2025-10:30:02] [TRT-LLM] [I] float32 TFLOPS: 88
[05/03/2025-10:30:02] [TRT-LLM] [I] Total Memory: 24 GiB
[05/03/2025-10:30:02] [TRT-LLM] [I] Memory clock: 9751 MHz
[05/03/2025-10:30:02] [TRT-LLM] [I] Memory bus width: 384
[05/03/2025-10:30:02] [TRT-LLM] [I] Memory bandwidth: 936 GB/s
[05/03/2025-10:30:02] [TRT-LLM] [I] NVLink is active: False
[05/03/2025-10:30:02] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[05/03/2025-10:30:02] [TRT-LLM] [I] PCIe link width: 16
[05/03/2025-10:30:02] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[05/03/2025-10:30:02] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[05/03/2025-10:30:02] [TRT-LLM] [I] Set dtype to float16.
[05/03/2025-10:30:02] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/03/2025-10:30:02] [TRT-LLM] [W] Overriding paged_state to False
[05/03/2025-10:30:02] [TRT-LLM] [I] Set paged_state to False.
[05/03/2025-10:30:02] [TRT-LLM] [W] max_seq_len 3000 is larger than max_position_embeddings 1500 * rotary scaling 1, the model accuracy might be affected
[05/03/2025-10:30:02] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[05/03/2025-10:30:02] [TRT-LLM] [W] max_num_tokens (3000) shouldn't be greater than max_seq_len * max_batch_size (3000), specifying to max_seq_len * max_batch_size (3000).
[05/03/2025-10:30:02] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[05/03/2025-10:30:02] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 162, GPU 508 (MiB)
[05/03/2025-10:30:05] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +2166, GPU +406, now: CPU 2484, GPU 914 (MiB)
[05/03/2025-10:30:05] [TRT-LLM] [I] Set nccl_plugin to None.
[05/03/2025-10:30:05] [TRT-LLM] [I] Total time of constructing network from module object 3.392195463180542 seconds
[05/03/2025-10:30:05] [TRT-LLM] [I] Total optimization profiles added: 1
[05/03/2025-10:30:05] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[05/03/2025-10:30:05] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[05/03/2025-10:30:05] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[05/03/2025-10:30:06] [TRT] [I] Compiler backend is used during engine build.
[05/03/2025-10:30:28] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[05/03/2025-10:30:29] [TRT] [I] Total Host Persistent Memory: 13792 bytes
[05/03/2025-10:30:29] [TRT] [I] Total Device Persistent Memory: 0 bytes
[05/03/2025-10:30:29] [TRT] [I] Max Scratch Memory: 33554688 bytes
[05/03/2025-10:30:29] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 40 steps to complete.
[05/03/2025-10:30:29] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.976605ms to assign 8 blocks to 40 nodes requiring 44308992 bytes.
[05/03/2025-10:30:29] [TRT] [I] Total Activation Memory: 44307456 bytes
[05/03/2025-10:30:29] [TRT] [I] Total Weights Memory: 42742404 bytes
[05/03/2025-10:30:29] [TRT] [I] Compiler backend is used during engine execution.
[05/03/2025-10:30:29] [TRT] [I] Engine generation completed in 24.1297 seconds.
[05/03/2025-10:30:29] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 41 MiB
[05/03/2025-10:30:29] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:24
[05/03/2025-10:30:29] [TRT] [I] Serialized 1208 bytes of code generator cache.
[05/03/2025-10:30:29] [TRT] [I] Serialized 1737792 bytes of compilation cache.
[05/03/2025-10:30:29] [TRT] [I] Serialized 124 timing cache entries
[05/03/2025-10:30:29] [TRT-LLM] [I] Timing cache serialized to model.cache
[05/03/2025-10:30:29] [TRT-LLM] [I] Build phase peak memory: 5991.59 MB, children: 25.12 MB
[05/03/2025-10:30:29] [TRT-LLM] [I] Serializing engine to whisper_base_float16/encoder/rank0.engine...
[05/03/2025-10:30:29] [TRT-LLM] [I] Engine serialized. Total time: 00:00:00
[05/03/2025-10:30:29] [TRT-LLM] [I] Total time of building all engines: 00:00:27
Building decoder for base...
[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024111200
[05/03/2025-10:30:36] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set gemm_plugin to float16.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set nccl_plugin to auto.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set lora_plugin to None.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set moe_plugin to None.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set context_fmha to True.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set remove_input_padding to True.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set reduce_fusion to False.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set enable_xqa to False.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set tokens_per_block to 64.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set multiple_profiles to False.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set paged_state to True.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set streamingllm to False.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set use_fused_mlp to True.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.use_prompt_tuning = False
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_position_embedding = True
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layernorm_type = 0
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_attention_qkvo_bias = True
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_mlp_bias = True
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_model_final_layernorm = True
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_embedding_layernorm = False
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_embedding_scale = False
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.ffn_hidden_size = 2048
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.q_scaling = 1.0
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layernorm_position = 0
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.relative_attention = False
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.max_distance = 0
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_buckets = 0
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.model_type = whisper
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rescale_before_lm_head = False
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.encoder_hidden_size = 512
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.encoder_num_heads = 8
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.encoder_head_size = None
[05/03/2025-10:30:36] [TRT-LLM] [W] Implicitly setting PretrainedConfig.skip_cross_kv = False
[05/03/2025-10:30:36] [TRT-LLM] [I] Compute capability: (8, 6)
[05/03/2025-10:30:36] [TRT-LLM] [I] SM count: 82
[05/03/2025-10:30:36] [TRT-LLM] [I] SM clock: 2100 MHz
[05/03/2025-10:30:36] [TRT-LLM] [I] int4 TFLOPS: 705
[05/03/2025-10:30:36] [TRT-LLM] [I] int8 TFLOPS: 352
[05/03/2025-10:30:36] [TRT-LLM] [I] fp8 TFLOPS: 0
[05/03/2025-10:30:36] [TRT-LLM] [I] float16 TFLOPS: 176
[05/03/2025-10:30:36] [TRT-LLM] [I] bfloat16 TFLOPS: 176
[05/03/2025-10:30:36] [TRT-LLM] [I] float32 TFLOPS: 88
[05/03/2025-10:30:36] [TRT-LLM] [I] Total Memory: 24 GiB
[05/03/2025-10:30:36] [TRT-LLM] [I] Memory clock: 9751 MHz
[05/03/2025-10:30:36] [TRT-LLM] [I] Memory bus width: 384
[05/03/2025-10:30:36] [TRT-LLM] [I] Memory bandwidth: 936 GB/s
[05/03/2025-10:30:36] [TRT-LLM] [I] NVLink is active: False
[05/03/2025-10:30:36] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[05/03/2025-10:30:36] [TRT-LLM] [I] PCIe link width: 16
[05/03/2025-10:30:36] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[05/03/2025-10:30:36] [TRT-LLM] [I] Set dtype to float16.
[05/03/2025-10:30:36] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/03/2025-10:30:36] [TRT-LLM] [W] Overriding paged_state to False
[05/03/2025-10:30:36] [TRT-LLM] [I] Set paged_state to False.
[05/03/2025-10:30:36] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[05/03/2025-10:30:36] [TRT-LLM] [W] max_num_tokens (200) shouldn't be greater than max_seq_len * max_batch_size (200), specifying to max_seq_len * max_batch_size (200).
[05/03/2025-10:30:36] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[05/03/2025-10:30:36] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 162, GPU 508 (MiB)
[05/03/2025-10:30:39] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +2166, GPU +406, now: CPU 2484, GPU 914 (MiB)
[05/03/2025-10:30:39] [TRT-LLM] [I] Set nccl_plugin to None.
[05/03/2025-10:30:40] [TRT-LLM] [I] Total time of constructing network from module object 3.415430784225464 seconds
[05/03/2025-10:30:40] [TRT-LLM] [I] Total optimization profiles added: 1
[05/03/2025-10:30:40] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[05/03/2025-10:30:40] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[05/03/2025-10:30:40] [TRT] [W] Unused Input: host_kv_cache_block_offsets
[05/03/2025-10:30:40] [TRT] [W] Unused Input: cross_kv_cache_gen
[05/03/2025-10:30:40] [TRT] [W] [RemoveDeadLayers] Input Tensor host_kv_cache_block_offsets is unused or used only at compile-time, but is not being removed.
[05/03/2025-10:30:40] [TRT] [W] [RemoveDeadLayers] Input Tensor cross_kv_cache_gen is unused or used only at compile-time, but is not being removed.
[05/03/2025-10:30:40] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[05/03/2025-10:30:40] [TRT] [I] Compiler backend is used during engine build.
[05/03/2025-10:30:42] [TRT] [E] Error Code: 9: Skipping tactic 0x00000000000003ea due to exception Unsupported data type Bool.
[05/03/2025-10:30:42] [TRT] [I] Detected 27 inputs and 1 output network tensors.
[05/03/2025-10:30:43] [TRT] [I] Total Host Persistent Memory: 37136 bytes
[05/03/2025-10:30:43] [TRT] [I] Total Device Persistent Memory: 0 bytes
[05/03/2025-10:30:43] [TRT] [I] Max Scratch Memory: 33569280 bytes
[05/03/2025-10:30:43] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 211 steps to complete.
[05/03/2025-10:30:43] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 26.0572ms to assign 28 blocks to 211 nodes requiring 42925568 bytes.
[05/03/2025-10:30:43] [TRT] [I] Total Activation Memory: 42924544 bytes
[05/03/2025-10:30:43] [TRT] [I] Total Weights Memory: 157136256 bytes
[05/03/2025-10:30:43] [TRT] [I] Compiler backend is used during engine execution.
[05/03/2025-10:30:43] [TRT] [I] Engine generation completed in 3.00621 seconds.
[05/03/2025-10:30:43] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 150 MiB
[05/03/2025-10:30:43] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:03
[05/03/2025-10:30:43] [TRT] [I] Serialized 27 bytes of code generator cache.
[05/03/2025-10:30:43] [TRT] [I] Serialized 122763 bytes of compilation cache.
[05/03/2025-10:30:43] [TRT] [I] Serialized 37 timing cache entries
[05/03/2025-10:30:43] [TRT-LLM] [I] Timing cache serialized to model.cache
[05/03/2025-10:30:43] [TRT-LLM] [I] Build phase peak memory: 4851.33 MB, children: 24.88 MB
[05/03/2025-10:30:43] [TRT-LLM] [I] Serializing engine to whisper_base_float16/decoder/rank0.engine...
[05/03/2025-10:30:43] [TRT-LLM] [I] Engine serialized. Total time: 00:00:00
[05/03/2025-10:30:43] [TRT-LLM] [I] Total time of building all engines: 00:00:06
TensorRT LLM engine built for base.
=========================================
Model is located at: /app/TensorRT-LLM-examples/whisper/whisper_base_float16
root@e8c247ca52a0:/app# python3 run_server.py --port 9090 \
                      --backend tensorrt \
                      --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_base_float16" \
                      --trt_multilingual
INFO:root:Custom model option was provided. Switching to single model mode.

In top, the process just shows as sleeping.

hexdrx avatar May 03 '25 10:05 hexdrx

@hexdrx you need to run the client and connect to the server to instantiate a TensorRT model instance on the server. The server stays idle (sleeping) until the first client connection arrives; that is expected behavior, not a hang.
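
For reference, connecting a client could look roughly like this (a minimal sketch based on the WhisperLive README; it assumes the whisper_live package is installed on the client machine and that an audio file such as audio.wav exists — constructor arguments may differ between versions):

```python
# Hypothetical client-side sketch; requires the whisper_live package
# and the TensorRT server from this issue running on localhost:9090.
from whisper_live.client import TranscriptionClient

# Connecting a client is what triggers the server to load the TRT engine.
client = TranscriptionClient(
    "localhost",          # host where run_server.py is listening
    9090,                 # port passed to run_server.py --port
    lang="en",            # transcription language
    translate=False,      # set True to translate to English instead
    model="base",         # ignored in single-model (TRT) mode on the server
    use_vad=False,        # voice-activity detection on the server side
)

# Transcribe a local file; calling client() with no argument would
# instead stream from the microphone.
client("audio.wav")
```

Once the client connects, the server should log the engine being loaded and start emitting transcription output; with no client attached it produces nothing, which matches the behavior reported above.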

makaveli10 avatar May 04 '25 14:05 makaveli10