Amit Timalsina

> hey @amit-timalsina, the picture above describes the serving case for a two-tower retrieval model. How? Can you please elaborate?

You have to use the OpenAI-compatible Triton frontend, but note that it is in beta: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client_guide/openai_readme.html I have already done this, so let me know if you need further help.
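
Once the frontend is up, you can talk to it with the standard `openai` client. A minimal sketch, assuming the frontend is listening on its default port 9000 and serving a hypothetical model named `llama-3.1-8b-instruct`:

```
# Minimal sketch: query Triton's OpenAI-compatible frontend with the
# standard `openai` client. The port and model name are assumptions;
# use whatever your deployment actually exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",
    api_key="EMPTY",  # the beta frontend does not validate API keys
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # hypothetical model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```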

This should help: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tutorials/Quick_Deploy/vLLM/README.html Triton is a wrapper around multiple inference backends like vLLM, TensorRT-LLM, Python, etc. You have to set up a model repository, provide it to Triton, and you can...
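
For reference, the repository layout from that tutorial looks roughly like this; the model name `vllm_model` and the engine arguments in `model.json` are the tutorial's example values, not requirements:

```
model_repository/
└── vllm_model/
    ├── 1/
    │   └── model.json   # vLLM engine args, e.g.:
    │                    # {"model": "facebook/opt-125m",
    │                    #  "disable_log_requests": true,
    │                    #  "gpu_memory_utilization": 0.5}
    └── config.pbtxt     # Triton model config (backend: "vllm")
```

Then point the server at it, e.g. `tritonserver --model-repository ./model_repository`.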

The Triton server is running, right? Then it should work.
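
If you want to verify, Triton exposes a KServe readiness endpoint over its HTTP port (8000 by default):

```
# Minimal sketch: poll Triton's readiness endpoint.
# Assumes the default HTTP port 8000 on localhost.
import requests

resp = requests.get("http://localhost:8000/v2/health/ready")
print("Triton ready:", resp.status_code == 200)
```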

@dinhanhx Did you find a fix? I need to use vllm > 0.10.0 because I need to use transformers>=4.54.0, as it has support for gpt-oss and doesn't have the...

```
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        self.args = args
        self.logger = pb_utils.Logger
        self.model_config = json.loads(args["model_config"])
        # Look up the dtype Triton expects for the "text_output" tensor.
        output_config = pb_utils.get_output_config_by_name(
            self.model_config, "text_output"
        )
        self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])
        # Setup vLLM engine health check...
```
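
For context, in Triton's Python backend this class lives in `model.py` under the model's version directory (e.g. `model_repository/<model_name>/1/model.py`).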