GuanLuo


I think this change is fine. @Tabrizian, we probably need to update the CI Dockerfile to install 1.41 explicitly?

CC @pranavsharma: does ORT provide an API for doing so? Or can an ORT session be run for different inferences in parallel?

> Someone has submitted code changes to share a session between different instances. We're reviewing the changes. This should fix the memory consumption problem. Yes, this is what I was...

I agree that the current implementation doesn't cover all optimization levels exposed by ORT. @pranavsharma @askhade, should the ORT backend change the interpretation of the [`level` field](https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto#L697-L713) in the model config so that...
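For reference, a minimal sketch of where that field lives in a model's `config.pbtxt` (the value shown is illustrative; how the ORT backend should map it to ORT's own optimization levels is exactly the open question above):

```
# Illustrative config.pbtxt fragment. The `level` field under
# optimization.graph is the one linked above; the value 1 is only an example.
optimization {
  graph {
    level: 1
  }
}
```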

Creating 2 InferHandlers is just an empirical decision based on experiments and the [gRPC performance guide](https://grpc.io/docs/guides/performance/#c). Adding a point on Q1: as Iman mentioned, you should experiment with more model instances...
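On experimenting with more model instances, a sketch of the relevant `config.pbtxt` fragment (the count of 4 is an arbitrary starting point for experiments, not a recommendation):

```
# Raise the instance count so more requests can execute concurrently
# on each GPU; tune the count empirically against your latency target.
instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]
```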

Now, in the model load API, you can [specify a string representation of the config file](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_repository.md#load) as part of the request, which may address your use case; in the override...

I believe you can send a request like the following to load the specified version (taking "3" as an example): ``` POST /v2/repository/models/mymodel/load HTTP/1.1 Host: localhost:8000 { "parameters": { "config": "{...
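Since the example above is cut off, here is a sketch of what the complete request could look like when overriding the version policy to load only version 3 (the escaped-JSON `config` value is an assumption about the truncated part; field names follow the model configuration schema, and the server may expect additional config fields depending on the model):

```
POST /v2/repository/models/mymodel/load HTTP/1.1
Host: localhost:8000

{
  "parameters": {
    "config": "{\"version_policy\": {\"specific\": {\"versions\": [3]}}}"
  }
}
```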

GPU utilization refers to whether the GPU is computing at full capacity. If you use `nvidia-smi` to monitor GPU status, you will notice that GPU utilization is a different...
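As a quick way to watch that number, a hedged example using `nvidia-smi`'s query mode (the field names are standard, but check `nvidia-smi --help-query-gpu` on your system):

```
# Poll GPU utilization and memory usage once per second.
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1
```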

@Tabrizian can provide more detail. AFAIK, we built the Python backend for Jetson with `TRITON_ENABLE_GPU=OFF` because otherwise it uses the CUDA IPC feature, which is not supported on Jetson. I think the...
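For anyone rebuilding the Python backend themselves, that flag is a CMake option; a rough sketch under that assumption, not the exact command used for the Jetson release (other required options and paths are omitted):

```
# Configure the Python backend without GPU support so the CUDA IPC
# path (unsupported on Jetson) is never used.
cmake -DTRITON_ENABLE_GPU=OFF ..
make install
```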