GuanLuo


I think this change is fine. @Tabrizian, we probably need to update the CI Dockerfile to install 1.41 explicitly?

CC @pranavsharma: does ORT provide an API for doing so? Or can an ORT session be run for different inferences in parallel?

> Someone has submitted code changes to share a session between different instances. We're reviewing the changes. This should fix the memory consumption problem. Yes, this is what I was...

I agree that the current implementation doesn't cover all optimization levels exposed by ORT. @pranavsharma @askhade, should the ORT backend change the interpretation of the [`level` field](https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto#L697-L713) in the model config so that...
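For reference, a minimal sketch of where that field lives in a model's `config.pbtxt` (the value shown is illustrative; how the ORT backend should map it to ORT's own optimization levels is exactly the open question above):

```
# Illustrative config.pbtxt fragment. The `level` field under
# optimization.graph is the one linked above; the value 1 is only an example.
optimization {
  graph {
    level: 1
  }
}
```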

Creating 2 InferHandlers is just an empirical decision based on experiments and the [gRPC performance guide](https://grpc.io/docs/guides/performance/#c). Adding a point on Q1: as Iman mentioned, you should experiment with more model instances...
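On experimenting with more model instances, a sketch of the relevant `config.pbtxt` fragment (the count of 4 is an arbitrary starting point for experiments, not a recommendation):

```
# Raise the instance count so more requests can execute concurrently
# on each GPU; tune the count empirically against your latency target.
instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]
```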

Now, in the model load API, you can [specify a string representation of the config file](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_repository.md#load) as part of the request, which may address your use case; in the override...

I believe you can send a request like the following to load the specified version (taking "3" as an example): ``` POST /v2/repository/models/mymodel/load HTTP/1.1 Host: localhost:8000 { "parameters": { "config": "{...
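Since the example above is cut off, here is a sketch of what the complete request could look like when overriding the version policy to load only version 3 (the escaped-JSON `config` value is an assumption about the truncated part; field names follow the model configuration schema, and the server may expect additional config fields depending on the model):

```
POST /v2/repository/models/mymodel/load HTTP/1.1
Host: localhost:8000

{
  "parameters": {
    "config": "{\"version_policy\": {\"specific\": {\"versions\": [3]}}}"
  }
}
```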

GPU utilization refers to whether the GPU is computing at full capacity. If you use `nvidia-smi` to monitor GPU status, you will notice that GPU utilization is a different...
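As a quick way to watch that number, a hedged example using `nvidia-smi`'s query mode (the field names are standard, but check `nvidia-smi --help-query-gpu` on your system):

```
# Poll GPU utilization and memory usage once per second.
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1
```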

@Tabrizian can provide more detail. AFAIK, we built the Python backend for Jetson with `TRITON_ENABLE_GPU=OFF` because otherwise it uses the CUDA IPC feature, which is not supported on Jetson. I think the...
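For anyone rebuilding the Python backend themselves, that flag is a CMake option; a rough sketch under that assumption, not the exact command used for the Jetson release (other required options and paths are omitted):

```
# Configure the Python backend without GPU support so the CUDA IPC
# path (unsupported on Jetson) is never used.
cmake -DTRITON_ENABLE_GPU=OFF ..
make install
```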