How to trigger model unloading so that Triton won't OOM?
Hi, I am wondering how to avoid Triton OOM after loading several models. I load several model replicas with memory limits (e.g., 4Gi) defined in triton-2.x.yaml. As I create InferenceServices (isvc) one by one, I notice that memory usage (both GPU and CPU) keeps increasing.
What I expect to happen: with a 4Gi memory limit and models of ~500M each, creating a new isvc while memory consumption is below 3.5Gi should simply load the new model. Creating a new isvc while consumption is above 3.5Gi should trigger unloading of an existing model to make room for the new one.
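To make the expected behavior concrete, here is a toy LRU eviction sketch (a hypothetical model of the behavior I expect, not ModelMesh's actual implementation; the names and sizes are illustrative):

```python
from collections import OrderedDict

CAPACITY = 4 * 1024**3      # 4Gi memory limit from triton-2.x.yaml
MODEL_SIZE = 500 * 1024**2  # ~500M per model

class LruModelCache:
    """Toy LRU cache: evict least-recently-used models to fit a new one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.models = OrderedDict()  # name -> size, oldest entry first
        self.used = 0

    def load(self, name, size):
        # Evict LRU models until the new model fits under the capacity.
        while self.used + size > self.capacity and self.models:
            evicted, evicted_size = self.models.popitem(last=False)
            self.used -= evicted_size
            print(f"unloading {evicted} to free {evicted_size} bytes")
        self.models[name] = size
        self.used += size

cache = LruModelCache(CAPACITY)
for i in range(10):
    cache.load(f"model-{i}", MODEL_SIZE)
print(len(cache.models))  # 8: only eight 500M models fit under 4Gi
```

Under this model, loading an eleventh, twelfth, etc. model keeps succeeding because older models are evicted first, and memory usage never exceeds the limit, which is the behavior I expected but did not observe.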
What I actually see:
I can keep loading models until the Triton log reports a load failure (OOM). Below is the log:
```
2023-06-10T09:49:10.430806856-04:00 {"instant":{"epochSecond":1686404950,"nanoOfSecond":430755558},"thread":"model-load-model-xxx","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Starting load for model model-xxx type=mt:onnx","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":66,"threadPriority":5}
2023-06-10T09:49:10.431219969-04:00 {"instant":{"epochSecond":1686404950,"nanoOfSecond":431176938},"thread":"mm-task-thread-3","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Published new instance record: InstanceRecord [lruTime=1686404821197 (2 minutes ago), count=5, capacity=603325, used=488952 (81%), loc=xxx, zone=<none>, labels=[mt:keras, mt:keras:2, mt:onnx, mt:onnx:1, mt:pytorch, mt:pytorch:1, mt:tensorflow, mt:tensorflow:1, mt:tensorflow:2, mt:tensorrt, mt:tensorrt:7, pv:grpc-v2, pv:v2, rt:triton-2.x], startTime=1686367870626 (10 hours ago), vers=0, loadThreads=2, loadInProg=1, reqsPerMin=1], UBW=61036, TUW=-61036, TCO=488952","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":40,"threadPriority":5}
2023-06-10T09:49:11.261471567-04:00 {"instant":{"epochSecond":1686404951,"nanoOfSecond":261330791},"thread":"model-load-model-xxx","level":"ERROR","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Load failed for model model-xxx type=mt:onnx after 830ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":66,"threadPriority":5}
```
What I've tried:
I tried changing the memory-management settings, hoping the system would show the expected behavior. I set the following in triton-2.x.yaml:
```yaml
env:
  - name: MODELSIZE_MULTIPLIER
    value: "2"
  - name: DEFAULT_MODELSIZE
    value: "500000000"
```
But Triton still reports OOM when loading a new isvc.
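For reference, here is my (hedged) understanding of how those two settings combine into a per-model size estimate: DEFAULT_MODELSIZE is the size assumed for a model before a real size is known, and MODELSIZE_MULTIPLIER is a safety factor applied to size estimates. If that interpretation is right, the back-of-the-envelope capacity math looks like this:

```python
# Assumed (hedged) interpretation of the two env vars from triton-2.x.yaml:
#   DEFAULT_MODELSIZE    - bytes assumed for a model with no reported size
#   MODELSIZE_MULTIPLIER - safety factor applied to the size estimate
DEFAULT_MODELSIZE = 500_000_000
MODELSIZE_MULTIPLIER = 2

# Effective per-model footprint used for capacity accounting (assumption).
effective_size = DEFAULT_MODELSIZE * MODELSIZE_MULTIPLIER  # 1_000_000_000

capacity = 4 * 1024**3  # 4Gi pod memory limit

# How many models should fit before eviction is expected to kick in.
max_models = capacity // effective_size
print(max_models)  # 4
```

If the accounting really worked this way, eviction should start after roughly four models, yet I can keep loading well past that point until the OOM above.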
@seankst, see if this helps: https://github.com/kserve/modelmesh/issues/82#issuecomment-1582028690