Automatic loading and unloading of models
🚀 The feature
TorchServe would automatically load and unload models based on request traffic. For example, if I have registered 3 models in TorchServe and one of them receives no requests for, say, a day, TorchServe would automatically unload that model from memory. As soon as a request arrives for that model again, it would be loaded back into memory (like the behavior provided by AWS SageMaker multi-model endpoints).
Motivation, pitch
Currently, we have to use the management API to set the number of workers before we can run inference on a model. If a model is not going to be used for some time, I have to manually scale its workers down to 0; otherwise it keeps consuming resources even when idle. I would like to register all my models with 0 initial workers and have each model loaded with 1 worker automatically whenever an inference request arrives for it.
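For reference, here is the manual workflow today, sketched with Python's `requests` against the management API (port 8081 is TorchServe's default; the model name and `.mar` URL are placeholders):

```python
import requests

MGMT = "http://localhost:8081"  # default management API address
MODEL = "my_model"              # placeholder model name

# Register the model with 0 initial workers so it consumes
# no worker resources until it is actually needed.
requests.post(f"{MGMT}/models",
              params={"url": f"{MODEL}.mar", "initial_workers": 0})

# Before running inference, manually scale up to 1 worker...
requests.put(f"{MGMT}/models/{MODEL}",
             params={"min_worker": 1, "synchronous": "true"})

# ...and manually scale back down to 0 once the model goes idle.
requests.put(f"{MGMT}/models/{MODEL}",
             params={"min_worker": 0, "synchronous": "true"})
```

The requested feature would make the two scaling steps automatic: up when traffic arrives, down after an idle timeout.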
Alternatives
No response
Additional context
No response
@msaroufim
@amit-cashify @abhinav-cashify AWS SageMaker multi-model endpoints make calls to TorchServe to unload models based on memory usage. This elastic loading/unloading feature is provided by the SageMaker hosting service. Customers have to bear the inference latency spikes caused by model reloading.
On the TorchServe roadmap, we are going to address memory usage and elastic parallel processing by providing the following features:
- model sharing (i.e., one model copy in memory can be shared by multiple workers)
- the number of model workers (#model workers) scales elastically with inference traffic volume.
Note: here,
- #model workers = 0 does not mean #model copy = 0.
- #model copy = 0 only if an unload (unregister) model request is received, as in the sketch below.
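To make the distinction concrete, here is a minimal sketch against TorchServe's public management API (port 8081 is the default; `my_model` is a placeholder name). The comments describe the proposed model-sharing semantics, not necessarily today's behavior:

```python
import requests

MGMT = "http://localhost:8081"  # TorchServe's default management API address
MODEL = "my_model"              # placeholder model name

# Scale the model's workers down to 0 (#model workers = 0).
# Under the proposed model-sharing design, this alone would not
# drop the shared in-memory model copy (#model copy stays > 0).
requests.put(f"{MGMT}/models/{MODEL}",
             params={"min_worker": 0, "synchronous": "true"})

# Only an explicit unregister (unload) request removes the model
# copy itself (#model copy = 0).
requests.delete(f"{MGMT}/models/{MODEL}")
```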
Please let us know if you have any questions.
Any update on this?
Any update? I have pretty much the same problem.