Docker Swarm with TorchServe workflows
I want to scale TorchServe workflows through Docker Swarm. (I hope this is possible; if not, please tell me how it can be achieved. I know workflow scaling is not supported by TorchServe directly yet, which is why I'm using Docker to scale the workflow.) I have a few questions about running TorchServe as a Docker service in swarm mode, along with a few issues I've encountered.
Problem Statement:
- We are using a TorchServe workflow because multiple models are required to complete the use case.
- To keep the nodes comparable, I've set the number of workers to 2 on each node so that memory consumption stays below 16GB and every node has the same number of workers and the same memory budget.
- When the service is created, the manager node works fine with the TorchServe config below and completes the task in the desired time, but when the manager assigns the task to either worker node it takes roughly 3x longer.
- The problem we are facing: while a TorchServe worker is executing on a worker node, it appears to run in intervals, i.e., GPU utilization is not continuous, log output stops, and the response is delayed. If another request arrives in the meantime, the worker appears to stop executing the current request and start executing the new one.
- I did see something in the logs (unfortunately, I'm unable to provide the logs here): when m5 was executing and a new request came in, the current request simply stopped (at least that's how it looked in the logs; no error was thrown) and the new one started. Correct me if I'm wrong, but the old request should keep executing in the background, right?
- Now, the question is: does TorchServe support routing requests through Docker Swarm?
- If so, what would be the correct configuration to achieve similar results on all the nodes in the swarm, not just the manager?
My Docker Swarm Config:
- 3 nodes: 1 manager, 2 workers
- The manager has 4 × V100 SXM2 GPUs (32GB each); each worker has 4 × V100 SXM2 GPUs (16GB each)
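For reference, a Swarm deployment like this can be described in a stack file. The sketch below is illustrative only; the image tag, mount paths, and service name are assumptions, not taken from my actual setup:

```yaml
# docker-stack.yaml -- illustrative sketch, not the actual deployment
version: "3.8"
services:
  torchserve:
    image: pytorch/torchserve:0.10.0-gpu   # assumed image tag
    ports:
      - "8080:8080"   # inference API
      - "8081:8081"   # management API
      - "8082:8082"   # metrics API
    volumes:
      # assumed NFS-backed shared stores so every node sees the same artifacts
      - /mnt/nfs/model_store:/home/model-server/model_store
      - /mnt/nfs/wf_store:/home/model-server/wf_store
    deploy:
      replicas: 3   # one container per node in a 3-node swarm
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: "gpu"
                value: 4
```

Deploying with `docker stack deploy -c docker-stack.yaml torchserve` would spread the replicas across the nodes. Note that GPU scheduling in Swarm also requires each node to advertise its GPUs as generic resources in the Docker daemon configuration.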
My project config: (Please ignore the large timeouts; a single inference request takes around 10 minutes because it processes over 100 images in a batch.)
- There are 5 models
- model-config.yaml

```yaml
maxBatchDelay: 10000000
responseTimeout: 10000000
```
- workflow.yaml

```yaml
models:
    min-workers: 1
    max-workers: 2
    max-batch-delay: 10000000
    retry-attempts: 1
    timeout-ms: 3000000
    m1:
        url: model-1.mar
    m2:
        url: model-2.mar
    m3:
        url: model-3.mar
    m4:
        url: model-4.mar
    m5:
        url: model-5.mar
dag:
    pre_processing: [m1]
    m1: [m2]
    m2: [m3]
    m3: [m4]
    m4: [m5]
    m5: [post_processing]
```
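For completeness, this is roughly how a workflow spec like the one above gets packaged, registered, and invoked (the workflow name, handler file, and input file below are placeholders, not my real names):

```shell
# package the workflow spec and handler into a .war archive
torch-workflow-archiver --workflow-name my_workflow \
    --spec-file workflow.yaml \
    --handler workflow_handler.py \
    --export-path wf_store -f

# register it through the management API (port 8081)
curl -X POST "http://localhost:8081/workflows?url=my_workflow.war"

# run an inference through the workflow inference endpoint (port 8080)
curl -X POST "http://localhost:8080/wfpredict/my_workflow" -T input.jpg
```

These commands assume a running TorchServe instance whose `workflow_store` points at `wf_store`.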
- config.properties

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
# management
default_response_timeout=10000000
default_workers_per_model=2
load_models=
model_store=model_store
workflow_store=wf_store
enable_envvars_config=true
job_queue_size=3
```
Python packages:

```
torch==1.13.1+cu117
torchvision==0.14.1+cu117
torchaudio==0.13.1+cu117
torchserve==0.10.0
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
nvgpu==0.10.0
captum==0.7.0
```
Hi @KD1994
This is not something we have tried. We do have Kubernetes and KServe support.
I would start with something simpler: a single model served through Docker Swarm, and check whether you still see these performance issues. Unfortunately, we haven't been actively developing workflows, as we haven't come across specific asks recently, so there might be performance issues with workflows even on a single-container deployment. If this is something your organization is looking for, please send me a message and we can discuss.
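The single-model baseline suggested above could look something like this (the image tag, model name, and paths are assumptions for illustration):

```shell
# start a standalone TorchServe container as a baseline
docker run --rm -d --gpus all \
    -p 8080:8080 -p 8081:8081 \
    -v "$(pwd)/model_store:/home/model-server/model_store" \
    --name ts-baseline \
    pytorch/torchserve:0.10.0-gpu

# sanity-check the server, then time a single inference
# to compare against the numbers seen in swarm mode
curl http://localhost:8080/ping
time curl -X POST "http://localhost:8080/predictions/model-1" -T sample_input.jpg
```

If the timings here match the manager node but not the Swarm workers, that points at the cluster setup rather than TorchServe itself.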
Thanks, @agunapal for the quick response.
That is exactly my plan of action right now: test it out further with everything possible. I just wanted to see if anyone had tried this and run into issues with it. I'll let you know if I still see this problem.
Out of curiosity,
- Is there any plan to add scaling functionality to workflows in the near future?
- Regarding Kubernetes, have you tried it with multiple nodes in a cluster, or with just one?
Yes, I have. If you are using AWS, set up a cluster using https://github.com/aws-samples/aws-do-eks and then use this to launch TorchServe with a BERT model: https://github.com/aws-solutions-library-samples/guidance-for-machine-learning-inference-on-aws/pull/15
Ok, thanks for the info. I will look into this.
@agunapal thanks for your time.
I was able to get this done, so I'll be closing this. It turned out the issue was with the NFS share configuration, not with TorchServe or Docker Swarm.
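For anyone hitting the same symptom: before suspecting TorchServe, it's worth inspecting how the shared store is mounted on each node, since NFS caching and sync settings can make reads stall intermittently. A quick way to check (standard Linux tooling, nothing TorchServe-specific):

```shell
# list NFS mounts and their options on this node
mount -t nfs,nfs4

# per-mount NFS statistics, including the negotiated mount options
nfsstat -m
```

Comparing the mount options between the manager and the worker nodes is a good first step when only some nodes show the slowdown.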
That's awesome. Great to hear.