Docker Swarm with TorchServe workflows
I want to scale TorchServe workflows through Docker Swarm. (I hope this is possible; if not, please tell me how it can be achieved. I know workflow scaling is not supported by TorchServe directly yet, which is why I'm using Docker to scale the workflow.) I have a few questions about running TorchServe as a Docker service in swarm mode, along with a few issues I've encountered.
Problem Statement:
- We are using a TorchServe workflow because multiple models are required to complete the use case.
- To keep the nodes comparable, I've set the number of workers to 2 on each node so that memory consumption stays below 16GB and every node has the same number of workers and the same memory budget.
- When the service is created, the manager node works fine with the TorchServe config below and completes the task in the desired time, but when the manager assigns the task to either worker node it takes roughly 3x longer.
- The problem we are facing: while a TorchServe worker is executing on a worker node, it appears to run in intervals, i.e., GPU utilization is not continuous, log output stops, and the response is delayed. If another request arrives in the meantime, the worker appears to stop executing the current request and start executing the new one.
- I did see something in the logs (unfortunately, I'm unable to provide the logs here): when m5 was executing and a new request came in, the current request simply stopped (at least that's how it looked in the logs; no error was thrown) and the new one started. Correct me if I'm wrong, but the old request should keep executing in the background, right?
- Now, the question is: does TorchServe support routing requests through Docker Swarm?
- If so, what would be the correct configuration to achieve similar results on all the nodes in the swarm, not just the manager?
My Docker Swarm Config:
- 3 nodes: 1 manager, 2 workers
- The manager has 4 × V100 SXM2 GPUs (32GB each); each worker has 4 × V100 SXM2 GPUs (16GB each)
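For reference, a Swarm deployment like this can be described in a stack file. The sketch below is illustrative only; the image tag, mount paths, and service name are assumptions, not taken from my actual setup:

```yaml
# docker-stack.yaml -- illustrative sketch, not the actual deployment
version: "3.8"
services:
  torchserve:
    image: pytorch/torchserve:0.10.0-gpu   # assumed image tag
    ports:
      - "8080:8080"   # inference API
      - "8081:8081"   # management API
      - "8082:8082"   # metrics API
    volumes:
      # assumed NFS-backed shared stores so every node sees the same artifacts
      - /mnt/nfs/model_store:/home/model-server/model_store
      - /mnt/nfs/wf_store:/home/model-server/wf_store
    deploy:
      replicas: 3   # one container per node in a 3-node swarm
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: "gpu"
                value: 4
```

Deploying with `docker stack deploy -c docker-stack.yaml torchserve` would spread the replicas across the nodes. Note that GPU scheduling in Swarm also requires each node to advertise its GPUs as generic resources in the Docker daemon configuration.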
My project config: (Please ignore the large timeouts; a single inference request takes around 10 minutes because it processes over 100 images in a batch.)
- There are 5 models
- model-config.yaml

```yaml
maxBatchDelay: 10000000
responseTimeout: 10000000
```
- workflow.yaml

```yaml
models:
    min-workers: 1
    max-workers: 2
    max-batch-delay: 10000000
    retry-attempts: 1
    timeout-ms: 3000000
    m1:
        url: model-1.mar
    m2:
        url: model-2.mar
    m3:
        url: model-3.mar
    m4:
        url: model-4.mar
    m5:
        url: model-5.mar
dag:
    pre_processing: [m1]
    m1: [m2]
    m2: [m3]
    m3: [m4]
    m4: [m5]
    m5: [post_processing]
```
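For completeness, this is roughly how a workflow spec like the one above gets packaged, registered, and invoked (the workflow name, handler file, and input file below are placeholders, not my real names):

```shell
# package the workflow spec and handler into a .war archive
torch-workflow-archiver --workflow-name my_workflow \
    --spec-file workflow.yaml \
    --handler workflow_handler.py \
    --export-path wf_store -f

# register it through the management API (port 8081)
curl -X POST "http://localhost:8081/workflows?url=my_workflow.war"

# run an inference through the workflow inference endpoint (port 8080)
curl -X POST "http://localhost:8080/wfpredict/my_workflow" -T input.jpg
```

These commands assume a running TorchServe instance whose `workflow_store` points at `wf_store`.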
- config.properties

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
# management
default_response_timeout=10000000
default_workers_per_model=2
load_models=
model_store=model_store
workflow_store=wf_store
enable_envvars_config=true
job_queue_size=3
```
Python packages:

```
torch==1.13.1+cu117
torchvision==0.14.1+cu117
torchaudio==0.13.1+cu117
torchserve==0.10.0
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
nvgpu==0.10.0
captum==0.7.0
```
Hi @KD1994
This is not something we have tried. We do have Kubernetes and KServe support.
I would start with something simpler: a single model served through Docker Swarm, and check whether you still see these performance issues. Unfortunately, we haven't been actively developing workflows, as we haven't come across specific asks recently, so there might be performance issues with workflows even on a single-container deployment. If this is something your organization is looking for, please send me a message and we can discuss.
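The single-model baseline suggested above could look something like this (the image tag, model name, and paths are assumptions for illustration):

```shell
# start a standalone TorchServe container as a baseline
docker run --rm -d --gpus all \
    -p 8080:8080 -p 8081:8081 \
    -v "$(pwd)/model_store:/home/model-server/model_store" \
    --name ts-baseline \
    pytorch/torchserve:0.10.0-gpu

# sanity-check the server, then time a single inference
# to compare against the numbers seen in swarm mode
curl http://localhost:8080/ping
time curl -X POST "http://localhost:8080/predictions/model-1" -T sample_input.jpg
```

If the timings here match the manager node but not the Swarm workers, that points at the cluster setup rather than TorchServe itself.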
Thanks, @agunapal for the quick response.
That is exactly my plan of action right now: test it out further with everything possible. I just wanted to see if anyone had tried this and run into issues with it. I'll let you know if I still see this problem.
Out of curiosity,
- Is there any plan to add scaling functionality to workflows in the near future?
- Regarding Kubernetes, have you tried it with multiple nodes in a cluster, or with just one?
Yes, I have. If you are using AWS, set up a cluster using https://github.com/aws-samples/aws-do-eks and then use this to launch TorchServe with a BERT model: https://github.com/aws-solutions-library-samples/guidance-for-machine-learning-inference-on-aws/pull/15
Ok, thanks for the info. I will look into this.
@agunapal thanks for your time.
I was able to get this done, so I'll be closing this. It turned out the issue was with the NFS share configuration, not with TorchServe or Docker Swarm.
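For anyone hitting the same symptom: before suspecting TorchServe, it's worth inspecting how the shared store is mounted on each node, since NFS caching and sync settings can make reads stall intermittently. A quick way to check (standard Linux tooling, nothing TorchServe-specific):

```shell
# list NFS mounts and their options on this node
mount -t nfs,nfs4

# per-mount NFS statistics, including the negotiated mount options
nfsstat -m
```

Comparing the mount options between the manager and the worker nodes is a good first step when only some nodes show the slowdown.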
That's awesome. Great to hear.