MLServer custom runtime is slower than Python Wrapper in Seldon-core
Hi,
I have been benchmarking MLServer custom runtimes against the Python wrapper API in seldon-core, using the same model and the same resources, and found that the seldon-core Python wrapper gives higher throughput and lower latency. I wasn't using dynamic batching for MLServer, but its performance was still much worse, and it also tends to hang when the number of concurrent users is high, whereas the seldon-core Python wrapper handled the same load without issue.
This seems a bit counterintuitive, since MLServer is powered by FastAPI whereas the seldon-core Python wrapper is powered by Flask. It got me thinking that the issue might be the use of uvicorn directly as the process manager instead of gunicorn, which is what FastAPI's documentation recommends here. I'm not sure which process manager is being used in either case, but I thought I'd reach out to understand the difference in performance between the two serving methods.
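For reference, the setup FastAPI's docs describe is running gunicorn as the process manager with uvicorn worker processes. Below is a minimal sketch of what that looks like; the module name `main:app`, worker count, and port are placeholders, and this is not necessarily how either server wires up its workers:

```python
# gunicorn.conf.py -- gunicorn config files are plain Python modules.
# Everything below is illustrative only.

workers = 2                                      # number of worker processes
worker_class = "uvicorn.workers.UvicornWorker"   # run each worker as an ASGI (uvicorn) worker
bind = "0.0.0.0:8080"                            # address/port to listen on

# Start it with (assuming the ASGI app is exposed as `app` in main.py):
#   gunicorn main:app -c gunicorn.conf.py
```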
Hey @hseelawi ,
Thanks a lot for your report.
I think you raise a good point on gunicorn vs uvicorn, although the results you're getting seem to differ from our internal tests.
Do you have a benchmark that we can have a look at?
Hi @adriangonz,
Thanks for the kind reply. I am attaching two reports (python-wrapper and mlserver) generated with Locust. They are attached as CSV files, mainly because GitHub won't let you upload HTML.
The results above were obtained after increasing the resource requests (and limits) to 4 CPUs and 4Gi of memory, with two replicas, because I suspected MLServer's hanging could be due to resource starvation (and it was indeed). However, it is still performing slightly worse than the Python wrapper. Adaptive batching yielded great results (file), but I was wondering whether running MLServer under gunicorn with uvicorn-compatible workers might improve it even without adaptive batching.
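For anyone else landing here, adaptive batching in MLServer is configured per model in `model-settings.json`; a minimal sketch follows. The model name, implementation path, and the batch size/timeout values are placeholders, not the ones behind the numbers reported above:

```python
# write_model_settings.py -- sketch of an MLServer model-settings.json with
# adaptive batching enabled. All values below are illustrative.
import json

model_settings = {
    "name": "my-model",                              # hypothetical model name
    "implementation": "my_runtime.MyCustomRuntime",  # hypothetical custom runtime class
    "max_batch_size": 32,    # group up to 32 queued requests into one batch
    "max_batch_time": 0.01,  # or flush the batch after 10 ms, whichever comes first
}

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```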
The stress-testing scenario was as follows: a total duration of 2 hours, starting with a single user and adding one user every minute. Each concurrent user fires its next request immediately after it receives a response from the API.
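In case it helps with reproducing this, that ramp can be expressed in Locust with a custom `LoadTestShape`. This is only a sketch; the endpoint path and payload are placeholders rather than the actual requests used for the attached reports:

```python
# locustfile.py -- minimal sketch of the ramp-up described above.
from locust import HttpUser, LoadTestShape, constant, task


class InferenceUser(HttpUser):
    wait_time = constant(0)  # fire the next request as soon as a response arrives

    @task
    def predict(self):
        # Placeholder endpoint and payload; swap in the real inference request.
        self.client.post("/v2/models/my-model/infer", json={"inputs": []})


class StepRamp(LoadTestShape):
    """Start at 1 user and add one user per minute, for 2 hours in total."""

    def tick(self):
        run_time = self.get_run_time()
        if run_time > 2 * 60 * 60:
            return None  # stop the test after 2 hours
        users = 1 + int(run_time // 60)
        return users, 1  # (target user count, spawn rate per second)
```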
One thing I forgot to mention: I used two workers per replica for both MLServer and seldon-core.