"Already running a prediction" When hitting multiple requests
This is not exactly an issue, so let me describe the situation:
I am running a Cog container locally and want to process multiple requests at once. However, when I sent 100 requests at once, it returned output for 20 of them and "Already running a prediction" for the rest, even though my system utilisation was very low. How can I process these in parallel?
I am using an image similarity model with ViT, and it uses the GPU.
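For reference, I'm firing the requests against Cog's /predictions endpoint roughly like this (the input field name and image URLs are specific to my model, adjust for yours):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:5000/predictions"  # cog's local HTTP endpoint

def predict(i):
    # "image" is my model's input field; yours may differ.
    resp = requests.post(URL, json={"input": {"image": f"https://example.com/{i}.jpg"}})
    return resp.status_code

with ThreadPoolExecutor(max_workers=100) as pool:
    statuses = list(pool.map(predict, range(100)))

# Most of the 100 requests come back rejected rather than queued.
print(sum(1 for s in statuses if s != 200), "rejected")
```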
Ditto, I'm trying to figure this out as well, using the latest beta version. I've tried setting threads:
docker run -d -p 5000:5000 <container> python -m cog.server.http --threads=8
But I still keep hitting the "Already running a prediction" error.
I thought it was only because of the GPU that we can make one prediction at a time, but it's the same for CPU-only models as well.
There's a new version https://github.com/replicate/cog/releases/tag/v0.9.0-beta9 that has support for async predictor functions. That might help?
cc @technillogue
We hope to roll out concurrent predictions in the coming months, but 0.9.0b9 only allows async def predict, not concurrent predictions.
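To be concrete, an async def predict in 0.9.0b9 looks something like this (a minimal sketch with a stand-in for the model; a real predictor would load weights in setup):

```python
import asyncio

from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def setup(self):
        # A real predictor would load model weights here; a fixed delay
        # keeps this sketch self-contained.
        self.delay = 1.0

    async def predict(self, text: str = Input(description="Text to process")) -> str:
        # async def lets predict await I/O without blocking the server's
        # event loop, but the server still runs one prediction at a time.
        await asyncio.sleep(self.delay)  # stand-in for real model work
        return text.upper()
```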
The threads argument controls how many HTTP requests can be served concurrently, but right now, unless I'm mistaken, Predictor.predict can still only run one prediction at a time.
Even if that weren't the case, it's very hard to use torch and get true GPU concurrency without ultimately implementing something like batching or microbatching. For now, if you can implement batching yourself, that's best; the sketch below shows the general shape.
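Something like this (a generic microbatching pattern, not cog's API; the batch size, timeout, and doubling "model" are placeholders):

```python
import queue
import threading

BATCH_SIZE = 8          # max requests per forward pass (placeholder)
BATCH_TIMEOUT_S = 0.05  # how long to wait for stragglers (placeholder)

def model_forward(inputs):
    # Stand-in for one batched GPU forward pass,
    # e.g. model(torch.stack(inputs)).
    return [x * 2 for x in inputs]

def batch_worker(requests):
    while True:
        batch = [requests.get()]  # block until the first request arrives
        try:
            # Collect more requests until the batch is full or the timeout expires.
            while len(batch) < BATCH_SIZE:
                batch.append(requests.get(timeout=BATCH_TIMEOUT_S))
        except queue.Empty:
            pass
        inputs = [item for item, _ in batch]
        for (_, reply), output in zip(batch, model_forward(inputs)):
            reply.put(output)  # hand each caller its own result

requests = queue.Queue()
threading.Thread(target=batch_worker, args=(requests,), daemon=True).start()

# Each caller submits (input, reply_queue) and blocks on its own reply queue.
reply = queue.Queue()
requests.put((21, reply))
print(reply.get())  # -> 42
```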
Okay, thanks for the update. In my case, what I've done is set up ~5 Docker containers on separate ports and then used nginx to load-balance between them. This allows me to have up to 5 ongoing predictions at any given time.
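The relevant nginx fragment looks roughly like this (ports and the listen port are whatever your containers map to):

```nginx
# Illustrative upstream: one entry per cog container.
upstream cog_backends {
    least_conn;  # route to the least-busy container, since each runs one prediction at a time
    server 127.0.0.1:5000;
    server 127.0.0.1:5001;
    server 127.0.0.1:5002;
    server 127.0.0.1:5003;
    server 127.0.0.1:5004;
}

server {
    listen 8080;
    location / {
        proxy_pass http://cog_backends;
    }
}
```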
@technillogue +1 to concurrent predictions
+1 to concurrent predictions!