Some collections stop responding while the server seems ok
Current Behavior
version: 1.9.1
We faced a strange unresponsiveness issue yesterday: some collections stopped responding to any request, including /search, /search/batch, /scroll, and even /collections/collection_name. I tried to fetch the collection status to get some insight, but that hung as well. I also tried curl on the server itself, so it should not be a network issue, and there were no timeout logs at all. We checked CPU, memory, and disk I/O too, and everything looked fine. To bring our app back to life quickly, we chose to reboot Qdrant.
Some related machine info is shown below; the peak in the middle was caused by a short query benchmark to make sure everything worked fine after the reboot:
Steps to Reproduce
Everything has looked good since we restarted it yesterday, so I don't know how to reproduce it. While digging through the logs I found one thing I don't understand, but which might be related:
2024-05-07T04:23:15.180072Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:23:28.601938Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:23:55.445584Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:24:49.132774Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:26:36.507051Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:26:57.573629Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1960 "-" "python-httpx/0.25.2" 0.003233
2024-05-07T04:26:57.911320Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1773 "-" "python-httpx/0.25.2" 0.002333
2024-05-07T04:27:01.561033Z INFO actix_web::middleware::logger: 10.0.8.16 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000456
2024-05-07T04:27:04.309453Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 16240 "-" "python-httpx/0.25.2" 0.013256
2024-05-07T04:30:11.255518Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:30:35.034610Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:30:48.456474Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:31:15.300084Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:32:01.525581Z INFO actix_web::middleware::logger: 10.0.10.11 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000441
2024-05-07T04:32:08.987269Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:33:56.361553Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:37:01.584066Z INFO actix_web::middleware::logger: 10.0.10.198 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000424
2024-05-07T04:37:14.223721Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1722 "-" "python-httpx/0.25.2" 0.002675
2024-05-07T04:37:31.110006Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:38:01.258506Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:38:14.680405Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:38:41.524051Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:39:35.211267Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:41:22.585551Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:42:01.696012Z INFO actix_web::middleware::logger: 10.0.10.11 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000374
2024-05-07T04:44:23.131041Z INFO actix_web::middleware::logger: 3.113.30.49 "GET /dashboard/logo192.png HTTP/1.1" 200 8870 "-" "Slack-ImgProxy (+https://api.slack.com/robots)" 0.008243
2024-05-07T04:44:57.334028Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:45:44.079306Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:45:57.501191Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:46:24.344836Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:47:01.592443Z INFO actix_web::middleware::logger: 10.0.8.16 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000460
2024-05-07T04:47:18.032024Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:47:58.042329Z INFO actix_web::middleware::logger: 10.0.10.99 "POST /collections/entities/points/search/batch HTTP/1.1" 200 3321 "-" "python-httpx/0.25.2" 0.003748
2024-05-07T04:47:58.353717Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1184 "-" "python-httpx/0.25.2" 0.003491
2024-05-07T04:47:58.354665Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1620 "-" "python-httpx/0.25.2" 0.004120
The message "Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates." also appears in today's logs. I'm not sure what keeps producing these messages or what issues they might lead to.
Hope you can provide some solution or insight to figure this out, thanks!
hey @hugh2slowmo, thanks for reporting! Will try to reproduce it on our side
Question: what requests are you sending to Qdrant when this problem comes up? Is it just batch searches as shown in your log? Or are you also sending other operations?
Also, are you making snapshots by any chance?
hi @timvisee
what requests are you sending to Qdrant when this problem comes up? Is it just batch searches as shown in your log? Or are you also sending other operations?
The batch searches shown in the logs are requests to the 'entities' collection, which was performing normally at that time. We noticed this issue around 2024-05-07 02:30 UTC; to map that to the image, which uses CST, add 8 hours, so around 10:30 — sorry about the confusing timeline. Due to a deployment and some other reasons we lost some logs, so it seems I provided unrelated logs to you, my bad. The unresponsiveness is on a collection called 'texts'; our app mostly sends points/search and points/search/batch requests to it, and we also tried sending /collections/texts and points/scroll via the dashboard console and via curl, both remotely and from the server's localhost. The strangest thing is that no logs show up at all. At first I suspected a network issue, so I tried curl from the server itself, i.e. curl http://localhost:6333/collections/texts, and it also did not respond, while some other collections kept serving normally for both internal and external traffic, so I have no idea what is happening. For that reason, I cannot provide the collection status and info from that time.
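For what it's worth, the probes we ran from the server boil down to something like the sketch below (simplified, not the exact commands we used; it uses httpx, which our clients already use, and the 5-second timeout is arbitrary — the real checks were plain curl and the dashboard console):

import httpx

# Simplified probe sketch: hit the collection info and scroll endpoints with a
# short client-side timeout so a hang shows up as a timeout instead of
# blocking forever.
base = "http://localhost:6333"
for path in ("/collections/texts", "/collections/entities"):
    try:
        r = httpx.get(base + path, timeout=5.0)
        print(path, r.status_code)
    except httpx.TimeoutException:
        print(path, "timed out after 5s")

try:
    r = httpx.post(base + "/collections/texts/points/scroll", json={"limit": 1}, timeout=5.0)
    print("/points/scroll", r.status_code)
except httpx.TimeoutException:
    print("/points/scroll", "timed out after 5s")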
Also, are you making snapshots by any chance?
We normally take snapshots at midnight, triggered by ops scripts, and I didn't see any snapshot request logs at least.
Maybe it's not easy to reproduce from the information I provided. Should we use the TRACE log level in production? I guess it might be helpful when something bizarre happens.
Is there any performance impact from setting that?
Thank you for elaborating.
Yeah, this is definitely hitting a deadlock. It means things get stuck and access to a specific collection is blocked. That's why you're seeing the above-mentioned behavior. Of course, this should never happen no matter what requests you send to it.
It looks like I've been able to reproduce it somewhat reliably locally, which makes it easier to debug this.
You can enable trace logging, but I'm not sure how valuable it might be. It should not have a significant impact on performance, but there will be a lot more logs to store. Maybe we can give this a try later if I cannot resolve this locally.
I've located and fixed a deadlock issue in https://github.com/qdrant/qdrant/pull/4206, and am quite confident it'll fix the issue you're seeing as well.
Once it's merged, our plan is to release a patch shortly after. Once that's done, maybe you can try to upgrade to see whether you see the same issue again. :crossed_fingers:
Well done Tim, we'll give it a try when the next patch has been released.
I've just published the release: https://github.com/qdrant/qdrant/releases/tag/v1.9.2
Please let us know if it resolves the problems you're experiencing.
@timvisee Hey Tim, it seems we've hit the issue again; we're now running v1.9.2
curl localhost:6333
{"title":"qdrant - vector search engine","version":"1.9.2","commit":"34f7f8ec2d8455c0b3fe0f1fb134907ba7816a08"}
- I also checked for slow queries with a response time greater than 1 second, but only found a few across multiple days; log times use the default UTC+0 (a rough sketch of the filter is at the end of this list):
"2024-05-22T07:23:19.627902Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/similarity_dispatcher_examples/points/search/batch HTTP/1.1\" 200 313 \"-\" \"python-httpx/0.25.2\" 1.988359 \n"
"2024-05-22T07:23:19.750302Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/similarity_dispatcher_examples/points/search/batch HTTP/1.1\" 200 767 \"-\" \"python-httpx/0.25.2\" 2.453608 \n"
"2024-05-22T07:23:20.177431Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/similarity_dispatcher_examples/points/search/batch HTTP/1.1\" 200 908 \"-\" \"python-httpx/0.25.2\" 2.654047 \n"
"2024-05-24T02:52:01.997137Z INFO actix_web::middleware::logger: 10.0.10.28 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 11457 \"-\" \"python-httpx/0.25.2\" 2.598561 \n"
"2024-05-25T00:39:53.713344Z INFO actix_web::middleware::logger: 10.0.10.48 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 12346 \"-\" \"python-httpx/0.25.2\" 2.015387 \n"
"2024-05-26T03:07:59.902028Z INFO actix_web::middleware::logger: 10.0.10.48 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 10301 \"-\" \"python-httpx/0.25.2\" 1.368371 \n"
"2024-05-27T00:43:35.108451Z INFO actix_web::middleware::logger: 10.0.10.48 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 13381 \"-\" \"python-httpx/0.25.2\" 1.037464 \n"
"2024-05-27T04:12:22.801538Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 121083 \"-\" \"python-httpx/0.25.2\" 2.275867 \n"
"2024-05-27T04:12:45.348556Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 104146 \"-\" \"python-httpx/0.25.2\" 1.028009 \n"
"2024-05-27T04:13:22.647102Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 112274 \"-\" \"python-httpx/0.25.2\" 1.946678 \n"
"2024-05-27T04:24:06.112081Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 88231 \"-\" \"python-httpx/0.25.2\" 2.080808 \n"
"2024-05-27T04:25:44.231897Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 56028 \"-\" \"python-httpx/0.25.2\" 1.056930 \n"
"2024-05-27T04:26:01.933222Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 57710 \"-\" \"python-httpx/0.25.2\" 1.935638 \n"
"2024-05-30T05:39:29.992428Z INFO actix_web::middleware::logger: 10.0.9.92 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 8447 \"-\" \"python-httpx/0.25.2\" 24.165537 \n"
- read-lock warning logs appear:
"2024-05-18T01:40:19.350616Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-18T01:40:32.772502Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-18T01:40:56.198888Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-18T01:41:09.620751Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:32:04.972081Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:32:18.393947Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:32:45.237588Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:33:38.924771Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:35:26.299061Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:39:01.047518Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:43:33.908206Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:43:47.330069Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:44:14.173709Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:45:07.860895Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:46:55.235176Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:50:29.983639Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:56:10.275125Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:56:23.696990Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:56:50.540626Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:57:44.227817Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:59:31.602091Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T06:03:06.350550Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
- cannot find any timeout logs
- the picture below shows some health checks calling /healthz; the server node has been unhealthy since 13:40 (times in it are UTC+8)
- I confirm there were no snapshot operations; the last one had finished several hours earlier
- machine resources look good (times below are also UTC+8); we rebooted the server node around 14:05 UTC+8, which caused the disk I/O read increase in img 4, and the lower peak around 8:30 UTC+8 is the moment we took the snapshot
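The slow-query filter mentioned above was nothing fancy, roughly the sketch below (an assumption-laden sketch, not our exact script; it relies on the actix access-log format shown earlier, where the last field is the response time in seconds):

import sys

# Rough sketch of the slow-request filter: keep only actix access-log lines
# whose trailing duration field (response time in seconds) exceeds 1 second.
for line in sys.stdin:
    if "actix_web::middleware::logger" not in line:
        continue
    fields = line.split()
    try:
        duration = float(fields[-1])
    except (IndexError, ValueError):
        continue
    if duration > 1.0:
        print(line.rstrip())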
Just let me know if any other information can help figure this out, thanks!
That's very unfortunate. Thank you for sharing all these details!
Just let me know if any other information can help figure this out, thanks!
Could you elaborate once more on what kind of operations you're running? In the above log I only see batch searches. What kind of update operations are you running, let's say, in the past 12 hours?
We'd love to hear about any leads that may help us reproduce this. Even if it seems insignificant.
I confirm there were no snapshot operations; the last one had finished several hours earlier
I assume this also means there were none still running in the background.
@hugh2slowmo We've identified a case where the above issue can happen. It shows the same symptoms, such as that log message you shared.
We've prepared a patch here, and plan to release a patch version including it in the upcoming days.
Again, sorry for the inconvenience.
We've released Qdrant 1.9.5 now which resolves this issue.
Since it is fixed, I'm closing this now. If you believe there's still a problem, please feel free to open it again.
Thank you for the detailed report!
@timvisee Sorry for the late reply, I've been tied up for the past few weeks, but fortunately our Qdrant has been running well since the last reboot.
Could you elaborate once more on what kind of operations you're running? In the above log I only see batch searches. What kind of update operations are you running, let's say, in the past 12 hours?
Mostly we do insertions and searches, sometimes delete points, and rarely update points.
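In qdrant_client terms, our traffic is roughly the mix below (a simplified sketch only; the collection name, IDs, vectors, and payloads are made up for illustration):

from qdrant_client import QdrantClient, models

# Simplified sketch of our workload mix (names, IDs, vectors, and payloads are
# made up): mostly inserts and searches, occasional deletes, rare updates.
client = QdrantClient(url="http://localhost:6333")

# insertion
client.upsert(
    collection_name="texts",
    points=[models.PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"source": "doc-1"})],
)

# search (single and batch searches are what show up in the access logs above)
client.search(collection_name="texts", query_vector=[0.1, 0.2, 0.3, 0.4], limit=5)

# occasional delete by id
client.delete(collection_name="texts", points_selector=models.PointIdsList(points=[1]))

# rare payload update
client.set_payload(collection_name="texts", payload={"source": "doc-1-rev2"}, points=[1])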
We'll give the new version a try at a proper time later; thanks for your excellent work!
Hi, looks like this version is still not available in the latest helm chart @timvisee !
Hi, looks like this version is still not available in the latest helm chart @timvisee !
I see! Thanks for the heads up. I'll bump our cloud team to update it for Qdrant 1.9.5.
The Helm chart has now been updated: https://github.com/qdrant/qdrant-helm/releases
Hi @timvisee I'm running on 1.10.1 and encountering a similar error still, unfortunately.
On the server side, I get: WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
And on the client side, I get the following Python traceback:
Traceback (most recent call last):
File "/home/joshua/vector-database/pyupload_contigs.py", line 83, in <module>
client.upsert(
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/qdrant_client/qdrant_client.py", line 1180, in upsert
return self._client.upsert(
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/qdrant_client/qdrant_remote.py", line 1540, in upsert
grpc_result = self.grpc_points.Upsert(
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_interceptor.py", line 277, in __call__
response, ignored_call = self._with_call(
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_interceptor.py", line 332, in _with_call
return call.result(), call
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 440, in result
raise self
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_interceptor.py", line 315, in continuation
response, call = self._thunk(new_method).with_call(
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1198, in with_call
return _end_unary_response_blocking(state, call, True, None)
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-07-23T15:52:33.58710038+00:00", grpc_status:4, grpc_message:"Deadline Exceeded"}"
The workflow that causes this is repeated upserts (I'm uploading a dataset into Qdrant). I have some Rust code that I was planning to experiment with, but it's not trivial because the dataset library is easiest to load using Python.
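The upload loop itself is essentially repeated batched upserts, along the lines of the sketch below (simplified; the collection name, batch layout, and the 60-second client timeout are illustrative — the real script loads batches from a Python dataset library):

from qdrant_client import QdrantClient, models

# Simplified sketch of the upload workflow that hits the DEADLINE_EXCEEDED
# error above (the real script pulls batches from a dataset library; the
# collection name, batch layout, and timeout here are illustrative).
client = QdrantClient(host="localhost", grpc_port=6334, prefer_grpc=True, timeout=60)

def upload(batches):
    for ids, vectors, payloads in batches:
        client.upsert(
            collection_name="contigs",
            points=[
                models.PointStruct(id=i, vector=v, payload=p)
                for i, v, p in zip(ids, vectors, payloads)
            ],
            wait=True,  # the gRPC deadline is exceeded while waiting on this call
        )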