Some collections stop responding while the server seems ok
Current Behavior
version: 1.9.1
We faced a strange unresponsiveness issue yesterday: some collections stopped responding to any request, including /search, /search/batch, /scroll, and even /collections/collection_name. I tried to fetch the collection status to get some insight, but that hung as well. I also tried curl on the server itself, so it should not be a network issue, and there were no timeout logs at all. We checked CPU, memory, and disk I/O too, and everything looked fine. To bring our app back to life quickly, we chose to reboot Qdrant.
Some related machine info is shown below; the peak in the middle was caused by a short query benchmark to make sure everything worked fine after the reboot:
Steps to Reproduce
Everything has looked good since we restarted it yesterday, so I don't know how to reproduce it. While digging through the logs I found one thing I don't understand, but which might be related:
2024-05-07T04:23:15.180072Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:23:28.601938Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:23:55.445584Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:24:49.132774Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:26:36.507051Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:26:57.573629Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1960 "-" "python-httpx/0.25.2" 0.003233
2024-05-07T04:26:57.911320Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1773 "-" "python-httpx/0.25.2" 0.002333
2024-05-07T04:27:01.561033Z INFO actix_web::middleware::logger: 10.0.8.16 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000456
2024-05-07T04:27:04.309453Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 16240 "-" "python-httpx/0.25.2" 0.013256
2024-05-07T04:30:11.255518Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:30:35.034610Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:30:48.456474Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:31:15.300084Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:32:01.525581Z INFO actix_web::middleware::logger: 10.0.10.11 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000441
2024-05-07T04:32:08.987269Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:33:56.361553Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:37:01.584066Z INFO actix_web::middleware::logger: 10.0.10.198 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000424
2024-05-07T04:37:14.223721Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1722 "-" "python-httpx/0.25.2" 0.002675
2024-05-07T04:37:31.110006Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:38:01.258506Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:38:14.680405Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:38:41.524051Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:39:35.211267Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:41:22.585551Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:42:01.696012Z INFO actix_web::middleware::logger: 10.0.10.11 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000374
2024-05-07T04:44:23.131041Z INFO actix_web::middleware::logger: 3.113.30.49 "GET /dashboard/logo192.png HTTP/1.1" 200 8870 "-" "Slack-ImgProxy (+https://api.slack.com/robots)" 0.008243
2024-05-07T04:44:57.334028Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:45:44.079306Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:45:57.501191Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:46:24.344836Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:47:01.592443Z INFO actix_web::middleware::logger: 10.0.8.16 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000460
2024-05-07T04:47:18.032024Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
2024-05-07T04:47:58.042329Z INFO actix_web::middleware::logger: 10.0.10.99 "POST /collections/entities/points/search/batch HTTP/1.1" 200 3321 "-" "python-httpx/0.25.2" 0.003748
2024-05-07T04:47:58.353717Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1184 "-" "python-httpx/0.25.2" 0.003491
2024-05-07T04:47:58.354665Z INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1620 "-" "python-httpx/0.25.2" 0.004120
The message "Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates." also appears in today's logs. I'm not sure what keeps producing these messages or what issues they might lead to.
Hope you can provide some solution or insight to figure this out, thanks!
hey @hugh2slowmo, thanks for reporting! Will try to reproduce it on our side
Question: what requests are you sending to Qdrant when this problem comes up? Is it just batch searches as shown in your log? Or are you also sending other operations?
Also, are you making snapshots by any chance?
hi @timvisee
what requests are you sending to Qdrant when this problem comes up? Is it just batch searches as shown in your log? Or are you also sending other operations?
The batch searches shown in the logs are requests to the 'entities' collection, which was performing normally at that time. We noticed this issue around 2024-05-07 02:30 UTC; to map that to the image, which uses CST, add 8 hours, so around 10:30 — sorry about the confusing timeline. Due to a deployment and some other reasons we lost some logs, so it seems I provided unrelated logs to you, my bad. The unresponsiveness is on a collection called 'texts'; our app mostly sends points/search and points/search/batch requests to it, and we also tried sending /collections/texts and points/scroll via the dashboard console and via curl, both remotely and from the server's localhost. The strangest thing is that no logs show up at all. At first I suspected a network issue, so I tried curl from the server itself, i.e. curl http://localhost:6333/collections/texts, and it also did not respond, while some other collections kept serving normally for both internal and external traffic, so I have no idea what is happening. For that reason, I cannot provide the collection status and info from that time.
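For what it's worth, the probes we ran from the server boil down to something like the sketch below (simplified, not the exact commands we used; it uses httpx, which our clients already use, and the 5-second timeout is arbitrary — the real checks were plain curl and the dashboard console):

import httpx

# Simplified probe sketch: hit the collection info and scroll endpoints with a
# short client-side timeout so a hang shows up as a timeout instead of
# blocking forever.
base = "http://localhost:6333"
for path in ("/collections/texts", "/collections/entities"):
    try:
        r = httpx.get(base + path, timeout=5.0)
        print(path, r.status_code)
    except httpx.TimeoutException:
        print(path, "timed out after 5s")

try:
    r = httpx.post(base + "/collections/texts/points/scroll", json={"limit": 1}, timeout=5.0)
    print("/points/scroll", r.status_code)
except httpx.TimeoutException:
    print("/points/scroll", "timed out after 5s")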
Also, are you making snapshots by any chance?
We normally take snapshots at midnight, triggered by ops scripts, and I didn't see any snapshot request logs at least.
Maybe it's not easy to reproduce from the information I provided. Should we use the TRACE log level in production? I guess it might be helpful when something bizarre happens.
Is there any performance impact from setting that?
Thank you for elaborating.
Yeah, this is definitely hitting a deadlock. It means things get stuck and access to a specific collection is blocked. That's why you're seeing the above-mentioned behavior. Of course, this should never happen no matter what requests you send to it.
It looks like I've been able to reproduce it somewhat reliably locally, which makes it easier to debug this.
You can enable trace logging, but I'm not sure how valuable it might be. It should not have a significant impact on performance, but there will be a lot more logs to store. Maybe we can give this a try later if I cannot resolve this locally.
I've located and fixed a deadlock issue in https://github.com/qdrant/qdrant/pull/4206, and am quite confident it'll fix the issue you're seeing as well.
Once it's merged, our plan is to release a patch shortly after. Once that's done, maybe you can try to upgrade to see whether you see the same issue again. :crossed_fingers:
Well done Tim, we'll give it a try when the next patch has been released.
I've just published the release: https://github.com/qdrant/qdrant/releases/tag/v1.9.2
Please let us know if it resolves the problems you're experiencing.
@timvisee Hey Tim, it seems we've hit the issue again; we're now running v1.9.2
curl localhost:6333
{"title":"qdrant - vector search engine","version":"1.9.2","commit":"34f7f8ec2d8455c0b3fe0f1fb134907ba7816a08"}
- I also checked for slow queries with a response time greater than 1 second, but only found a few across multiple days; log times use the default UTC+0 (a rough sketch of the filter is at the end of this list):
"2024-05-22T07:23:19.627902Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/similarity_dispatcher_examples/points/search/batch HTTP/1.1\" 200 313 \"-\" \"python-httpx/0.25.2\" 1.988359 \n"
"2024-05-22T07:23:19.750302Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/similarity_dispatcher_examples/points/search/batch HTTP/1.1\" 200 767 \"-\" \"python-httpx/0.25.2\" 2.453608 \n"
"2024-05-22T07:23:20.177431Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/similarity_dispatcher_examples/points/search/batch HTTP/1.1\" 200 908 \"-\" \"python-httpx/0.25.2\" 2.654047 \n"
"2024-05-24T02:52:01.997137Z INFO actix_web::middleware::logger: 10.0.10.28 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 11457 \"-\" \"python-httpx/0.25.2\" 2.598561 \n"
"2024-05-25T00:39:53.713344Z INFO actix_web::middleware::logger: 10.0.10.48 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 12346 \"-\" \"python-httpx/0.25.2\" 2.015387 \n"
"2024-05-26T03:07:59.902028Z INFO actix_web::middleware::logger: 10.0.10.48 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 10301 \"-\" \"python-httpx/0.25.2\" 1.368371 \n"
"2024-05-27T00:43:35.108451Z INFO actix_web::middleware::logger: 10.0.10.48 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 13381 \"-\" \"python-httpx/0.25.2\" 1.037464 \n"
"2024-05-27T04:12:22.801538Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 121083 \"-\" \"python-httpx/0.25.2\" 2.275867 \n"
"2024-05-27T04:12:45.348556Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 104146 \"-\" \"python-httpx/0.25.2\" 1.028009 \n"
"2024-05-27T04:13:22.647102Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 112274 \"-\" \"python-httpx/0.25.2\" 1.946678 \n"
"2024-05-27T04:24:06.112081Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 88231 \"-\" \"python-httpx/0.25.2\" 2.080808 \n"
"2024-05-27T04:25:44.231897Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 56028 \"-\" \"python-httpx/0.25.2\" 1.056930 \n"
"2024-05-27T04:26:01.933222Z INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 57710 \"-\" \"python-httpx/0.25.2\" 1.935638 \n"
"2024-05-30T05:39:29.992428Z INFO actix_web::middleware::logger: 10.0.9.92 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 8447 \"-\" \"python-httpx/0.25.2\" 24.165537 \n"
- read-lock warning logs appear:
"2024-05-18T01:40:19.350616Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-18T01:40:32.772502Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-18T01:40:56.198888Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-18T01:41:09.620751Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:32:04.972081Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:32:18.393947Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:32:45.237588Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:33:38.924771Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:35:26.299061Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:39:01.047518Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:43:33.908206Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:43:47.330069Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:44:14.173709Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:45:07.860895Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:46:55.235176Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:50:29.983639Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:56:10.275125Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:56:23.696990Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:56:50.540626Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:57:44.227817Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T05:59:31.602091Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
"2024-05-30T06:03:06.350550Z WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates. \n"
- cannot find any timeout logs
- the picture below shows some health checks calling /healthz; the server node has been unhealthy since 13:40 (times in it are UTC+8)
- I confirm there were no snapshot operations; the last one had finished several hours earlier
- machine resources look good (times below are also UTC+8); we rebooted the server node around 14:05 UTC+8, which caused the disk I/O read increase in img 4, and the lower peak around 8:30 UTC+8 is the moment we took the snapshot
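The slow-query filter mentioned above was nothing fancy, roughly the sketch below (an assumption-laden sketch, not our exact script; it relies on the actix access-log format shown earlier, where the last field is the response time in seconds):

import sys

# Rough sketch of the slow-request filter: keep only actix access-log lines
# whose trailing duration field (response time in seconds) exceeds 1 second.
for line in sys.stdin:
    if "actix_web::middleware::logger" not in line:
        continue
    fields = line.split()
    try:
        duration = float(fields[-1])
    except (IndexError, ValueError):
        continue
    if duration > 1.0:
        print(line.rstrip())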
Just let me know if any other information can help figure this out, thanks!
That's very unfortunate. Thank you for sharing all these details!
Just let me know if any other information can help figure this out, thanks!
Could you elaborate once more on what kind of operations you're running? In the above log I only see batch searches. What kind of update operations are you running, let's say, in the past 12 hours?
We'd love to hear about any leads that may help us reproduce this. Even if it seems insignificant.
I confirm there were no snapshot operations; the last one had finished several hours earlier
I assume this also means there were none still running in the background.
@hugh2slowmo We've identified a case where the above issue can happen. It shows the same symptoms, such as that log message you shared.
We've prepared a patch here, and plan to release a patch version including it in the upcoming days.
Again, sorry for the inconvenience.
We've released Qdrant 1.9.5 now which resolves this issue.
Since it is fixed, I'm closing this now. If you believe there's still a problem, please feel free to open it again.
Thank you for the detailed report!
@timvisee Sorry for the late reply, I've been tied up for the past few weeks, but fortunately our Qdrant has been running well since the last reboot.
Could you elaborate once more on what kind of operations you're running? In the above log I only see batch searches. What kind of update operations are you running, let's say, in the past 12 hours?
Mostly we do insertions and searches, sometimes delete points, and rarely update points.
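In qdrant_client terms, our traffic is roughly the mix below (a simplified sketch only; the collection name, IDs, vectors, and payloads are made up for illustration):

from qdrant_client import QdrantClient, models

# Simplified sketch of our workload mix (names, IDs, vectors, and payloads are
# made up): mostly inserts and searches, occasional deletes, rare updates.
client = QdrantClient(url="http://localhost:6333")

# insertion
client.upsert(
    collection_name="texts",
    points=[models.PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"source": "doc-1"})],
)

# search (single and batch searches are what show up in the access logs above)
client.search(collection_name="texts", query_vector=[0.1, 0.2, 0.3, 0.4], limit=5)

# occasional delete by id
client.delete(collection_name="texts", points_selector=models.PointIdsList(points=[1]))

# rare payload update
client.set_payload(collection_name="texts", payload={"source": "doc-1-rev2"}, points=[1])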
We'll give the new version a try at a proper time later; thanks for your excellent work!
Hi, looks like this version is still not available in the latest helm chart @timvisee !
Hi, looks like this version is still not available in the latest helm chart @timvisee !
I see! Thanks for the heads up. I'll bump our cloud team to update it for Qdrant 1.9.5.
The Helm chart has now been updated: https://github.com/qdrant/qdrant-helm/releases
Hi @timvisee I'm running on 1.10.1 and encountering a similar error still, unfortunately.
On the server side, I get: WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.
And on the client side, I get the following Python traceback:
Traceback (most recent call last):
File "/home/joshua/vector-database/pyupload_contigs.py", line 83, in <module>
client.upsert(
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/qdrant_client/qdrant_client.py", line 1180, in upsert
return self._client.upsert(
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/qdrant_client/qdrant_remote.py", line 1540, in upsert
grpc_result = self.grpc_points.Upsert(
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_interceptor.py", line 277, in __call__
response, ignored_call = self._with_call(
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_interceptor.py", line 332, in _with_call
return call.result(), call
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 440, in result
raise self
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_interceptor.py", line 315, in continuation
response, call = self._thunk(new_method).with_call(
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1198, in with_call
return _end_unary_response_blocking(state, call, True, None)
File "/home/joshua/vector-database/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-07-23T15:52:33.58710038+00:00", grpc_status:4, grpc_message:"Deadline Exceeded"}"
The workflow that causes this is repeated upserts (I'm uploading a dataset into Qdrant). I have some Rust code that I was planning to experiment with, but it's not trivial because the dataset library is easiest to load using Python.
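The upload loop itself is essentially repeated batched upserts, along the lines of the sketch below (simplified; the collection name, batch layout, and the 60-second client timeout are illustrative — the real script loads batches from a Python dataset library):

from qdrant_client import QdrantClient, models

# Simplified sketch of the upload workflow that hits the DEADLINE_EXCEEDED
# error above (the real script pulls batches from a dataset library; the
# collection name, batch layout, and timeout here are illustrative).
client = QdrantClient(host="localhost", grpc_port=6334, prefer_grpc=True, timeout=60)

def upload(batches):
    for ids, vectors, payloads in batches:
        client.upsert(
            collection_name="contigs",
            points=[
                models.PointStruct(id=i, vector=v, payload=p)
                for i, v, p in zip(ids, vectors, payloads)
            ],
            wait=True,  # the gRPC deadline is exceeded while waiting on this call
        )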