go-control-plane icon indicating copy to clipboard operation
go-control-plane copied to clipboard

Deadlock in Delta XDS if number of resources is more than 10

Open lobkovilya opened this issue 2 years ago • 8 comments

In Kuma we're implementing the ResourceSnapshot interface and our snapshot has more than 20 distinct resource types.

After upgrading to v0.12.0 we see deadlocks on the server and it seems like it was caused by https://github.com/envoyproxy/go-control-plane/pull/752. AFAICT the potential deadlock was fixed by increasing the channel capacity, but it's not fixed for us since our snapshot has more resource types.

I'm looking for help or advice on what'd be the best way to fix it? It doesn't seem right that server depends on the number of resource types in the snapshot to work properly.

lobkovilya avatar Feb 07 '24 11:02 lobkovilya

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 08 '24 12:03 github-actions[bot]

There is a PR open so I wouldn't call it stale.

slonka avatar Mar 08 '24 12:03 slonka

Some additional context.

Why the deadlock appeared after https://github.com/envoyproxy/go-control-plane/pull/752?

The PR replaces a response channel per watch model with a response channel per all watches. This was important to support ADS ordering, but at the same time, xDS was affected by the change.

Why there is no deadlock in SotW?

SotW has 2 different implementations for xds and ads. ADS has a response channel per all watches while xDS still has a response channel per watch.

How the deadlock can be solved in Delta?

  1. Introduce proper xDS implementation, so that the Delta server has both xds.go and ads.go similar to the SotW server.
  2. Let the capacity of a response channel be configurable (see my PR https://github.com/envoyproxy/go-control-plane/pull/876)
  3. Build some kind of an "unbounded channel" abstraction when the writer never blocks and the internal buffer is reallocated and shrunk dynamically (I know it's a tiny bit crazy idea 😅).

Probably there are some other options I haven't thought of. @valerian-roche @jpeach @alecholmez please take a look, I'd appreciate any input here, and I'd be happy to contribute the solution.

lobkovilya avatar Mar 29 '24 20:03 lobkovilya

Hey, sorry for the delay, I have not had much capacity to work on this recently. In my opinion the first proposal is not going in the correct direction. The sotw server implementation without ads ordering will also go away, as it is really inefficient on some aspects, and require additional maintenance effort. The second one as discussed would imo be pushing back dangerous complexity on the user. The risk of deadlock might be very hard to assess if an error is made by 1 for instance. My main proposal was to reimplement a part of the previous model, only for types not explicitly defined in the xds protocol as requiring ordering. This would allow the "unbounded behavior" of the previous implementation for users specifically using non-classic xds resources, while keeping ordering, and lower footprint of goroutines, for users only needing the basic types.

I plan on spending some time on the control-plane in Q2, though likely not in the coming weeks. If you want to take a stab at an implementation I can review it.

valerian-roche avatar Apr 02 '24 00:04 valerian-roche

Thanks for the reply @valerian-roche!

My main proposal was to reimplement a part of the previous model, only for types not explicitly defined in the xds protocol as requiring ordering. This would allow the "unbounded behavior" of the previous implementation for users specifically using non-classic xds resources, while keeping ordering, and lower footprint of goroutines, for users only needing the basic types.

Just checking my understanding, does it mean the select statement for Delta has to be re-implemented as reflect.Select in SotW? https://github.com/envoyproxy/go-control-plane/blob/f2468473b7cde3c30a6824aacf350ff3c2a91c19/pkg/server/sotw/v3/xds.go#L53

So that each "not explicitly defined" type adds a SelectCase with its own watch.response channel?

lobkovilya avatar Apr 02 '24 08:04 lobkovilya

Hey, I'd like to not go in this direction. The old model of sotw is fundamentally different from the model of delta (even before the ads change), and aligning with it will require maintaining two entirely different loops.

Prior to the ads change delta used to fork a goroutine for each watch. I believe this can be reused if the resource of the watch is not a standard xds resource considered within the main channel buffering.

This would likely require having another muxed channel for those watches, as queuing in the other one could lead to deadlock, but adding an entry on the select should not create an issue. Overall I expect the change to be:

  • re-add support for a channel in watch to be closed on cancel. It will be nil in the general case, which is not an issue
  • add a new case in the select with a channel buffered with a "default" capacity (not fully mattering in this context) which will run the send code without the unneeded "process other" here as order is irrelevant

valerian-roche avatar Apr 03 '24 02:04 valerian-roche

If my understanding is correct, having another muxed channel for those watchers with the "default" capacity can still lead to deadlocks. Imagine two things happening simultaneously:

  1. a goroutine on the server calls SetSnapshot
  2. at the same time, the server receives a DeltaDiscoveryRequest

If the Snapshot contains more resource types than the "default" capacity, the SetSnapshot function might block when attempting to write to the new muxed channel, and during this block, it holds the cache mutex. At the same time, the server gets a req from <-reqCh in a select statement, it attempts to call either watch.Cancel() or s.cache.CreateDeltaWatch(...). However, this action also gets blocked because the cache mutex is already held by the first goroutine. That's how we get a deadlock today.

That's why I brought up the reflect.Select. The ability to dynamically change select cases eliminates potential deadlocks. Regardless of the number of types in the Snapshot, the SetSnapshot never blocks with reflect.Select. So it's not really about aligning the code with SotW, I just don't see how else we can avoid the deadlock.

lobkovilya avatar Apr 03 '24 17:04 lobkovilya

@valerian-roche what do you think? Does it make sense to go with reflect.Select in Delta because of the reasons I mentioned above?

lobkovilya avatar Apr 24 '24 15:04 lobkovilya