
Reserve CPU for system operations

SStorm opened this issue 2 years ago (status: Open)

Problem Statement

With the right set of heavy queries, it is possible to exhaust all available CPU on a CrateDB Cloud cluster. At that point the cluster becomes unresponsive and difficult to debug (e.g. it is no longer possible to query sys.jobs, sys.jobs_log, and other system tables).

It would be super useful if CrateDB had some form of QoS for thread pools and always reserved a fraction of a CPU for system management operations.

An inspiration for this is the reserved-blocks feature of Linux ext filesystems, where the last 5% (configurable) of disk space is reserved for root.
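To make the reservation idea concrete, here is a minimal, hypothetical Java sketch (not CrateDB code; the class and method names are invented for illustration). A fixed share of "slots" is withheld from user queries, so system operations can still acquire capacity even when user load is saturated:

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: reserve a fraction of capacity for system
// operations, analogous to the ~5% of disk blocks that ext filesystems
// reserve for root. User queries may only draw from the unreserved
// share; system queries may use the full capacity.
class ReservedCapacity {
    private final Semaphore userSlots;  // capacity minus the reserved share
    private final Semaphore allSlots;   // full capacity, for system work

    ReservedCapacity(int totalSlots, int reservedForSystem) {
        this.userSlots = new Semaphore(totalSlots - reservedForSystem);
        this.allSlots = new Semaphore(totalSlots);
    }

    boolean tryAcquireUser() {
        // A user query needs both a user slot and a global slot.
        if (!userSlots.tryAcquire()) {
            return false;
        }
        if (!allSlots.tryAcquire()) {
            userSlots.release();
            return false;
        }
        return true;
    }

    boolean tryAcquireSystem() {
        // System work only needs a global slot, so the reserved share
        // stays available even when user queries saturate their share.
        return allSlots.tryAcquire();
    }
}
```

With `ReservedCapacity(4, 1)`, user queries can hold at most three slots at once, while a system query can still claim the fourth.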

Possible Solutions

If QoS is not possible, could we at least tweak the thread pool sizes when starting a cluster, assuming there is a separate thread pool for management/system operations?

Considered Alternatives

No response

SStorm · May 05 '23 12:05

I suspect the problem here isn't that threads aren't given enough CPU time, but rather that there is too much GC load. You should be seeing GC warnings in the logs.

The kernel scheduler should already ensure that each thread receives its share of CPU time. The thread pools already add some of the QoS you describe: we use them to deal with blocking IO and run certain kinds of queries on different thread pools. E.g. system queries that don't hit IO run directly on the netty thread pool, regular SELECTs on user tables run in the search thread pool, and sys.shards queries use the get thread pool.
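The routing described above can be sketched roughly as follows (simplified, hypothetical names, not CrateDB's actual classes): each query category is dispatched to its own pool, so a flood of user-table SELECTs cannot occupy the threads that serve sys.shards lookups or IO-free system queries.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of per-category query dispatch.
class QueryDispatch {
    enum Kind { SYSTEM_NO_IO, USER_SELECT, SYS_SHARDS }

    private final ExecutorService nettyLoop = Executors.newSingleThreadExecutor();
    private final ExecutorService searchPool = Executors.newFixedThreadPool(4);
    private final ExecutorService getPool = Executors.newFixedThreadPool(2);

    static String poolNameFor(Kind kind) {
        switch (kind) {
            case SYSTEM_NO_IO: return "netty";   // system queries without IO
            case USER_SELECT:  return "search";  // regular SELECTs on user tables
            case SYS_SHARDS:   return "get";     // sys.shards queries
            default: throw new IllegalArgumentException(kind.toString());
        }
    }

    Future<?> submit(Kind kind, Runnable task) {
        switch (kind) {
            case SYSTEM_NO_IO: return nettyLoop.submit(task);
            case USER_SELECT:  return searchPool.submit(task);
            case SYS_SHARDS:   return getPool.submit(task);
            default: throw new IllegalArgumentException(kind.toString());
        }
    }
}
```

The separation helps with blocking IO, but note that it does not by itself prevent one pool from consuming all CPU time; it only prevents one category from stealing another category's threads.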

We already track various other improvements that would reduce GC load, or would help prevent GC load from escalating, so I'm closing this.

Some of the more general ones:

  • https://github.com/crate/crate/issues/10063
  • https://github.com/crate/crate/issues/10505
  • https://github.com/crate/crate/issues/13956

mfussenegger · May 08 '23 07:05

Re-opening this. Maybe we have cases where we overload the netty workers, which could lead to new requests (e.g. follower checks & pings) no longer being processed.

~~But this needs some more investigation.~~

Update: It looks like many issues are caused by the current query scheduling approach. Table scans can use up the search thread pool, which causes other queries to queue up.

To address that, one option could be to tweak query scheduling, e.g. a cooperative approach where RowConsumers yield after they have run for some time, or some other work-stealing approach.
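A minimal sketch of the cooperative idea, under stated assumptions: the names here (YieldingScan, sink, onDone) are hypothetical, not CrateDB's RowConsumer API. The task processes rows until its time slice is used up, then re-submits itself so that queued queries get a turn on the pool.

```java
import java.util.Iterator;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

// Cooperative scan: process rows until the time slice elapses, then
// re-enqueue this task instead of monopolizing the pool thread.
class YieldingScan<T> implements Runnable {
    private final Iterator<T> rows;
    private final Consumer<T> sink;
    private final Executor pool;
    private final long sliceNanos;
    private final Runnable onDone;

    YieldingScan(Iterator<T> rows, Consumer<T> sink, Executor pool,
                 long sliceNanos, Runnable onDone) {
        this.rows = rows;
        this.sink = sink;
        this.pool = pool;
        this.sliceNanos = sliceNanos;
        this.onDone = onDone;
    }

    @Override
    public void run() {
        long deadline = System.nanoTime() + sliceNanos;
        while (rows.hasNext()) {
            sink.accept(rows.next());
            if (System.nanoTime() >= deadline && rows.hasNext()) {
                pool.execute(this);   // yield: let other queued tasks run first
                return;
            }
        }
        onDone.run();
    }
}
```

Because yielding happens only at row boundaries, a single very expensive row can still overrun its slice; a work-stealing approach would instead rebalance tasks across idle threads.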

mfussenegger · Jun 27 '23 12:06

Status: This and similar tickets are now essentially on hold. One investigation/repro I did recently strongly hinted that Kubernetes CPU throttling causes the situations where you can no longer connect and log in to the cluster. So we'll have to wait until the Cloud team is able to validate that hypothesis.

The exception would be if similar behavior can be observed (and reproduced) outside of CrateDB Cloud.

henrikingo · Jan 07 '25 11:01

Looks like this was accidentally closed.

BaurzhanSakhariev · Feb 18 '25 10:02

> Looks like this was accidentally closed.

The 6.1 board had a workflow to close issues when they were added to "Should" instead of "Done". Fixed it.

mfussenegger · Feb 18 '25 11:02

Just recording here: when I read up on the Lucene 10 release notes, one of the main performance improvements was increased concurrency for a single query. And these improvements have largely also been back-ported to Lucene 9.x. So it is possible that this issue has crept up on us as a result of the occasional Lucene 9.x upgrades.

henrikingo · Feb 18 '25 12:02

> ~~The cluster at that point becomes unresponsive and difficult to debug (e.g. it is no longer possible to query sys.jobs, sys.jobs_log, and other system tables).~~

~~Another random idea: could we use the MANAGEMENT thread pool (which is of the scaling type and not bounded) for SELECTs on system tables, to at least keep system tables available? Or introduce a SYSTEM_SEARCH pool and make it unbounded.~~

~~This is of course not a replacement for the idea to tweak query scheduling (because it doesn't cover user tables), but rather a complementary thing that is also low-hanging fruit for a narrow scope.~~

UPD: Actually, a cluster being responsive only for sys tables is not good and can even be misleading: one might assume that the SEARCH pool is exhausted when there could be another reason. Better to have a generic solution for all tables.

BaurzhanSakhariev · Feb 18 '25 12:02

> ~~The cluster at that point becomes unresponsive and difficult to debug (e.g. it is no longer possible to query sys.jobs, sys.jobs_log, and other system tables).~~
>
> ~~Another random idea: could we use the MANAGEMENT thread pool (which is of the scaling type and not bounded) for SELECTs on system tables, to at least keep system tables available? Or introduce a SYSTEM_SEARCH pool and make it unbounded.~~
>
> ~~This is of course not a replacement for the idea to tweak query scheduling (because it doesn't cover user tables), but rather a complementary thing that is also low-hanging fruit for a narrow scope.~~
>
> UPD: Actually, a cluster being responsive only for sys tables is not good and can even be misleading: one might assume that the SEARCH pool is exhausted when there could be another reason. Better to have a generic solution for all tables.

We already use different pools depending on the query:

https://github.com/crate/crate/blob/9d813a4569419d6c44b60c7b8d4ac91a913746ea/server/src/main/java/io/crate/execution/engine/collect/CollectTask.java#L251-L264

and sys.nodes, for example, even has timeouts integrated for requests to other nodes, to ensure the admin UI keeps working if a subset of the nodes is not reachable.
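The per-node timeout idea can be sketched like this (hypothetical names, not CrateDB's actual fan-out code): every remote node gets its own deadline, and a node that does not answer in time contributes a fallback value instead of stalling the whole result.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Fan out to all nodes with a per-node deadline; unreachable nodes are
// replaced by a fallback value so the overall query still completes.
class NodeFanOut {
    static List<String> collect(List<Supplier<CompletableFuture<String>>> nodes,
                                long timeoutMillis) {
        // Start all requests first so the timeouts run concurrently.
        List<CompletableFuture<String>> inFlight = nodes.stream()
            .map(Supplier::get)
            .map(f -> f.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                       .exceptionally(err -> "unreachable"))
            .collect(Collectors.toList());
        return inFlight.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
    }
}
```

The key design point is that the timeout is attached to each node's future individually, so one slow or dead node cannot delay the rows from the healthy nodes by more than the configured deadline.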

> Just recording here: when I read up on the Lucene 10 release notes, one of the main performance improvements was increased concurrency for a single query. And these improvements have largely also been back-ported to Lucene 9.x. So it is possible that this issue has crept up on us as a result of the occasional Lucene 9.x upgrades.

The concurrency improvements are only relevant if you pass an executor to IndexSearcher.search; we don't use that at the moment because our parallelism mechanism works a bit differently.
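For context, the pattern Lucene applies when an IndexSearcher is constructed with an executor looks roughly like this plain-Java sketch: each index segment is searched as its own task and the partial results are merged afterwards. This is a simplified stand-in, not Lucene's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Segment-parallel search: one task per segment, results merged at the end.
class SegmentParallelSearch {
    static int countMatches(List<int[]> segments, int term, ExecutorService executor) {
        // Submit one task per segment before awaiting any result,
        // so all segments are searched concurrently.
        List<Future<Integer>> partials = new ArrayList<>();
        for (int[] segment : segments) {
            partials.add(executor.submit(() -> {
                int hits = 0;
                for (int doc : segment) {
                    if (doc == term) {
                        hits++;
                    }
                }
                return hits;
            }));
        }
        int total = 0;
        for (Future<Integer> partial : partials) {
            try {
                total += partial.get();   // merge the per-segment counts
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        return total;
    }
}
```

This also illustrates why such concurrency can increase pressure on a shared pool: a single query now occupies one thread per segment instead of one thread total.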

mfussenegger · Feb 18 '25 13:02

Folding this into https://github.com/crate/crate/issues/17646

mfussenegger · Mar 25 '25 14:03