clickhouse-operator icon indicating copy to clipboard operation
clickhouse-operator copied to clipboard

High query load from CHOP

Open cw9 opened this issue 9 months ago • 3 comments

Sometimes when CHOP is reconciling, it would send very high query load to each pod in the cluster.

e.g.: Below shows the top 5 queries received in 2 minutes by one ClickHouse node in a cluster of 120, all come from the CHOP


SELECT
    query,
    count() AS cnt
FROM system.query_log
WHERE (event_time >= toDateTime('2025-04-16 23:46:00')) AND (event_time <= toDateTime('2025-04-16 23:48:00'))
GROUP BY query
ORDER BY cnt DESC
LIMIT 5

Query id: 94e7460f-36db-4f99-9837-dc3040a2b9f6

   ┌─query────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───cnt─┐
1. │ SELECT DISTINCT `database`, `name`, `create_table_query` FROM `system`.`tables`                                                                                                                  │ 10848 │
2. │ SELECT DISTINCT arrayJoin([`database`, extract(`engine_full`, 'Distributed\\([^,]+, *\'?([^,\']+)\'?, *[^,]+')]) AS `database` FROM `system`.`tables` AS `tables` WHERE `engine` = 'Distributed' │  6718 │
3. │ DESC TABLE system.tables                                                                                                                                                                         │   363 │
4. │ DESC TABLE system.functions                                                                                                                                                                      │   226 │
5. │ SELECT DISTINCT `name`, replaceRegexpOne(`create_query`, 'CREATE (FUNCTION)', 'CREATE \\1 IF NOT EXISTS') FROM `system`.`functions` AS `tables` WHERE `create_query` != ''                       │   225 │
   └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────┘

This caused the CPU on the ClickHouse to be overloaded and returned error TOO_MANY_SIMULTANEOUS_QUERIES .

Could you help me understand what could have caused this? Is there any config I can tune to make these queries less aggressive?

This is related to issue: https://github.com/Altinity/clickhouse-operator/issues/1685, because I think some of the high query load is caused by clusterAllReplicas used in some of these CHOP queries which amplified the load received by each ClickHouse node

cw9 avatar Apr 17 '25 13:04 cw9

Hi @cw9 , how many nodes do you have? It is very weird that counts for tables and functions are so different, those are called one by one. See https://github.com/Altinity/clickhouse-operator/blob/master/pkg/model/chi/schemer/replicated.go https://github.com/Altinity/clickhouse-operator/blob/master/pkg/model/chi/schemer/distributed.go

alex-zaitsev avatar Apr 22 '25 12:04 alex-zaitsev

I have a total of 120 nodes, this shows the queries received by one of the nodes, many nodes in the cluster received similar query load and they all got overloaded on CPU. This has happened several times.

cw9 avatar Apr 22 '25 14:04 cw9

You may disable schema propagation to new nodes completely, that will reduce the load to the cluster:

spec:
  configuration:
    clusters:
    - name: my-cluster
      schemaPolicy:
        replica: None
        shard: None

alex-zaitsev avatar May 26 '25 15:05 alex-zaitsev