clickhouse-java Health check not working properly

Describe the bug

The memory limitation of the ClickHouse service on a single node affects insertion, and after restarting ClickHouse, the application service does not recover and keeps looking for an available node. After the health check, it should switch from the faulty node to the healthy node.

Steps to reproduce

The application service runs continuously to insert data into ClickHouse.
Restart the ClickHouse node.
Observe whether the application service can insert data into ClickHouse.

Expected behaviour

After the clickhouse restarts, the application service should return to normal after observing for a period of time

Code example

        Properties properties = new Properties(config.getOptions());
        properties.setProperty(ClickHouseDefaults.USER.getKey(), config.getUser());
        properties.setProperty(ClickHouseDefaults.PASSWORD.getKey(), config.getPassword());

        properties.setProperty(ClickHouseClientOption.AUTO_DISCOVERY.getKey(),
            String.valueOf(config.getAutoDiscovery()));
        if (config.getAutoDiscovery()) {
            properties
                .setProperty(ClickHouseClientOption.NODE_DISCOVERY_INTERVAL.getKey(), "30000");
        }
        properties.setProperty(ClickHouseClientOption.LOAD_BALANCING_POLICY.getKey(), "roundRobin");
        properties.setProperty(ClickHouseClientOption.HEALTH_CHECK_INTERVAL.getKey(), "30000");
        properties.setProperty(ClickHouseClientOption.FAILOVER.getKey(), "3");
        properties.setProperty(ClickHouseClientOption.CHECK_ALL_NODES.getKey(), "1");
        properties.setProperty(ClickHouseClientOption.MAX_QUEUED_BUFFERS.getKey(), "0");
        properties.setProperty(ClickHouseClientOption.DECOMPRESS.getKey(), "1");
        try {
            return new ClickHouseDataSource(
                config.getEndpoint(), properties);
        } catch (SQLException e) {
            throw new RuntimeException("Failed to create Clickhouse data source", e);
        }
    }

Error log

Caused by: java.sql.SQLException: Failed to get single-node at com.clickhouse.jdbc.SqlExceptionUtils.clientError(SqlExceptionUtils.java:81) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.internal.ClickHouseConnectionImpl.(ClickHouseConnectionImpl.java:258) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.ClickHouseDataSource.getConnection(ClickHouseDataSource.java:46) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.ClickHouseDataSource.getConnection(ClickHouseDataSource.java:16) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(DataSourceUtils.java:158) ~[spring-jdbc-5.3.4.jar:5.3.4] at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:116) ~[spring-jdbc-5.3.4.jar:5.3.4] at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:79) ~[spring-jdbc-5.3.4.jar:5.3.4] ... 9 more Caused by: java.lang.IllegalArgumentException: ClickHouseNodes [checking=false, index=0, lock=r1w0, nodes=0, faulty=1, policy=RoundRobinPolicy, tags=[]]@305219537 does not contain suitable node for ClickHouseNodeSelector [protocols=[], tags=[]] at com.clickhouse.client.ClickHouseLoadBalancingPolicy$RoundRobinPolicy.get(ClickHouseLoadBalancingPolicy.java:94) ~[clickhouse-cli-client-0.3.2-patch11-shaded.jar:clickhouse-cli-client 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.client.ClickHouseNodes.apply(ClickHouseNodes.java:514) ~[clickhouse-cli-client-0.3.2-patch11-shaded.jar:clickhouse-cli-client 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.internal.ClickHouseConnectionImpl.(ClickHouseConnectionImpl.java:256) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.ClickHouseDataSource.getConnection(ClickHouseDataSource.java:46) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.ClickHouseDataSource.getConnection(ClickHouseDataSource.java:16) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(DataSourceUtils.java:158) ~[spring-jdbc-5.3.4.jar:5.3.4] at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:116) ~[spring-jdbc-5.3.4.jar:5.3.4] at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:79) ~[spring-jdbc-5.3.4.jar:5.3.4]

reason

The inconsistency between the nodes checked for health and the ones receiving actual requests is likely caused by the following code.

1. 
 public void check() {
        // exclusive access
        if (!checking.compareAndSet(false, true)) {
            return;
        }

        Set<ClickHouseNode> list = new LinkedHashSet<>();
        long currentTime = System.currentTimeMillis();
        boolean checkAll = template.config.getBoolOption(ClickHouseClientOption.CHECK_ALL_NODES);
        int healthyNodeStartIndex = -1;
        lock.readLock().lock();
        // TODO:
        // 1) minimize the list;
        // 2) detect flaky node and check it again later in a less frequent way
        try {
            pickNodes(faultyNodes, selector, list, groupSize, currentTime);
            if (checkAll) {
                healthyNodeStartIndex = list.size();
                pickNodes(nodes, selector, list, groupSize, currentTime);
            }
        } finally {
            checking.set(false);
            lock.readLock().unlock();
        }

        boolean hasFaultyNode = false;
        int count = 0;
        try {
            for (ClickHouseNode node : list) {
                ClickHouseNode n = node.probe();
                // probe is faster than ping but it cannot tell if the server works or not
                boolean isAlive = false;
                try (ClickHouseClient client = ClickHouseClient.builder().agent(false).config(n.config)
                        .nodeSelector(ClickHouseNodeSelector.of(n.getProtocol())).build()) {
                    isAlive = client.ping(n, n.config.getConnectionTimeout());
                } catch (Exception e) {
                    // ignore
                }
                if (!n.equals(node)) {
                    update(n, Status.MANAGED);
                    update(node, Status.STANDALONE);
                }

                if (isAlive) {
                    if (healthyNodeStartIndex < 0 || count < healthyNodeStartIndex) {
                        update(n, Status.HEALTHY);
                    }
                } else {
                    hasFaultyNode = true;
                    if (healthyNodeStartIndex >= count) {
                        update(n, Status.FAULTY);
                    }
                }
                count++;
            }
        } catch (Exception e) {
            log.warn("Unexpected error occurred when checking node status", e);
        } finally {
            if (checkAll || hasFaultyNode) {
                scheduleHealthCheck();
            }
        }
    }
2. 
    default boolean ping(ClickHouseNode server, int timeout) {
        if (server != null) {
            if (timeout < 1) {
                timeout = server.config.getConnectionTimeout();
            }
            if (server.getProtocol() == ClickHouseProtocol.ANY) {
                server = ClickHouseNode.probe(server.getHost(), server.getPort(), timeout);
            }
            try (ClickHouseResponse resp = read(server) // create request
                    .option(ClickHouseClientOption.ASYNC, false) // use current thread
                    .option(ClickHouseClientOption.CONNECTION_TIMEOUT, timeout)
                    .option(ClickHouseClientOption.SOCKET_TIMEOUT, timeout)
                    .option(ClickHouseClientOption.BUFFER_SIZE, 8) // actually 4 bytes should be enough
                    .option(ClickHouseClientOption.MAX_QUEUED_BUFFERS, 1) // enough with only one buffer
                    .format(ClickHouseFormat.TabSeparated)
                    .query("SELECT 1 FORMAT TabSeparated").execute()
                    .get(timeout, TimeUnit.MILLISECONDS)) {
                return resp != null;
            } catch (Exception e) {
                // ignore
            }
        }

        return false;
    }
3. 
    public final ClickHouseNode getServer() {
        ClickHouseNode node = serverRef.get();
        if (node == null) {
            node = server.apply(getClient().getConfig().getNodeSelector());
            if (!serverRef.compareAndSet(null, node)) {
                node = serverRef.get();
            }
        }
        return node;
    }

Configuration

Environment

Client version: 0.3.2-patch11
Language version:
OS:

ClickHouse server

ClickHouse Server version:
ClickHouse Server non-default settings, if any:
CREATE TABLE statements for tables involved:
Sample data for all these tables, use clickhouse-obfuscator if necessary

### Tasks

Jun 16 '23 07:06 wang-qijia

Thanks for the report @wang-qijia. It is indeed looks like a bug, I'll see if I can reproduce the issue at my end.

Jun 16 '23 10:06 zhicwu

@zhicwu Hi, is there any progress on this matter?

Jun 19 '23 02:06 wang-qijia