Retry query on another node if execution status != 200
Hello, I'm currently testing chproxy in a test environment and I have a question about query execution. Let's say I have 2 nodes with replication, 4 ZooKeeper instances, and 1 chproxy balancing read/write queries between the two nodes. I also have a stream of data from dozens of servers going to chproxy. I have configured a health check that selects from a specific path in the replicated table, to make sure that both nodes have the table and the database itself.
In my test env I removed access to ZooKeeper from one of the nodes, which rendered the database on that node read-only, yet the health check SELECT didn't mark the node as faulty. At the same time, all INSERT requests routed to the read-only node failed with error code 500, and all the failed INSERT requests were lost.
Using /metrics I can see that chproxy can check the query execution status, but I can't see any way to re-execute a failed query on another node when the response status from the first node was not 200, or alternatively to store failed queries for manual recovery. Am I missing something? Thanks!
Hello @wlp7s0, I'll try to reproduce it. In the meantime, I'd advise you to add a retry strategy on the client side and to put a message bus in front of your insertion services, so that you stay resilient to ClickHouse downtime.
Hi @wlp7s0,
I performed the following test scenario:
- set up a ClickHouse cluster consisting of 4 nodes
- chproxy targets that cluster; all 4 nodes are marked as healthy
- manually kill one node
- chproxy correctly marked the killed node as unhealthy
- chproxy excluded it from the list of available nodes
I failed to reproduce the scenario you described. Could you please provide steps to reproduce it?
Hello @gontarzpawel
What about another scenario, such as status code 404?
For example, I have 3 nodes and 2 tables [A, B].
Table A is a replicated table and exists on all nodes; table B isn't replicated and exists on only one node.
When I execute "select * from B", I sometimes get the exception: Table B doesn't exist. (UNKNOWN_TABLE)
Is there any way to make chproxy try again on other nodes when a table doesn't exist?
Also, I changed this line
https://github.com/ContentSquare/chproxy/blob/aeca5b7345fe6370f54d0fa048152c2f7066aad6/proxy.go#L215
to "if rw.StatusCode() != http.StatusOK"
but it still doesn't work.
IMHO, in this situation you should fix your ClickHouse config, or rewrite your query to specify the server that contains table B using the remote syntax: https://clickhouse.com/docs/en/sql-reference/table-functions/remote/
Regarding retry-ability, we looked at the error codes returned by ClickHouse and decided to retry only when it makes sense (i.e. when a retry can make the failed query work). If we allowed a retry on 404, then every time someone made a mistake the query would be retried even though it cannot succeed, and that would slow down the query response time.