chproxy Retry query on another node if execution status != 200

Hello, I'm currently testing chproxy in a test environment and I have a question about query execution. Say I have 2 replicated nodes and 4 ZooKeeper nodes, with 1 chproxy balancing read/write queries between the two nodes. I also have a stream of data from dozens of servers going to chproxy. I have configured a health check that selects a specific path in a replicated table, to make sure that both nodes have the table and the database itself. In my test env, however, I removed access to ZooKeeper from one of the nodes, which rendered the database on that node read-only, and the health check SELECT didn't mark the node as faulty. At the same time, all INSERT requests sent to the read-only node failed with error code 500, and all of those failed INSERT requests were lost. From /metrics I can see that chproxy tracks the query execution status, but I can't see any way to re-execute the failed query on another node when the response status from the first node was not 200, or, alternatively, to store failed queries for manual recovery. Am I missing something? Thanks!

Apr 23 '21 12:04 wlp7s0

Hello @wlp7s0, I'll try to reproduce it. I'd advise you to add a retry strategy on the client side and to put a message bus in front of your insertion services, so that you are resilient to ClickHouse downtime.
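
To make the client-side suggestion concrete, here is a minimal Python sketch of a retry with failover and a spool file for manual recovery. Everything in it is an assumption for illustration: the function names, the endpoint list, the backoff values and the spool format are hypothetical and not part of chproxy. The transport is injected as a `send` callable so the retry logic can be tried without a live server; `http_send` shows one possible implementation against the ClickHouse HTTP interface.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Iterable, Optional

# send(endpoint, query, body) -> HTTP status code (hypothetical signature).
SendFn = Callable[[str, str, bytes], int]


def http_send(endpoint: str, query: str, body: bytes) -> int:
    """POST the data to an HTTP endpoint, passing the query as a URL parameter."""
    url = endpoint + "/?" + urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(url, data=body, method="POST")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses such as 500 arrive here


def insert_with_retry(
    send: SendFn,
    endpoints: Iterable[str],
    query: str,
    body: bytes,
    attempts_per_endpoint: int = 2,
    backoff_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try each endpoint in turn; return the one that answered 200, else None."""
    for endpoint in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                status = send(endpoint, query, body)
            except OSError:
                status = 0  # connection-level failure, handled like a non-200
            if status == 200:
                return endpoint
            sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
    return None


def spool_failed_insert(path: str, query: str, body: bytes) -> None:
    """Append a failed insert to a JSON-lines file for manual recovery."""
    record = {"query": query, "body": body.decode("utf-8", errors="replace")}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

A caller would invoke `insert_with_retry(http_send, endpoints, query, body)` and call `spool_failed_insert` when it returns `None`. Note that a retried INSERT can be applied twice if the first attempt partly succeeded before failing; whether that matters depends on the table engine and its deduplication settings.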

Jan 20 '22 17:01 gontarzpawel

Hi @wlp7s0,

I performed the following test scenario:

1. Set up a ClickHouse cluster consisting of 4 nodes.
2. Pointed chproxy at that cluster; all 4 nodes were marked as healthy.
3. Manually killed one node.
4. chproxy correctly marked the killed node as unhealthy.
5. chproxy excluded it from the list of available nodes.

I failed to reproduce the scenario you described. Could you please provide steps to reproduce it?

Jan 28 '22 15:01 gontarzpawel

Hello @gontarzpawel, how about another scenario, such as status code 404 or similar? For example, I have 3 nodes and 2 tables [A, B].
Table A is replicated and exists on all nodes; table B isn't replicated and exists on only one node. When I execute "select * from B", I sometimes get the exception: Table B doesn't exist. (UNKNOWN_TABLE). Is there any way to make chproxy try again on other nodes when a table doesn't exist? I also changed this line https://github.com/ContentSquare/chproxy/blob/aeca5b7345fe6370f54d0fa048152c2f7066aad6/proxy.go#L215 to "if rw.StatusCode() != http.StatusOK", but that hasn't worked so far.

Jan 05 '23 14:01 ranjbaryshahab

IMHO, in this situation you should fix your ClickHouse config or rewrite your query to specify the server that contains table B, using the remote syntax: https://clickhouse.com/docs/en/sql-reference/table-functions/remote/
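
As a rough illustration of the second option, the Python sketch below builds a query that reads table B through the `remote` table function, so the result no longer depends on which node the proxy picks. The host name `node3:9000` and the `default` database are assumptions, and the helper name is hypothetical; the arguments are interpolated as-is, so they should only come from trusted configuration.

```python
def remote_select(host_port: str, database: str, table: str, columns: str = "*") -> str:
    """Build a SELECT that reads a table from one specific server via remote()."""
    # remote('addresses_expr', db, table) is the three-argument form of the function.
    return f"SELECT {columns} FROM remote('{host_port}', {database}, {table})"


# Assumed location of table B; replace with the node that actually holds it.
query = remote_select("node3:9000", "default", "B")
print(query)
```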

Regarding retryability, we looked at the error codes returned by ClickHouse and decided to retry only when it makes sense (i.e. when a retry can make the failed query succeed). If we allowed a retry on 404, then every time someone made a mistake the query would be retried even though it cannot succeed, which would slow down the query response time.
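
The reasoning above can be sketched as a small retry policy in Python. This is not chproxy's code: the set of retryable statuses below is an assumption chosen for illustration (gateway-style errors where another node might succeed), and `should_retry` and `max_attempts` are hypothetical names.

```python
from http import HTTPStatus

# Assumed set for illustration only; not taken from chproxy's source.
RETRYABLE_STATUSES = frozenset({
    HTTPStatus.BAD_GATEWAY,          # 502
    HTTPStatus.SERVICE_UNAVAILABLE,  # 503
    HTTPStatus.GATEWAY_TIMEOUT,      # 504
})


def should_retry(status: int, attempts_made: int, max_attempts: int = 2) -> bool:
    """Retry only while attempts remain and the status suggests another node may succeed."""
    return attempts_made < max_attempts and status in RETRYABLE_STATUSES
```

Under this assumed policy a 404 for a missing table is returned to the client immediately, since retrying a mistyped query on another node would only add latency. A 500 is also not retried here, which matches the behaviour reported at the top of this thread for the read-only node.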

Jan 07 '23 09:01 mga-chka