Correct way to upgrade the operator from 0.23.7 to 0.24+ with multiple ClickHouse Keeper replicas
Hello, I am following the https://github.com/Altinity/clickhouse-operator/blob/0.24.0/docs/keeper_migration_from_23_to_24.md instructions to upgrade the operator from 0.23.7 to 0.24.5.
Unfortunately, I ran into an issue while upgrading an installation with 3 ClickHouse Keeper replicas:
With the old operator version all keepers lived in a single StatefulSet exposed through a single headless service; with 0.24.5 a separate StatefulSet and Service is created for each replica, which changes the addresses of all keeper replicas.
For example, in my installation on the old version I had 3 keeper addresses:
clickhouse-keeper-logging-0.clickhouse-keeper-logging-headless
clickhouse-keeper-logging-1.clickhouse-keeper-logging-headless
clickhouse-keeper-logging-2.clickhouse-keeper-logging-headless
And after the update they changed to:
chk-clickhouse-keeper-logging-default-0-0
chk-clickhouse-keeper-logging-default-1-0
chk-clickhouse-keeper-logging-default-2-0
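For reference, the new per-replica objects can be listed like this (the object name prefix is the one from my installation and will differ in yours):
# List the StatefulSets and Services the 0.24.x operator creates per keeper replica
kubectl get statefulsets,services | grep chk-clickhouse-keeper-logging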
I can see that <raft_configuration> was indeed updated to the correct hosts after the upgrade:
<raft_configuration>
    <server>
        <id>0</id>
        <hostname>chk-clickhouse-keeper-logging-default-0-0</hostname>
        <port>9444</port>
    </server>
    <server>
        <id>1</id>
        <hostname>chk-clickhouse-keeper-logging-default-0-1</hostname>
        <port>9444</port>
    </server>
    <server>
        <id>2</id>
        <hostname>chk-clickhouse-keeper-logging-default-0-2</hostname>
        <port>9444</port>
    </server>
</raft_configuration>
However, even with the updated raft_configuration, the keeper replicas were unable to reach each other: the hosts stored in the replication log were not updated in the process, so the replicas kept using the old addresses:
clickhouse-keeper-logging-1:/$ clickhouse-keeper client -h localhost --port 2181 -q "get '/keeper/config'"
server.0=clickhouse-keeper-logging-0.clickhouse-keeper-logging-headless:9444;participant;1
server.2=clickhouse-keeper-logging-2.clickhouse-keeper-logging-headless:9444;participant;1
server.1=clickhouse-keeper-logging-1.clickhouse-keeper-logging-headless:9444;participant;1
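A quick way to confirm that every replica holds the same stale membership is to run the same query against each new pod (the pod names below are my guess at the StatefulSet pod naming, adjust them to your installation):
# Compare /keeper/config as seen by each of the three new keeper pods
for pod in chk-clickhouse-keeper-logging-default-0-0-0 \
           chk-clickhouse-keeper-logging-default-1-0-0 \
           chk-clickhouse-keeper-logging-default-2-0-0; do
  echo "== $pod =="
  kubectl exec "$pod" -- clickhouse-keeper client -h localhost --port 2181 -q "get '/keeper/config'"
done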
I found no easy way to make ClickHouse Keeper reload the host configuration from disk, and no similar issues reported, but maybe I am missing something here?
What I tried:
- Configuring the cluster so that the keeper hosts keep the same addresses as before. I failed to find any solution that does not involve keeping and maintaining additional k8s objects, which is not desirable, since the topology has changed and we now have 3 services for 3 keeper replicas instead of one.
- Adding the new hosts using the ZooKeeper reconfig command. It does not seem possible to incrementally change /keeper/config to the state matching raft_configuration, because server ids cannot be reused.
- Starting ClickHouse Keeper with the --force-recovery flag for a while. With that flag ClickHouse Keeper loads the configuration from raft_configuration into /keeper/config; however, in my upgrade process that presumably damages the replication log in some cases and leads to replication failures shortly after the upgrade. I am still investigating why I am having issues with this approach. (Rough command sketches for the last two attempts are below.)
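For context, the commands behind the last two attempts looked roughly like this; the reconfig argument format and the config file path are assumptions based on the keeper-client docs and the default package layout, so double-check them for your setup:
# Attempt 2: dynamic reconfiguration; a host can only be added under a new id,
# so /keeper/config can never be made to match raft_configuration, which keeps ids 0/1/2
clickhouse-keeper client -h localhost --port 2181 \
  -q "reconfig add 'server.3=chk-clickhouse-keeper-logging-default-0-0:9444;participant;1'"
# Attempt 3: forced recovery; the keeper process rebuilds /keeper/config from raft_configuration
clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery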
I am in the process of upgrading too. My keeper installation name is clickhouse-keeper, and I see it created a service called keeper-clickhouse-keeper (along with individual services for each Keeper pod).
In my ClickHouseInstallation YAML I refer to the host as
zookeeper:
  nodes:
    - host: keeper-clickhouse-keeper
And it works. I tried creating a table and the definition was available on each server pod.
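As an extra sanity check (a generic query, not specific to this setup), listing the keeper root via system.zookeeper from any server pod confirms that the servers can actually talk to Keeper through that service:
# Run inside any clickhouse-server pod; returns the top-level znodes if the connection works
clickhouse-client -q "SELECT name FROM system.zookeeper WHERE path = '/'"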
I just wanted to share what I did; hopefully this is the correct way to access Keeper.
@Slach is it correct to do so?
@shahsiddharth08 Thank you for your reply. I use this setting in my ClickHouseInstallation too, and I also do not have connection problems between the ClickHouse servers and Keeper.
However, I do have a problem with the keeper replicas connecting to each other after the update. How many keeper replicas do you have in your installation?
However, I do have a problem with the keeper replicas connecting to each other after the update.
What do you mean by connecting to each other? Like ping?
I have 5 replicas.
I finally found a way to update my cluster with almost no read downtime and about 15 minutes of write downtime (in my case). The key was making sure there were no connections between the ClickHouse servers and Keeper during the update, and always using --force-recovery only on keeper leaders. In the update process I copied data from the old keeper to the new one instead of reattaching PVs to PVCs as recommended in the operator upgrade instructions: working with PVs was not allowed in my cluster, and copying the data guaranteed I always had an unspoiled backup of the replication log. In the end I manually patched the server configmaps with the new keeper address and restarted the servers, trading a short read downtime for reduced write downtime.
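In very rough outline the procedure was the following (a simplified sketch rather than the actual script; pod names, data paths and the exact order of safety steps are simplified here):
# 0. Make sure no clickhouse-server is talking to keeper (scale servers down or cut off the keeper port)
# 1. Copy the coordination data from the old keeper leader into the new chk pod
kubectl cp clickhouse-keeper-logging-0:/var/lib/clickhouse-keeper ./keeper-data
kubectl cp ./keeper-data chk-clickhouse-keeper-logging-default-0-0-0:/var/lib/clickhouse-keeper
# 2. Start only the leader with --force-recovery so /keeper/config is rebuilt from raft_configuration,
#    then start the remaining replicas normally and wait for the quorum to form
# 3. Point the servers at the new keeper address and restart them
kubectl edit configmap <server-config-configmap>   # replace the old keeper host with the new one
kubectl rollout restart statefulset <server-statefulset>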
I am attaching the script I used to update my cluster, in rather raw form, but I believe it might be useful to someone who runs into the same problem: update-script.zip
Meanwhile, I am still curious whether there was an easier way to do this.
@pekashy thank you so much for sharing your experience.