
Document backup/recovery procedures

Open eedgar opened this issue 2 years ago • 7 comments

Document how to back up the critical pieces so that you are able to restore a microceph instance back into a running cluster, or how to remove it and rejoin, etc.

eedgar avatar May 20 '23 13:05 eedgar

@eedgar Thanks for the suggestion on how to make the documentation better. We'll be triaging this soon.

pmatulis avatar May 22 '23 20:05 pmatulis

Document what to do in a 3-node cluster where one node's hardware has failed: what steps do you need to take, and is there any critical data that should have been backed up beforehand?

Also document what to do in the event of a disk failure. Is there a way to tell microceph the disk is gone? Do you follow the regular Ceph cleanup procedures?
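For what it's worth, MicroCeph has a disk-removal subcommand, and the usual Ceph cleanup applies as a fallback. A hedged sketch (the OSD id `1` is illustrative; check `microceph disk remove --help` on your version for the exact argument form):

```shell
# List disks/OSDs known to microceph
sudo microceph disk list

# Tell microceph a disk is gone (argument form varies by version;
# newer releases also accept --bypass-safety-checks for dead disks)
sudo microceph disk remove osd.1

# Fallback: the regular Ceph cleanup, if the OSD lingers in the CRUSH map
sudo microceph.ceph osd out osd.1
sudo microceph.ceph osd purge 1 --yes-i-really-mean-it
```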

eedgar avatar May 27 '23 02:05 eedgar

Why is this issue so old? This is important as I JUST had a node in a three node cluster fail and am unable to get the rebuilt node to join the cluster. Not being able to recover is a show-stopper.

john-terrell avatar May 14 '24 00:05 john-terrell

It would be fantastic to have documentation on the steps to recover after a node failure, especially in a 3-node cluster. How do you reinstall a node and re-join it to the cluster, and what (if anything) do you need to back up to aid recovery of a failed node?

DeepSkyWonder avatar May 18 '24 09:05 DeepSkyWonder

I guess a neat way to make this easy would be the ability to clean up lost node records. However, that change will only land on edge; for older deployments, a document describing the procedure should do it. I will take this up with priority.

UtkarshBhatthere avatar May 23 '24 15:05 UtkarshBhatthere

@john-terrell have you ever managed to rebuild the cluster? I have lost my node01 due to a hardware failure. Rebuilding the node is now proving to be a real hassle. Even after draining the node in Ceph and manually removing it from microceph, I still get the following error:

Error: failed to record mon db entries: failed to record mon host: This "config" entry already exists
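The stale entry this error complains about lives in microceph's internal (dqlite) database. Assuming your channel exposes the microcluster `cluster sql` subcommand, and with the caveat that the table and key names below are assumptions rather than documented API, you may be able to inspect and clear it:

```shell
# Assumption: "microceph cluster sql" is available on this channel.
# Inspect the config table for a leftover row from the failed node:
sudo microceph cluster sql "SELECT * FROM config"

# If a stale mon-host row for the old node shows up, remove it
# (snapshot /var/snap/microceph first), e.g.:
# sudo microceph cluster sql "DELETE FROM config WHERE key = '<stale-key>'"
```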

petwri avatar Nov 11 '24 20:11 petwri

I just had a need to recover a microceph node in a cluster using the reef/stable channel. All I needed was a backup of the /var/snap/microceph directory. Here is what I did after wiping and reinstalling Ubuntu on the node:

  1. Install the same revision of microceph: sudo snap install microceph --channel reef/stable
  2. Ensure you are using the same IP on the node as it previously used. I just copied over the same netplan config.
  3. Stop and disable microceph:
sudo snap stop microceph
sudo snap disable microceph
  4. Restore the old snap directory:
sudo mv /var/snap/microceph /var/snap/microceph.bak
sudo mv <backup-microceph-dir> /var/snap/microceph
  5. Enable and start microceph:
sudo snap enable microceph
sudo snap start microceph
  6. Check and manually enable/start any inactive services:
snap services
sudo systemctl enable --now snap.microceph.<service>
  7. Test a full restart of the snap:
sudo snap restart microceph
  8. Confirm functionality:
sudo microceph status
sudo microceph.ceph status
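The steps above can be collapsed into one sketch of a script (untested; `$1` standing for the path to your backup of /var/snap/microceph is my assumption, and I enable the snap before starting it, which is the safer ordering since a disabled snap cannot be started):

```shell
#!/usr/bin/env bash
# Sketch of the recovery steps above; run on the freshly reinstalled node.
# $1 = path to the backup copy of /var/snap/microceph (assumption).
set -euo pipefail

sudo snap install microceph --channel reef/stable   # same revision as before

sudo snap stop microceph
sudo snap disable microceph

sudo mv /var/snap/microceph /var/snap/microceph.bak
sudo mv "$1" /var/snap/microceph

sudo snap enable microceph
sudo snap start microceph

snap services microceph                             # check for inactive services
sudo snap restart microceph                         # full-restart test
sudo microceph status
sudo microceph.ceph status
```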

Some caveats:

  • The snap dir includes symlinks and socket files, so ensure your backup respects those special file types.
  • If you cannot assign the same IP as before, you can still recover the OSDs with the same steps. Only the mon won't work on a different IP address. This may be manually recoverable, but I didn't test.
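On the first caveat: an archive-mode copy (e.g. `cp -a` or `rsync -a`) preserves symlinks, while a naive copy can follow the link and duplicate its target. A minimal sketch using a scratch directory rather than the real /var/snap/microceph (file names are illustrative; socket files are typically recreated by the running services, but verify that for your setup):

```shell
set -e
# Scratch stand-in for /var/snap/microceph, containing a symlink
src=$(mktemp -d)
dst=$(mktemp -d)
echo data > "$src/conf"
ln -s conf "$src/current"       # symlink, as found under the snap dir

# cp -a (archive mode) preserves symlinks, permissions and timestamps
cp -a "$src/." "$dst/"

test -L "$dst/current" && echo "symlink preserved"
```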

Maybe someone from the microceph team can validate and confirm this process or point out any issues with it?

slapcat avatar Dec 18 '24 17:12 slapcat