microceph icon indicating copy to clipboard operation
microceph copied to clipboard

Expose complete node removal process via the microcluster API

Open masnax opened this issue 2 years ago • 4 comments

Currently, node removal is coupled with the CLI here with multiple API calls: https://github.com/canonical/microceph/blob/aae9f94dd349132283b08902b23aee27795e60fe/microceph/cmd/microceph/cluster_remove.go#L49

This means anyone using the MicroCeph API (MicroCloud) must re-implement the logic here, and maintain parity going forward, which is not ideal.

MicroCluster has hooks that can be used to automatically execute these necessary bits before/after a node is removed: https://github.com/canonical/microcluster/blob/e3b7b60a4d71577f5d0f5460f71dbb33dfcb5d9c/config/hooks.go#L25-L28

If the logic in removeNode is included in the remove hooks, then the only thing the MicroCeph CLI will need to do is call DeleteClusterMember which is currently here: https://github.com/canonical/microceph/blob/aae9f94dd349132283b08902b23aee27795e60fe/microceph/cmd/microceph/cluster_remove.go#L75

Then, anyone using the MicroCeph API would also be able to just remove the node and maintain all the necessary preparations for node removal.

masnax avatar Nov 08 '23 11:11 masnax

The hooks are currently included here, when the daemon starts: https://github.com/canonical/microceph/blob/aae9f94dd349132283b08902b23aee27795e60fe/microceph/cmd/microcephd/main.go#L64-L84

masnax avatar Nov 08 '23 12:11 masnax

Currently the removeNode() function takes a force flag that changes the semantics of that function

{force=false}

  1. checks safety constraints of node removal (preserving at least 3 OSDs, ensures no MONs are present)
  2. aborts node removal on errors

{force=true}

  1. skips safety constraints
  2. turns errors into warnings for logging but continues with node removal

How would those semantics be preserved in the node removal hook?

Happy to discuss in person if that works better for you

sabaini avatar Nov 09 '23 06:11 sabaini

Two ways I can think to do this: One is add your own endpoint for node removal (like DeleteClusterMember), which updates the PreRemove hook with the appropriate function based on a query parameter, and in between calls to that endpoint, the PreRemove hook can be set to a default function that will error out, so that if someone tries to use the endpoint corresponding to the built-in DeleteClusterMember.

The other approach is something I had spoken to @UtkarshBhatthere about how to propagate the user-specified subnet for bootstrapping ceph mons during initialization to the microcluster hooks. I think this would work in a similar way.

What you can do is wrap microcluster and the endpoint handlers in a struct so that you can manage the state with your own endpoints, which can then update the hooks.

We do this in MicroCloud right now with a Handler struct that's instantiated with the microcloudd daemon:https://github.com/canonical/microcloud/blob/7edfb6fef138d6b2b98ffcc0e6cdb71851d4a501/microcloud/cmd/microcloudd/main.go#L80-L83

The endpoint handlers can then all include the state: https://github.com/canonical/microcloud/blob/7edfb6fef138d6b2b98ffcc0e6cdb71851d4a501/microcloud/cmd/microcloudd/main.go#L133-L140 https://github.com/canonical/microcloud/blob/7edfb6fef138d6b2b98ffcc0e6cdb71851d4a501/microcloud/api/services.go

The hooks can then become methods on that struct so you can add an endpoint that updates the body of hooks like this: https://github.com/canonical/microcloud/blob/7edfb6fef138d6b2b98ffcc0e6cdb71851d4a501/microcloud/service/service_handler.go#L81 https://github.com/canonical/microcloud/blob/7edfb6fef138d6b2b98ffcc0e6cdb71851d4a501/microcloud/service/microcloud.go#L61

masnax avatar Nov 09 '23 07:11 masnax

@UtkarshBhatthere @sabaini There is now a force flag in upstream microcluster in both the PreRemove and PostRemove hooks, which use the same value passed to DeleteClusterMember. Hopefully that will unblock this.

masnax avatar Nov 29 '23 18:11 masnax