netdata-cloud [Feat]: ability to delete unseen, stale or live nodes (other than Offline)

Problem

As a user, I can only delete "offline" nodes from NC. I should be able to delete any node I want.

We need to split the problem into cases with node status as a key:

Online - We could either ban the node at the cloud level (not ideal) or instruct the agent to disconnect by dropping its cloud configuration. This is easy for directly connected nodes. The more complicated case is when the agent connects through a claimed parent, a set of parents, or a chain of more than one parent. We could disable streaming in such a case, I think. Banning only the cloud connection at the parent level would mean the parent still collects data from the node in question.
Stale - Same as above, but display a warning that the data for this node is going to be deleted too (we should instruct the parent(s) to do so, either by marking the data for removal and letting the garbage collector do its job or by enforcing the operation directly).
Unseen - Just let me remove it along with all the data this particular node managed to imprint on the cloud, mostly the DB entry and the MQTT credentials. I do not know if it is even possible to have an Unseen node connected through a parent, so I have no idea how to handle this case.
Offline - There is already an ability to remove the node.

Example: I had a group of 11 nodes streaming to my parent. I deleted these VMs since I no longer need them. However, I still see them in Netdata Cloud and am unable to delete them from NC.

Should I not be able to delete them? Unsure if this is a bug or a feature request.

These nodes are gone and never coming back, so I would like to remove them from NC. I guess the data for them might eventually fall away on my parent, and maybe then they would show as offline in NC and I could delete them. Unsure.

https://netdata-cloud.slack.com/archives/CS3PB0VJ7/p1671026396555759

Description

Cleaner infra view.
Control over the space without waiting X days for nodes to be marked as offline.
More freedom in testing things without fear of injecting ghost nodes or the same node more than once (changing the configuration, by accident or on purpose, might change the claimid)
Probably fewer ghost spaces - I imagine that a user who is just starting with NDC and testing its capabilities might create a new space just to clean up the view.
I believe some users were confused when they first tried NDC because they couldn't delete the nodes that were either set up incorrectly or already switched off. It could be a cause for dropping the offering entirely, especially when dealing with dynamic environments.

Importance

must have

Value proposition

let me keep my space clean

Proposed implementation

No response

Dec 15 '22 09:12 andrewm4894

This issue has been mentioned on the Netdata Community Forums. There might be relevant details there:

https://community.netdata.cloud/t/cant-delete-stale-nodes/3909/2

Mar 02 '23 10:03 netdata-community-bot

I found you can delete them if you delete the parent and re-install fresh on that parent machine. You have to remove the parent and all vnodes from the cloud dashboard and then when you re-claim the parent host it will set things up fresh.

Jun 16 '23 19:06 gdoermann

I tried to erase my historical data directly to see if that would clear it up, as a workaround until netdata makes an official way to do this. I opened up the list of stale nodes:

[screenshot: list of stale nodes]

and mouse-over'd the stale node to delete and copied a link like https://EXAMPLE.ORG/v2/spaces/DOMAINTLD/rooms/local/nodes/888586af-e5ab-47f2-8094-c4948fd1243a.

Then I extracted the UUID and deleted the folder that holds its data on my parent node:

systemctl stop netdata
cd /var/lib/netdata
rm -r 888586af-e5ab-47f2-8094-c4948fd1243a ... # deleting each of the folders
systemctl start netdata

After restarting, the charts are gone, but the node itself is still listed as "stale":

[screenshot: node still listed as stale]

So that wasn't enough.

I poked around some more and found this sqlite database:

root@monitor:~# sqlite3 /var/cache/netdata/netdata-meta.db
SQLite version 3.42.0 2023-05-16 12:36:15
Enter ".help" for usage hints.
sqlite> .headers on
sqlite> .tables
alert_hash          dimension           host                metadata_migration
chart               health_log          host_info           node_instance     
chart_label         health_log_detail   host_label        
sqlite> select * from host where hostname='host1.example.org';
host_id|hostname|registry_hostname|update_every|os|timezone|tags|hops|memory_mode|abbrev_timezone|utc_offset|program_name|program_version|entries|health_enabled
<binary>|host1.example.org|host1.example.org|15|linux|America/Toronto||1|5|EST|-18000|netdata|v1.33.1|0|1
<binary>|host1.example.org|host1.example.org|15|linux|Etc/UTC||1|5|EST|-18000|netdata|v1.42.1|0|1

Annoyingly, host_id (presumably the UUID) is stored in binary while the rest is stored as text, but I was able to remove the entry with:

sqlite> delete from host where hostname='host1.example.org' and program_version='v1.33.1';

After another

root@monitor:~# systemctl restart netdata

the stale node is now gone from my dashboard. :tada:

Unfortunately this is not very clean. I believe there are still entries in the host_label, host_info, and node_instance tables referencing the deleted host_id, but I don't know how to input binary data in the sqlite CLI and I don't feel like digging out python right now to do it, so the garbage is just going to sit around.
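
For the leftover rows, SQLite accepts blob literals of the form X'<hex>', so a binary host_id can be typed from the CLI without Python. What follows is a minimal sketch, assuming (as guessed above) that host_id is the raw 16-byte form of the UUID and that host_label, host_info and node_instance each have a host_id column. The helper names uuid_to_blob_literal and cleanup_sql are made up for illustration. Check the stored value first with hex(host_id), since it may not be the same UUID that appears in the dashboard URL.

```shell
#!/bin/sh
# Inspect the stored IDs as hex first (path taken from the session above):
#   sqlite3 /var/cache/netdata/netdata-meta.db \
#     "select hex(host_id), hostname, program_version from host;"

# Turn a textual UUID into an SQLite blob literal: X'<32 hex digits>'.
uuid_to_blob_literal() {
    # Strip the dashes and uppercase the hex digits.
    hex=$(printf '%s' "$1" | tr -d '-' | tr 'abcdef' 'ABCDEF')
    printf "X'%s'" "$hex"
}

# Print (not run) one DELETE per table that is ASSUMED to reference host_id.
cleanup_sql() {
    blob=$(uuid_to_blob_literal "$1")
    for table in host_label host_info node_instance; do
        printf 'DELETE FROM %s WHERE host_id = %s;\n' "$table" "$blob"
    done
}

# Example: review the statements before piping them into sqlite3.
#   cleanup_sql 888586af-e5ab-47f2-8094-c4948fd1243a
```

The functions only print SQL. Stopping netdata and backing up netdata-meta.db before piping the output into sqlite3 seems prudent.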

Nov 23 '23 05:11 kousu

I had an installation problem with a node and now it's marked as "Stale" and "delete is disabled". The node is dead and will never be coming back. How do I get rid of this thing? Is there really no way to delete this??

Jun 13 '24 13:06 luckman212

This issue has been mentioned on the Netdata Community Forums. There might be relevant details there:

https://community.netdata.cloud/t/impossible-to-delete-stale-node/5537/1

Jun 13 '24 13:06 netdata-community-bot

@netdata-community-bot funny. that's MY post.

Jun 13 '24 13:06 luckman212

+1 bumping this feature request. I'd hate to have to hack around in a database to be able to get rid of stale machines that landed there by accident, and waiting for the data to expire seems like an inelegant alternative.

edit: It turns out there is a way, but it's not GUI-friendly. Got this from https://community.netdata.cloud/t/impossible-to-delete-stale-node/5537/3

From app.netdata.cloud, navigate to your Node list

Next to the name of the Stale node, click on the little (i) symbol (View node information)

At the very bottom of the panel that opens to the right, you will see a "View node info in json" button - click it. You should see a message that says “JSON copied to clipboard”

Paste that into a text editor.

Grab the value of the id: {...} key. This should be a string in UUID format, e.g. 6e072590-a422-45b2-bdab-cdd3fb14ad68

Connect to your parent node via SSH

Execute the following command: netdatacli remove-stale-node {uuid} substituting {uuid} above with your real one

Jul 30 '24 02:07 darxtorm

@darxtorm Until they make this easier, here are steps I took recently to remove a stale node, which were kindly provided by @ilyam8. Worked for me.

Combination of GUI and CLI

From app.netdata.cloud, navigate to your Node list
Next to the name of the Stale node, click on the little (i) symbol (View node information)
At the very bottom of the panel that opens to the right, you will see a "View node info in json" button - click it. You should see a message that says “JSON copied to clipboard”
Paste that into a text editor.
Grab the value of the id: {...} key. This should be a string in UUID format, e.g. 6e072590-a422-45b2-bdab-cdd3fb14ad68
Connect to your parent node via SSH
Execute the following command:
```
netdatacli remove-stale-node {uuid}
```
substituting {uuid} above with your real one, obviously…
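
The copy-and-paste part of the steps above can be scripted. This is a minimal sketch, assuming the copied JSON was saved to a file (node.json is a hypothetical name) and that the first "id" key in it holds the node UUID, as the steps describe. extract_node_id is an illustrative helper, not part of netdatacli.

```shell
#!/bin/sh
# Print the first UUID-shaped value of an "id" key found on stdin.
extract_node_id() {
    grep -o '"id"[[:space:]]*:[[:space:]]*"[0-9a-fA-F-]\{36\}"' |
        head -n 1 |
        grep -o '[0-9a-fA-F-]\{36\}'
}

# On the parent node (node.json is an assumed file name):
#   uuid=$(extract_node_id < node.json)
#   [ -n "$uuid" ] && netdatacli remove-stale-node "$uuid"
```

The pattern match is deliberately crude. If the JSON contains several "id" keys, it is worth confirming that the first one is the node's.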

CLI-only

ssh to the PARENT node
run netdatacli aclk-state
locate the stale node's UUID
run netdatacli remove-stale-node {uuid}

Jul 30 '24 02:07 luckman212

> @darxtorm Until they make this easier, here are steps I took recently to remove a stale node, which were kindly provided by @ilyam8. Worked for me.

Absolutely, it's mildly clunky to say the least. Wanted to add that for cloud at least, after I had performed the above, I also had to go to Manage Space -> Nodes and perform a delete in there (the node was now showing as Offline rather than Stale, and the delete button was no longer disabled) to truly get rid of the ghost!

Jul 30 '24 02:07 darxtorm

I used netdatacli remove-stale-node on a bunch of stale nodes but it didn't have any effect — other than changing the Node ID in the netdatacli aclk-state from a UUID to null.

Is there something else I'm missing? Each time I'd run the command, it would say something like:

Unregistering node with machine guid 83fb052f-49ee-11ab-b00f-3e2f6b85cde4, hostname = dc413990ab4a

(We had a bunch of test containers spin up and they all "registered" with our (on-prem) Netdata instance and now I can't figure out how to remove them...)

Jul 31 '24 02:07 eddyg

@eddyg Restarting the parent node should make them disappear from the UI.

Aug 19 '24 09:08 Redominus

@stelfrag see https://github.com/netdata/netdata-cloud/issues/690#issuecomment-2259543333, is it expected that a restart is required?

Aug 19 '24 09:08 ilyam8

@sashwathn hey, I think we need to allow removing stale nodes from the UI. It will simplify users' lives tremendously.

Aug 19 '24 09:08 ilyam8

> @eddyg Restarting the parent node should make them disappear from the UI.

Fixed in https://github.com/netdata/netdata/pull/18381

Aug 20 '24 08:08 ilyam8

Thanks for following up on this, Ilya!

Aug 20 '24 10:08 eddyg

Removing stale nodes:

by hostname
all stale nodes at once (when using the ALL_NODES keyword)

added in https://github.com/netdata/netdata/pull/18386
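
With those two additions, cleaning up a batch of throwaway nodes should no longer need one UUID lookup per node. This is a minimal sketch, assuming the hostname and the ALL_NODES keyword are passed in the same position as the UUID. The thread does not show the exact syntax, so check the linked PR. remove_stale_nodes and DRY_RUN are hypothetical names.

```shell
#!/bin/sh
# Run remove-stale-node once per target (UUID, hostname or ALL_NODES).
# With DRY_RUN=1 the commands are printed instead of executed.
remove_stale_nodes() {
    for target in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf 'netdatacli remove-stale-node %s\n' "$target"
        else
            netdatacli remove-stale-node "$target"
        fi
    done
}

# Examples, run on the parent node:
#   DRY_RUN=1 remove_stale_nodes dc413990ab4a 83fb052f-49ee-11ab-b00f-3e2f6b85cde4
#   remove_stale_nodes ALL_NODES
```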

Aug 21 '24 08:08 ilyam8