
Nodes with tfchain error don't update

Open scottyeager opened this issue 1 year ago • 2 comments

I noticed that I can't reach some mainnet nodes over RMB.

RMBError: 104 invalid envelope signature: sr25519 signature verification failed

Here's an example from the dashboard, when attempting to deploy a VM on node 1479:

image

Same result using the RMB proxy:

image

Here's a non-exhaustive list of affected node IDs on mainnet:

1087
1226
1479
1640
1723
1926
1966
2158
2723
4349

scottyeager Aug 07 '24 22:08

Are you sure those nodes are updated? Can you please check their versions if possible?

rawdaGastan Aug 19 '24 09:08

I have reviewed the logs for all nodes in my list above. It seems they all have some issue that's preventing them from updating.

What's common in the logs of all nodes is this line:

[+] identityd: error failed to get flist info error="failed to get flist (tf-zos/zos:production-3:latest.flist) info: 404 Not Found"

Most of the nodes also have an error about a read-only cache and a resulting boltdb failure. For example:

[+] provisiond: fatal exiting error="error running integrity checks: unlinkat /var/cache/modules/provisiond/metrics-diff.bolt: read-only file system"

1087 1226 1479 1640 1723 2158 2723 4349

A couple don't have the read-only cache error but instead have an error regarding tfchain, like this:

[+] noded:  error failed to decode events from tfchain error="unable to find field Balances_Locked for event #62 with EventID [20 17]"

1926 1966
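
A rough way to bucket log lines like the ones quoted above is plain substring matching. This is only a sketch: the `classify_log_line` helper and the category names are made up for illustration and are not part of zos. The matched substrings are taken from the log excerpts in this comment.

```python
# Hypothetical helper, not part of zos: buckets a log line by the error
# signatures quoted in this comment.
SIGNATURES = {
    "flist_404": "failed to get flist",
    "read_only_cache": "read-only file system",
    "tfchain_decode": "failed to decode events from tfchain",
}


def classify_log_line(line: str) -> list[str]:
    """Return the names of all signatures found in the line (may be empty)."""
    return [name for name, needle in SIGNATURES.items() if needle in line]


sample = (
    "[+] noded:  error failed to decode events from tfchain "
    'error="unable to find field Balances_Locked for event #62 with EventID [20 17]"'
)
print(classify_log_line(sample))
```

Running each node's log through something like this would show which nodes hit only the flist 404 and which also hit the cache or tfchain errors.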

Checking now, I see that there's a fix for nodes with a read-only cache not getting the latest version.

But what about those last two nodes? They are not reporting a read-only cache, but they seem to behave similarly in not accepting the latest version.

scottyeager Aug 21 '24 22:08

Here's a new list of nodes that appear to be stuck without an update. I didn't check all node IDs listed above to see if they're still affected. I just noticed that these nodes have an RMB relay of relay.grid.tf, while updated nodes are using redundant relays, relay.02.grid.tf_relay.grid.tf.

{'nodeID': 943, 'twinID': 2086} {'twinID': 2086, 'relay': 'relay.grid.tf'}
{'nodeID': 1640, 'twinID': 2891} {'twinID': 2891, 'relay': 'relay.grid.tf'}
{'nodeID': 2158, 'twinID': 3696} {'twinID': 3696, 'relay': 'relay.grid.tf'}
{'nodeID': 1966, 'twinID': 3421} {'twinID': 3421, 'relay': 'relay.grid.tf'}
{'nodeID': 1226, 'twinID': 2369} {'twinID': 2369, 'relay': 'relay.grid.tf'}
{'nodeID': 1479, 'twinID': 2634} {'twinID': 2634, 'relay': 'relay.grid.tf'}
{'nodeID': 6030, 'twinID': 10039} {'twinID': 10039, 'relay': 'relay.grid.tf'}
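
A minimal sketch of that kind of filter, assuming multiple relays are joined with `_` in the twin's relay field, as in `relay.02.grid.tf_relay.grid.tf`. The `has_single_relay` name is hypothetical, and the second sample row is a made-up twin standing in for an updated node.

```python
# Illustrative only: flag twins whose relay field lists a single relay.
# Assumes multiple relays are joined with "_", as in the value quoted above.
def has_single_relay(relay: str) -> bool:
    return len([r for r in relay.split("_") if r]) == 1


twins = [
    {"twinID": 2086, "relay": "relay.grid.tf"},
    # Made-up twin ID, standing in for a node that did update.
    {"twinID": 9999, "relay": "relay.02.grid.tf_relay.grid.tf"},
]

stuck = [t["twinID"] for t in twins if has_single_relay(t["relay"])]
print(stuck)
```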

One new behavior I noticed while checking logs for node 943 is that it's 404ing on what I guess is an old Flist Hub link for a Zos update:

[+] identityd: 2024-12-31T18:29:04Z error failed to get flist info error="failed to get flist (tf-zos/zos:production-3:latest.flist) info: 404 Not Found"

scottyeager Dec 31 '24 18:12

@scottyeager still reproducible?

rawdaGastan Jan 19 '25 10:01