roachtest: schemachange/leasing-benchmark failed [azure; n2 failed to start due to connection refused error]
roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 16d41751607b92234351c1ab27053c3875a4f2b7:
(test_runner.go:1237).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1
Parameters:
- ROACHTEST_arch=arm64
- ROACHTEST_cloud=azure
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=4
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
Jira issue: CRDB-38620
It appears that n2 never started up, due to connectivity issues in the cluster:
W240510 13:59:49.014602 15 gossip/client.go:121 ⋮ [T1,Vsystem,n2] 48 failed to start gossip client to ‹40.76.187.244:26257›: initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.2.0.10:26257: connect: connection refused"› [code 2/Unknown]
E240510 13:59:49.014641 16 2@rpc/peer.go:598 ⋮ [T1,Vsystem,n2,rnode=?,raddr=‹40.76.187.244:26257›,class=system,rpc] 49 failed connection attempt‹ (last connected 0s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.2.0.10:26257: connect: connection refused"› [code 2/Unknown]
E240510 13:59:50.010528 188 2@rpc/peer.go:598 ⋮ [T1,Vsystem,n2,rnode=?,raddr=‹40.76.187.244:26257›,class=system,rpc] 50 failed connection attempt‹ (last connected 996ms ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.2.0.10:26257: connect: connection refused"› [code 2/Unknown]
I240510 13:59:51.877241 273 kv/kvserver/liveness/liveness.go:648 ⋮ [T1,Vsystem,n2,liveness-hb] 51 unable to get liveness record from KV: unable to get liveness: aborted in DistSender: result is ambiguous: context deadline exceeded
I240510 13:59:52.875722 339 gossip/client.go:127 ⋮ [T1,Vsystem,n2] 52 started gossip client to n0 (‹40.76.187.244:26257›)
I240510 13:59:52.890874 143 1@server/server.go:1791 ⋮ [T1,Vsystem,n2] 53 node connected via gossip
I240510 13:59:52.891410 90 kv/kvserver/stores.go:283 ⋮ [T1,Vsystem,n2] 54 wrote 1 node addresses to persistent storage
I240510 13:59:52.891555 339 gossip/client.go:136 ⋮ [T1,Vsystem,n2] 55 closing client to n1 (‹40.76.187.244:26257›): recv msg error: grpc: ‹duplicate connection from node at 10.2.0.10:26257› [code 2/Unknown]
E240510 13:59:53.162512 315 2@rpc/peer.go:577 ⋮ [T1,Vsystem,n2,rnode=?,raddr=‹40.76.187.244:26257›,class=system,rpc] 56 disconnected (was healthy for 1.016s): grpc: ‹initial connection heartbeat failed: grpc: client requested node ID 2 doesn't match server node ID 3 [code 2/Unknown]› [code 2/Unknown]
I240510 13:59:54.878328 273 kv/kvserver/liveness/liveness.go:648 ⋮ [T1,Vsystem,n2,liveness-hb] 57 unable to get liveness record from KV: unable to get liveness: aborted in DistSender: result is ambiguous: context deadline exceeded
I'll move this to TestEng, in case this is something worth investigating in the new Azure infra. Otherwise, feel free to close this as a non-actionable flake.
cc @cockroachdb/test-eng
roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 4c2e7761acd050aaee565443932b6b0eca55620b:
(test_runner.go:1237).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1
Parameters:
- ROACHTEST_arch=arm64
- ROACHTEST_cloud=azure
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=4
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 4cc0bfcc14771331fea57de01e1ea78b07393f3d:
(test_runner.go:1237).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1
Parameters:
- ROACHTEST_arch=arm64
- ROACHTEST_cloud=azure
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=4
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 6300c3c3367ad46ac48bf24915cf0d73cae446a0:
(test_runner.go:1243).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1
Parameters:
- ROACHTEST_arch=arm64
- ROACHTEST_cloud=azure
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=4
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
roachtest.schemachange/leasing-benchmark failed with artifacts on master @ d146ecff6f687e438706cf63591cafca60cc116d:
(test_runner.go:1253).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1
Parameters:
- ROACHTEST_arch=arm64
- ROACHTEST_cloud=azure
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=4
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
Same failure on other branches
- #124463 roachtest: schemachange/leasing-benchmark failed [C-test-failure O-roachtest O-robot T-sql-foundations branch-release-24.1.0-rc release-blocker]
roachtest.schemachange/leasing-benchmark failed with artifacts on master @ c580e634736b2d2b6da544eecf16664d3caca740:
(test_runner.go:1255).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1
Parameters:
- ROACHTEST_arch=arm64
- ROACHTEST_cloud=azure
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=4
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
Looks like this is failing every time, but is usually skipped because Azure doesn't have enough capacity. Seeing this quite often for westus2.
compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="SkuNotAvailable" Message="The requested VM size for resource 'Following SKUs have failed for Capacity Restrictions: Standard_D4ds_v5' is currently not available in location 'westus2'. Please try another size or deploy to a different location or different zone. See https://aka.ms/azureskunotavailable for details." Target="vmSize"
It looks like the actual issue, though, is that roachprod doesn't support geo-distributed clusters on Azure yet. I tried adding support but ran into further issues with how we handle network peering that seemed non-trivial to fix. I think I'll put out a PR to:
- Disable this test on Azure.
- Switch the default location from westus2 to westus3.
- Make an issue to support geo zones for Azure.
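
For anyone triaging similar runs, here is a rough sketch of how the capacity error quoted above could be told apart from a real test failure. It is not roachprod code: `capacityErr`, `parseCapacityErr`, and both regular expressions are hypothetical names made up for illustration, and the sample message is an abbreviated copy of the error text in this thread.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// capacityErr holds the fields pulled out of an Azure SkuNotAvailable
// message. Hypothetical type, for illustration only.
type capacityErr struct {
	SKU      string
	Location string
}

var (
	// Both patterns are assumptions based on the single message quoted in
	// this issue; Azure may word other capacity errors differently.
	skuRe = regexp.MustCompile(`Capacity Restrictions: ([A-Za-z0-9_]+)`)
	locRe = regexp.MustCompile(`not available in location '([a-z0-9]+)'`)
)

// parseCapacityErr reports whether msg carries the SkuNotAvailable code and,
// if so, returns whatever SKU and location it could extract.
func parseCapacityErr(msg string) (capacityErr, bool) {
	if !strings.Contains(msg, `Code="SkuNotAvailable"`) {
		return capacityErr{}, false
	}
	var ce capacityErr
	if m := skuRe.FindStringSubmatch(msg); m != nil {
		ce.SKU = m[1]
	}
	if m := locRe.FindStringSubmatch(msg); m != nil {
		ce.Location = m[1]
	}
	return ce, true
}

func main() {
	// Abbreviated from the error quoted in this issue.
	msg := `Code="SkuNotAvailable" Message="The requested VM size for resource ` +
		`'Following SKUs have failed for Capacity Restrictions: Standard_D4ds_v5' ` +
		`is currently not available in location 'westus2'."`

	ce, ok := parseCapacityErr(msg)
	fmt.Println(ok, ce.SKU, ce.Location) // prints: true Standard_D4ds_v5 westus2
}
```

A check like this would only separate "cluster could not be created for capacity reasons" from "test ran and timed out"; it does nothing for the geo-distribution gap described above.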