roachtest: schemachange/leasing-benchmark failed [azure; n2 failed to start due to connection refused error]

Open cockroach-teamcity opened this issue 1 year ago • 5 comments

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 16d41751607b92234351c1ab27053c3875a4f2b7:

(test_runner.go:1237).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-38620

cockroach-teamcity avatar May 10 '24 16:05 cockroach-teamcity

It appears that n2 never started up, due to connectivity issues in the cluster:

W240510 13:59:49.014602 15 gossip/client.go:121 ⋮ [T1,Vsystem,n2] 48  failed to start gossip client to ‹40.76.187.244:26257›: initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.2.0.10:26257: connect: connection refused"› [code 2/Unknown]
E240510 13:59:49.014641 16 2@rpc/peer.go:598 ⋮ [T1,Vsystem,n2,rnode=?,raddr=‹40.76.187.244:26257›,class=system,rpc] 49  failed connection attempt‹ (last connected 0s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.2.0.10:26257: connect: connection refused"› [code 2/Unknown]
E240510 13:59:50.010528 188 2@rpc/peer.go:598 ⋮ [T1,Vsystem,n2,rnode=?,raddr=‹40.76.187.244:26257›,class=system,rpc] 50  failed connection attempt‹ (last connected 996ms ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.2.0.10:26257: connect: connection refused"› [code 2/Unknown]
I240510 13:59:51.877241 273 kv/kvserver/liveness/liveness.go:648 ⋮ [T1,Vsystem,n2,liveness-hb] 51  unable to get liveness record from KV: unable to get liveness: aborted in DistSender: result is ambiguous: context deadline exceeded
I240510 13:59:52.875722 339 gossip/client.go:127 ⋮ [T1,Vsystem,n2] 52  started gossip client to n0 (‹40.76.187.244:26257›)
I240510 13:59:52.890874 143 1@server/server.go:1791 ⋮ [T1,Vsystem,n2] 53  node connected via gossip
I240510 13:59:52.891410 90 kv/kvserver/stores.go:283 ⋮ [T1,Vsystem,n2] 54  wrote 1 node addresses to persistent storage
I240510 13:59:52.891555 339 gossip/client.go:136 ⋮ [T1,Vsystem,n2] 55  closing client to n1 (‹40.76.187.244:26257›): recv msg error: grpc: ‹duplicate connection from node at 10.2.0.10:26257› [code 2/Unknown]
E240510 13:59:53.162512 315 2@rpc/peer.go:577 ⋮ [T1,Vsystem,n2,rnode=?,raddr=‹40.76.187.244:26257›,class=system,rpc] 56  disconnected (was healthy for 1.016s): grpc: ‹initial connection heartbeat failed: grpc: client requested node ID 2 doesn't match server node ID 3 [code 2/Unknown]› [code 2/Unknown]
I240510 13:59:54.878328 273 kv/kvserver/liveness/liveness.go:648 ⋮ [T1,Vsystem,n2,liveness-hb] 57  unable to get liveness record from KV: unable to get liveness: aborted in DistSender: result is ambiguous: context deadline exceeded

I'll move this to TestEng, in case this is something worth investigating in the new Azure infra. Otherwise, feel free to close this as a non-actionable flake.
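
For triage, the symptoms in the log excerpt above can be tallied with a small script. The following is a minimal sketch, not part of roachtest or cockroach: `classifyLogLine` and `summarize` are hypothetical helpers, and the substrings they match are copied from the n2 log lines quoted in this comment.

```go
package main

import (
	"fmt"
	"strings"
)

// categories lists the symptom labels in a fixed order so output is stable.
var categories = []string{
	"connection-refused",
	"node-id-mismatch",
	"duplicate-connection",
	"liveness-unavailable",
	"other",
}

// classifyLogLine is a hypothetical helper that maps one log line to a
// symptom label, using substrings seen in the n2 logs quoted above.
func classifyLogLine(line string) string {
	switch {
	case strings.Contains(line, "connect: connection refused"):
		return "connection-refused"
	case strings.Contains(line, "doesn't match server node ID"):
		return "node-id-mismatch"
	case strings.Contains(line, "duplicate connection from node"):
		return "duplicate-connection"
	case strings.Contains(line, "unable to get liveness record"):
		return "liveness-unavailable"
	default:
		return "other"
	}
}

// summarize counts how many lines fall into each symptom label.
func summarize(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, l := range lines {
		counts[classifyLogLine(l)]++
	}
	return counts
}

func main() {
	// Shortened copies of log lines from this issue.
	lines := []string{
		`failed to start gossip client: dial tcp 10.2.0.10:26257: connect: connection refused`,
		`client requested node ID 2 doesn't match server node ID 3`,
		`unable to get liveness record from KV: unable to get liveness`,
	}
	counts := summarize(lines)
	for _, c := range categories {
		fmt.Printf("%s: %d\n", c, counts[c])
	}
}
```

A tally like this only groups the symptoms; it does not by itself explain why the address resolved to a node with a different ID.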

rafiss avatar May 10 '24 16:05 rafiss

cc @cockroachdb/test-eng

blathers-crl[bot] avatar May 10 '24 16:05 blathers-crl[bot]

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 4c2e7761acd050aaee565443932b6b0eca55620b:

(test_runner.go:1237).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

This test on roachdash | Improve this report!

cockroach-teamcity avatar May 12 '24 14:05 cockroach-teamcity

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 4cc0bfcc14771331fea57de01e1ea78b07393f3d:

(test_runner.go:1237).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

This test on roachdash | Improve this report!

cockroach-teamcity avatar May 13 '24 13:05 cockroach-teamcity

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 6300c3c3367ad46ac48bf24915cf0d73cae446a0:

(test_runner.go:1243).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

This test on roachdash | Improve this report!

cockroach-teamcity avatar May 15 '24 15:05 cockroach-teamcity

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ d146ecff6f687e438706cf63591cafca60cc116d:

(test_runner.go:1253).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

Same failure on other branches

  • #124463 roachtest: schemachange/leasing-benchmark failed [C-test-failure O-roachtest O-robot T-sql-foundations branch-release-24.1.0-rc release-blocker]

This test on roachdash | Improve this report!

cockroach-teamcity avatar May 21 '24 14:05 cockroach-teamcity

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ c580e634736b2d2b6da544eecf16664d3caca740:

(test_runner.go:1255).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

This test on roachdash | Improve this report!

cockroach-teamcity avatar May 23 '24 14:05 cockroach-teamcity

Looks like this fails every time it runs, but it is usually skipped because Azure doesn't have enough capacity. I'm seeing this error quite often for westus2:

compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="SkuNotAvailable" Message="The requested VM size for resource 'Following SKUs have failed for Capacity Restrictions: Standard_D4ds_v5' is currently not available in location 'westus2'. Please try another size or deploy to a different location or different zone. See https://aka.ms/azureskunotavailable for details." Target="vmSize"

Looks like the actual issue, though, is that roachprod doesn't support geo-distributed clusters for Azure yet. I tried adding support but ran into further issues with how we handle network peering that seemed non-trivial to fix. I think I'll put out a PR to:

  1. Disable this test on Azure.
  2. Switch the default location from westus2 to westus3.
  3. Make an issue to support geo zones for Azure.
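
The first two steps of the plan above could look roughly like the sketch below. This is an illustration only, not roachtest's or roachprod's actual API: `testSpec`, `geoUnsupported`, `skipReason`, and `defaultAzureLocation` are hypothetical names, and treating this test as geo-distributed is an assumption based on the diagnosis in this comment.

```go
package main

import "fmt"

// defaultAzureLocation reflects the proposed switch from westus2 to westus3.
// The constant name is hypothetical.
const defaultAzureLocation = "westus3"

// testSpec is a hypothetical, stripped-down description of a test.
type testSpec struct {
	Name           string
	GeoDistributed bool // true if the test needs nodes in multiple regions
}

// geoUnsupported lists clouds where geo-distributed clusters cannot be
// created yet. Only Azure is listed because that is all this issue states.
var geoUnsupported = map[string]bool{"azure": true}

// skipReason returns a non-empty explanation when the test should be
// skipped on the given cloud, and an empty string when it can run.
func skipReason(spec testSpec, cloud string) string {
	if spec.GeoDistributed && geoUnsupported[cloud] {
		return fmt.Sprintf("%s needs a geo-distributed cluster, which is not supported on %s", spec.Name, cloud)
	}
	return ""
}

func main() {
	spec := testSpec{Name: "schemachange/leasing-benchmark", GeoDistributed: true}
	if reason := skipReason(spec, "azure"); reason != "" {
		fmt.Println("skip:", reason)
	}
	fmt.Println("default Azure location:", defaultAzureLocation)
}
```

Skipping with an explicit reason would keep the test visible in reports instead of letting it time out after two hours on every Azure run.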

DarrylWong avatar May 23 '24 15:05 DarrylWong