[flaky] network.TestDialTLSWithBackoffSuccess
=== RUN TestDialTLSWithBackoffSuccess
[247](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A247)
2022/07/27 13:44:46 http: TLS handshake error from 127.0.0.1:38562: EOF
[248](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A248)
2022/07/27 13:44:46 http: TLS handshake error from 127.0.0.1:38564: EOF
[249](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A249)
2022/07/27 13:44:47 http: TLS handshake error from 127.0.0.1:38566: EOF
[250](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A250)
transports_test.go:183: Dial error = timed out dialing 127.0.0.1:42343 after 0.20s
[251](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A251)
2022/07/27 13:44:47 http: TLS handshake error from 127.0.0.1:38568: write tcp 127.0.0.1:42343->127.0.0.1:38568: use of closed network connection
[252](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A252)
--- FAIL: TestDialTLSWithBackoffSuccess (0.22s)
This is still flaking pretty heavily
/assign @cardil
Ran into this one again this morning on #2608
Seems like there's a race happening somewhere. Running the test individually doesn't seem to cause a problem (N~=50):
$ go test ./network -count 1 -run=TestDialTLSWithBackoffSuccess -race -v
=== RUN TestDialTLSWithBackoffSuccess
--- PASS: TestDialTLSWithBackoffSuccess (0.01s)
PASS
ok knative.dev/pkg/network 0.038s
And running the full suite without the -race flag passes more often than not:
$ go test ./network/... -v
...
=== RUN TestDialTLSWithBackoffSuccess
--- PASS: TestDialTLSWithBackoffSuccess (0.01s)
...
Add the -race flag and then we start to see the failures:
$ go test ./network/ -count 1 -race
ok knative.dev/pkg/network 0.733s
$ go test ./network/ -count 1 -race
2022/10/10 11:49:08 http: TLS handshake error from 127.0.0.1:46046: EOF
2022/10/10 11:49:08 http: TLS handshake error from 127.0.0.1:46058: read tcp 127.0.0.1:41349->127.0.0.1:46058: read: connection reset by peer
2022/10/10 11:49:08 http: TLS handshake error from 127.0.0.1:46070: EOF
2022/10/10 11:49:08 http: TLS handshake error from 127.0.0.1:46072: write tcp 127.0.0.1:41349->127.0.0.1:46072: use of closed network connection
--- FAIL: TestDialTLSWithBackoffSuccess (0.20s)
transports_test.go:183: Dial error = timed out dialing 127.0.0.1:41349 after 0.20s
FAIL
FAIL knative.dev/pkg/network 0.872s
FAIL
This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.
/lifecycle frozen
When the DialWithDialer is switched to just a non tls DialContext it does not fail https://github.com/knative/pkg/blob/main/network/transports.go#L99.
Before:
pkg◗ go test ./network/ -count 9 -race -failfast -parallel 12
2023/03/02 17:25:10 http: TLS handshake error from 127.0.0.1:45372: EOF
2023/03/02 17:25:10 http: TLS handshake error from 127.0.0.1:45380: read tcp 127.0.0.1:35929->127.0.0.1:45380: read: connection reset by peer
2023/03/02 17:25:10 http: TLS handshake error from 127.0.0.1:45392: EOF
2023/03/02 17:25:10 http: TLS handshake error from 127.0.0.1:45396: EOF
--- FAIL: TestDialTLSWithBackoffSuccess (0.29s)
transports_test.go:183: Dial error = timed out dialing 127.0.0.1:35929 after 0.29s
FAIL
FAIL knative.dev/pkg/network 0.586s
FAIL
After:
pkg◗ go test ./network/ -count 9 -race -failfast -parallel 12
ok knative.dev/pkg/network 4.982s
Note: I didn't see this issue while I was running Zoom, which has a high CPU usage. So, it seems this issue only happens when the machine has a lot of CPU resources. The issue might be because in the test itself we don't gracefully close the context.