pkg icon indicating copy to clipboard operation
pkg copied to clipboard

[flaky] network.TestDialTLSWithBackoffSuccess

Open dprotaso opened this issue 3 years ago • 2 comments

=== RUN   TestDialTLSWithBackoffSuccess
[247](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A247)
2022/07/27 13:44:46 http: TLS handshake error from 127.0.0.1:38562: EOF
[248](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A248)
2022/07/27 13:44:46 http: TLS handshake error from 127.0.0.1:38564: EOF
[249](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A249)
2022/07/27 13:44:47 http: TLS handshake error from 127.0.0.1:38566: EOF
[250](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A250)
    transports_test.go:183: Dial error = timed out dialing 127.0.0.1:42343 after 0.20s
[251](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A251)
2022/07/27 13:44:47 http: TLS handshake error from 127.0.0.1:38568: write tcp 127.0.0.1:42343->127.0.0.1:38568: use of closed network connection
[252](https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2362/unit-tests_pkg_main/1552288716540612608#1:build-log.txt%3A252)
--- FAIL: TestDialTLSWithBackoffSuccess (0.22s)

dprotaso avatar Jul 27 '22 14:07 dprotaso

This is still flaking pretty heavily

/assign @cardil

dprotaso avatar Aug 15 '22 21:08 dprotaso

Ran into this one again this morning on #2608

Seems like there's a race happening somewhere. Running the test individually doesn't seem to cause a problem (N~=50):

$ go test ./network -count 1 -run=TestDialTLSWithBackoffSuccess -race -v
=== RUN   TestDialTLSWithBackoffSuccess
--- PASS: TestDialTLSWithBackoffSuccess (0.01s)
PASS
ok      knative.dev/pkg/network 0.038s

And running the full suite without the -race flag passes more often than not:

$ go test ./network/... -v
...
=== RUN   TestDialTLSWithBackoffSuccess
--- PASS: TestDialTLSWithBackoffSuccess (0.01s)
...

Add the -race flag and then we start to see the failures:

$ go test ./network/ -count 1 -race 
ok      knative.dev/pkg/network 0.733s

$ go test ./network/ -count 1 -race 
2022/10/10 11:49:08 http: TLS handshake error from 127.0.0.1:46046: EOF
2022/10/10 11:49:08 http: TLS handshake error from 127.0.0.1:46058: read tcp 127.0.0.1:41349->127.0.0.1:46058: read: connection reset by peer
2022/10/10 11:49:08 http: TLS handshake error from 127.0.0.1:46070: EOF
2022/10/10 11:49:08 http: TLS handshake error from 127.0.0.1:46072: write tcp 127.0.0.1:41349->127.0.0.1:46072: use of closed network connection
--- FAIL: TestDialTLSWithBackoffSuccess (0.20s)
    transports_test.go:183: Dial error = timed out dialing 127.0.0.1:41349 after 0.20s
FAIL
FAIL    knative.dev/pkg/network 0.872s
FAIL

psschwei avatar Oct 10 '22 15:10 psschwei

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jan 09 '23 01:01 github-actions[bot]

/lifecycle frozen

dprotaso avatar Feb 14 '23 16:02 dprotaso

When the DialWithDialer is switched to just a non tls DialContext it does not fail https://github.com/knative/pkg/blob/main/network/transports.go#L99.

Before:

pkg◗ go test ./network/ -count 9 -race -failfast -parallel 12
2023/03/02 17:25:10 http: TLS handshake error from 127.0.0.1:45372: EOF
2023/03/02 17:25:10 http: TLS handshake error from 127.0.0.1:45380: read tcp 127.0.0.1:35929->127.0.0.1:45380: read: connection reset by peer
2023/03/02 17:25:10 http: TLS handshake error from 127.0.0.1:45392: EOF
2023/03/02 17:25:10 http: TLS handshake error from 127.0.0.1:45396: EOF
--- FAIL: TestDialTLSWithBackoffSuccess (0.29s)
    transports_test.go:183: Dial error = timed out dialing 127.0.0.1:35929 after 0.29s
FAIL
FAIL	knative.dev/pkg/network	0.586s
FAIL

After:

pkg◗ go test ./network/ -count 9 -race -failfast -parallel 12
ok  	knative.dev/pkg/network	4.982s

Note: I didn't see this issue while I was running Zoom, which has a high CPU usage. So, it seems this issue only happens when the machine has a lot of CPU resources. The issue might be because in the test itself we don't gracefully close the context.

krsna-m avatar Mar 10 '23 16:03 krsna-m