Tracking issue: intermittent test failures
It's the nature of I/O libraries like Trio that their test suites are prone to weird intermittent failures. But they're often hard to track down, and usually the way you encounter them is that you're trying to land some other unrelated feature and the CI randomly fails, so the temptation is to click "re-run build" and worry about it later.
This temptation must be resisted. If left unchecked, you eventually end up with tests that fail all the time for unknown reasons and no-one trusts them and it's this constant drag on development. Flaky tests must be eradicated.
But to make things extra fun, there's another problem: CI services genuinely are a bit flaky, so when you see a weird failure or lock-up in the tests then it's often unclear whether this is a bug in our code, or just some cloud provider having indigestion. And you don't want to waste hours trying to reproduce indigestion. Which means we need to compare notes across multiple failures. Which is tricky when I see one failure, and you see another, and neither of us realizes that we're seeing the same thing. Hence: this issue.
What to do if you see a weird test failure that makes no sense:

- Visit this bug; it's #200, so it's hopefully easy to remember.
- Check to see if anyone else has reported the same failure.
- Either way, add a note recording what you saw. Make sure to link to the failed test log.
- Special notes for specific CI services:
  - If it's a failed Azure Pipelines run, then go to the build page, click the "..." menu icon in the upper right, and select "Retain build". (Otherwise the logs will be deleted after 30 days or something like that.)
  - If it's a failed Travis CI run, DO NOT CLICK THE "RESTART BUILD" OR "RESTART JOB" BUTTON! That will wipe out the log and replace it with your new run, so we lose the information about what failed. Instead, close and then re-open the PR; this will tickle Travis into re-testing your commit, but in a way that gives the new build a new URL, so that the old log remains accessible.
Issues we're monitoring currently
- flakiness in test_open_tcp_listeners_backlog: https://github.com/python-trio/trio/issues/200#issuecomment-451885495 (last seen: all the time, should be fixed by #1601)
- #1277 (last seen: October 2019)
- segfault in pypy 3.6 nightly after faulthandler timeout fired: https://github.com/python-trio/trio/issues/200#issuecomment-424674288 (last seen: late 2018)
- #851 (last seen: January 2019)
Regarding the weird pypy nightly freeze in test_local: I downloaded the pypy-c-jit-91601-609a3cdf9cf7-linux64 nightly, and have let it loop running the trio test suite for the last few hours on my laptop, and I haven't been able to reproduce the problem so far. (Though I did get two failures in test_ki.py, both on this line, which is getting "did not raise KeyboardInterrupt". The test sessions otherwise finished normally.)
#119 is now fixed, I think / hope. Demoting it to "on the radar".
Got fed up and fixed #140 :-)
#140 came back. This makes no sense at all.
Freeze in test_local.py::test_run_local_simultaneous_runs on pypy3 5.8 – maybe just travis being flaky, maybe something more.
Here's a weird one: https://travis-ci.org/python-trio/trio/jobs/298164123
It looks like in our test run for CPython 3.6.2 on macOS, one of our calls to the synchronous stdlib function SSLSocket.unwrap() raised an SSLWantWriteError. That should be impossible for a synchronous call, I think? Maybe this is some weird intermittent bug in the stdlib ssl module?
Here's a new one, I guess somehow introduced by #358: a timeout test failing on Windows because a 1-second sleep is measured to take just a tinnnnny bit less than 1 second. https://ci.appveyor.com/project/njsmith/trio/build/1.0.768/job/3lbdyxl63q3h9s21 #361 attempts a diagnosis and fix.
The weird SSL failure happened again: https://travis-ci.org/python-trio/trio/jobs/311618077 Again on CPython 3.6 on MacOS.
Filed upstream as bpo-32219. Possibly for now we should ignore SSLWantWriteError there as a workaround.
Edit: #365 is the PR for ignoring it.
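For reference, the shape of that workaround might look something like this — a minimal sketch, not the actual #365 diff; the helper name and the choice to return None on the spurious error are mine:

```python
import ssl


def unwrap_ignoring_want_write(ssl_sock):
    # Hypothetical helper sketching the workaround: on CPython 3.6 +
    # macOS, the *synchronous* SSLSocket.unwrap() has been seen raising
    # SSLWantWriteError (bpo-32219), which should be impossible for a
    # blocking socket. Swallow it as best-effort.
    try:
        return ssl_sock.unwrap()
    except ssl.SSLWantWriteError:
        return None
```

The point is just to keep the intermittent stdlib bug from failing otherwise-healthy test runs until it's fixed upstream.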
Another freeze on PyPy nightly in tests/test_local.py::test_run_local_simultaneous_runs: https://travis-ci.org/python-trio/trio/jobs/319497598
Same thing happened on Sept. 7, above: https://github.com/python-trio/trio/issues/200#issuecomment-327711926 And back in June: https://github.com/python-trio/trio/issues/200#issuecomment-309187983
Filed a bug: #379
Sigh: https://github.com/python-trio/trio/issues/447
#508
There was a mysterious appveyor build failure here: https://github.com/python-trio/trio/pull/535
Strange PyPy nightly failure: https://travis-ci.org/python-trio/trio/jobs/391945139
Since it happened on master, I can't close/reopen a PR, but restarting the job produced the same effects.
(I think someone restarted the job above and it finally worked: the job is now green.)
Jenkins keeps eating pull requests with a segfault (but only sometimes). Looks like a bug in the immutables library - but we can't reproduce it locally, and I don't know how to get the core dump.
Here's a log with the segfault on Jenkins: https://ci.cryptography.io/blue/rest/organizations/jenkins/pipelines/python-trio/pipelines/trio/branches/PR-575/runs/2/nodes/6/steps/33/log/?start=0
The crash-handler traceback shows it as happening on line 27 of contextvars/__init__.py, which seems to be:
`self._data = immutables.Map()`
And immutables.Map is a complex data structure implemented in C, so agreed that this smells like a bug in that code.
Filed a bug upstream here: https://github.com/MagicStack/immutables/issues/7
The Jenkins thing has been worked around by #583... but now that we're running MacOS tests on Travis, we get #584 instead.
A segfault in pypy 3.6 nightly, apparently related to the faulthandler timeout firing in trio/tests/test_ssl.py::test_renegotiation_randomized: https://travis-ci.org/python-trio/trio/jobs/433409577
Reported it on the #pypy IRC channel anyway, though there's not much to go on yet.
Another strange pypy 3.6 nightly faulthandler traceback: https://travis-ci.org/python-trio/trio/jobs/436962955
I don't really understand this one at all.
An appveyor failure, in test_open_tcp_listeners_backlog: https://ci.appveyor.com/project/njsmith/trio/builds/21430330/job/orp078dryaqrninr
Apparently we set the nominal backlog to 1 and the measured backlog was 15, while we assert that it's <= 10.
Edit (2019-02-05): another failure in the backlog test, this time on macOS: https://dev.azure.com/python-trio/trio/_build/results?buildId=150&view=logs
We asked for 11, but got 5.
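For context on why this test keeps misfiring: the effective backlog can be probed by filling a listener's accept queue without ever calling accept(). A rough sketch (my own helper, not Trio's actual test) — the trouble is that kernels round the nominal backlog up by platform-specific amounts, which is exactly the fudge factor these failures keep blowing through:

```python
import socket


def measure_effective_backlog(nominal=1, probe_limit=32):
    # Hypothetical probe: open a listener with the given nominal
    # backlog, never call accept(), and count how many clients can
    # still complete a connection. Kernels typically allow somewhat
    # more than `nominal` (e.g. nominal + 1 on Linux).
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(nominal)
    clients = []
    try:
        for _ in range(probe_limit):
            c = socket.socket()
            c.settimeout(0.5)
            try:
                c.connect(listener.getsockname())
            except OSError:
                # Queue full: the SYN is dropped and connect times out.
                c.close()
                break
            clients.append(c)
        return len(clients)
    finally:
        for c in clients:
            c.close()
        listener.close()
```

Any assertion comparing this measurement to the nominal value has to pick a tolerance, and the failures above show both "measured much higher" (15 vs 1) and "measured lower" (5 vs 11) cases.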
A failure in test_subprocess.test_signals on macOS: #851
A one-second sleep took 1.39 seconds on Azure Pipelines's macOS py35: https://dev.azure.com/python-trio/trio/_build/results?buildId=598&view=logs
I guess we need to bump up that threshold...
Azure Pipelines macOS 3.5 strikes again: https://dev.azure.com/python-trio/trio/_build/results?buildId=806&view=logs&jobId=93056758-5bfb-5750-f113-e720ddefdb4c
~~Here we tried sleeping for 1 second, and the process actually slept for >2 seconds.~~ [Edit: Actually we tried sleeping for 0.1 second, and the process actually slept for >0.2 seconds.] So our automatic retry process kicked in... and it happened 4 times in a row, so the test still failed.
Maybe we can make those sleep tests more robust by not just retrying, but first sleeping a bit before retrying, in order to give any noisy neighbors a chance to settle down?
And again: https://dev.azure.com/python-trio/trio/_build/results?buildId=841&view=logs&jobId=1448839f-898c-5b46-2db8-b1e2e0090bd8
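The "sleep a bit before retrying" idea floated above might look something like this — a sketch with made-up names and thresholds, not the actual test code:

```python
import time


def check_sleep_duration(duration, max_ratio=3.0, tries=4, settle=1.0):
    # Hypothetical retry loop for a timing-sensitive assertion: if the
    # measured sleep falls outside the accepted window, wait `settle`
    # seconds first -- giving noisy CI neighbors a chance to calm
    # down -- and only then retry, instead of retrying immediately.
    for attempt in range(tries):
        start = time.monotonic()
        time.sleep(duration)
        elapsed = time.monotonic() - start
        if elapsed <= max_ratio * duration:
            return elapsed
        time.sleep(settle)  # back off before the next attempt
    raise AssertionError(
        f"sleep({duration}) took {elapsed:.3f}s on all {tries} tries"
    )
```

Immediate retries tend to hit the same overloaded scheduling window, which would explain the 4-failures-in-a-row run above.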
I've seen https://github.com/python-trio/trio/issues/1170 again.
https://dev.azure.com/python-trio/trio/_build/results?buildId=1498&view=logs&jobId=4bc7ef66-8a59-503f-fa7a-b9533d6853d5&j=4bc7ef66-8a59-503f-fa7a-b9533d6853d5&t=541e27d3-d681-5791-1ce2-84a4e949a5a6 failed:
```
+ mkdir empty
mkdir: cannot create directory 'empty': File exists
```
I see this all the time locally when running ci.sh multiple times, but the empty directory is not supposed to exist in our continuous integration. Maybe there's some sort of cache?
> `mkdir: cannot create directory 'empty': File exists`
Uh... there isn't supposed to be any kind of cache... azure does have a caching feature, but you have to enable it explicitly and tell it to cache specific things; it shouldn't be caching random files inside the source checkout. This looks super broken.
I guess we could attempt to debug it by adding some `ls` calls and waiting to see if it happens again. But if this is also causing trouble for you locally, maybe the pragmatic thing would be to make it `rm -rf empty; mkdir empty` and not worry about it?
`mkdir: cannot create directory 'empty': File exists` strikes again: https://github.com/python-trio/trio/pull/1396#issuecomment-588316930
What if we change it to `mkdir -p`?

```
-p, --parents
       no error if existing, make parent directories as needed
```
Sure, that might work too. We're just flailing in the dark here, so it's probably too optimistic to hope we can make rational fine-grained choices between similar options :-)
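For the record, the two options discussed above would look like this in the script — a sketch, assuming the ci.sh step only needs `empty` to exist and actually be empty:

```shell
#!/bin/sh
set -e

# Option 1: wipe and recreate. Slightly heavier, but also guarantees
# the directory has no stale contents left over from a previous run.
rm -rf empty
mkdir empty

# Option 2: mkdir -p never fails on an existing directory, but it
# leaves any stale contents in place.
mkdir -p empty
```

Either one makes the step idempotent; the only real difference is whether leftover files from a previous run get cleaned up.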