
Tracking issue: intermittent test failures

Open njsmith opened this issue 8 years ago • 50 comments

It's the nature of I/O libraries like Trio that their test suites are prone to weird intermittent failures. But they're often hard to track down, and usually the way you encounter them is that you're trying to land some other unrelated feature and the CI randomly fails, so the temptation is to click "re-run build" and worry about it later.

This temptation must be resisted. If left unchecked, you eventually end up with tests that fail all the time for unknown reasons and no-one trusts them and it's this constant drag on development. Flaky tests must be eradicated.

But to make things extra fun, there's another problem: CI services genuinely are a bit flaky, so when you see a weird failure or lock-up in the tests then it's often unclear whether this is a bug in our code, or just some cloud provider having indigestion. And you don't want to waste hours trying to reproduce indigestion. Which means we need to compare notes across multiple failures. Which is tricky when I see one failure, and you see another, and neither of us realizes that we're seeing the same thing. Hence: this issue.

What to do if you see a weird test failure that makes no sense:

  • Visit this bug; it's #200 so it's hopefully easy to remember.

  • Check to see if anyone else has reported the same failure

  • Either way, add a note recording what you saw. Make sure to link to the failed test log.

  • Special notes for specific CI services:

    • If it's a failed azure pipelines run, then go to the build page, click the "..." menu icon in the upper right, and select "Retain build". (Otherwise the logs will be deleted after 30 days or something like that.)
    • If it's a failed travis-ci run, DO NOT CLICK THE "RESTART BUILD" OR "RESTART JOB" BUTTON! That will wipe out the log and replace it with your new run, so we lose the information about what failed. Instead, close and then re-open the PR; this will tickle Travis into re-testing your commit, but in a way that gives the new build a new URL, so that the old log remains accessible.

Issues we're monitoring currently

  • flakiness in test_open_tcp_listeners_backlog: https://github.com/python-trio/trio/issues/200#issuecomment-451885495 (last seen: all the time, should be fixed by #1601)
  • #1277 (last seen: October 2019)
  • segfault in pypy 3.6 nightly after faulthandler timeout fired: https://github.com/python-trio/trio/issues/200#issuecomment-424674288 (last seen: late 2018)
  • #851 (last seen: January 2019)

njsmith avatar Jun 11 '17 09:06 njsmith

Regarding the weird pypy nightly freeze in test_local: I downloaded the pypy-c-jit-91601-609a3cdf9cf7-linux64 nightly, and have let it loop running the trio test suite for the last few hours on my laptop, and I haven't been able to reproduce the problem so far. (Though I did get two failures in test_ki.py, both on this line, which is getting "did not raise KeyboardInterrupt". The test sessions otherwise finished normally.)

njsmith avatar Jun 17 '17 02:06 njsmith

#119 is now fixed, I think / hope. Demoting it to "on the radar".

njsmith avatar Aug 04 '17 22:08 njsmith

Got fed up and fixed #140 :-)

njsmith avatar Aug 08 '17 15:08 njsmith

#140 came back. This makes no sense at all.

njsmith avatar Aug 18 '17 08:08 njsmith

Freeze in test_local.py::test_run_local_simultaneous_runs on pypy3 5.8 – maybe just travis being flaky, maybe something more.

njsmith avatar Sep 07 '17 07:09 njsmith

Here's a weird one: https://travis-ci.org/python-trio/trio/jobs/298164123

It looks like in our test run for CPython 3.6.2 on macOS, one of our calls to the synchronous stdlib function SSLSocket.unwrap() raised an SSLWantWriteError. That should be impossible for a synchronous call, I think? Maybe this is some weird intermittent bug in the stdlib ssl module?

njsmith avatar Nov 06 '17 23:11 njsmith

Here's a new one, I guess somehow introduced by #358: a timeout test failing on windows because a 1 second sleep is measured to take just a tinnnnny bit less than 1 second. https://ci.appveyor.com/project/njsmith/trio/build/1.0.768/job/3lbdyxl63q3h9s21 #361 attempts a diagnosis and fix.

njsmith avatar Dec 01 '17 00:12 njsmith

The weird SSL failure happened again: https://travis-ci.org/python-trio/trio/jobs/311618077. Again on CPython 3.6 on macOS.

Filed upstream as bpo-32219. Possibly for now we should ignore SSLWantWriteError there as a workaround.

Edit: #365 is the PR for ignoring it.
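A minimal sketch of what "ignoring it" could look like. The helper name and structure here are hypothetical, not the actual code in #365; the point is just to swallow the spurious SSLWantWriteError that bpo-32219 describes:

```python
import ssl

def unwrap_ignoring_want_write(ssl_sock):
    # Hypothetical workaround sketch for bpo-32219: on macOS,
    # SSLSocket.unwrap() occasionally raises SSLWantWriteError even
    # though the call is synchronous, which should never happen.
    # Treat that as "unwrap didn't complete cleanly" and move on.
    try:
        return ssl_sock.unwrap()
    except ssl.SSLWantWriteError:
        return None
```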

njsmith avatar Dec 05 '17 04:12 njsmith

Another freeze on PyPy nightly in tests/test_local.py::test_run_local_simultaneous_runs: https://travis-ci.org/python-trio/trio/jobs/319497598

Same thing happened on Sept. 7, above: https://github.com/python-trio/trio/issues/200#issuecomment-327711926 And back in June: https://github.com/python-trio/trio/issues/200#issuecomment-309187983

Filed a bug: #379

njsmith avatar Dec 21 '17 02:12 njsmith

Sigh: https://github.com/python-trio/trio/issues/447

njsmith avatar Feb 21 '18 06:02 njsmith

#508

njsmith avatar Apr 21 '18 05:04 njsmith

There was a mysterious appveyor build failure here: https://github.com/python-trio/trio/pull/535

njsmith avatar May 19 '18 05:05 njsmith

Strange PyPy nightly failure: https://travis-ci.org/python-trio/trio/jobs/391945139

Since it happened on master, I can't close/reopen a PR, but restarting the job produced the same failure.

(I think someone restarted the job above and it finally worked: the job is now green.)

pquentin avatar Jun 13 '18 21:06 pquentin

Jenkins keeps eating pull requests with a segfault (but only sometimes). Looks like a bug in the immutables library, but I can't reproduce it locally, and I don't know how to get the core dump.

Fuyukai avatar Jul 29 '18 11:07 Fuyukai

Here's a log with the segfault on Jenkins: https://ci.cryptography.io/blue/rest/organizations/jenkins/pipelines/python-trio/pipelines/trio/branches/PR-575/runs/2/nodes/6/steps/33/log/?start=0

The crash-handler traceback shows it as happening on line 27 of contextvars/__init__.py, which seems to be:

        self._data = immutables.Map()

And immutables.Map is a complex data structure implemented in C, so agreed that this smells like a bug in that code.

Filed a bug upstream here: https://github.com/MagicStack/immutables/issues/7

njsmith avatar Jul 29 '18 21:07 njsmith

The Jenkins thing has been worked around by #583... but now that we're running MacOS tests on Travis, we get #584 instead.

njsmith avatar Jul 30 '18 07:07 njsmith

A segfault in pypy 3.6 nightly, apparently related to the faulthandler timeout firing in trio/tests/test_ssl.py::test_renegotiation_randomized: https://travis-ci.org/python-trio/trio/jobs/433409577

Reported it on the #pypy IRC channel anyway, though there's not much to go on yet.

njsmith avatar Sep 26 '18 11:09 njsmith

Another strange pypy 3.6 nightly faulthandler traceback: https://travis-ci.org/python-trio/trio/jobs/436962955

I don't really understand this one at all.

njsmith avatar Oct 04 '18 05:10 njsmith

An appveyor failure, in test_open_tcp_listeners_backlog: https://ci.appveyor.com/project/njsmith/trio/builds/21430330/job/orp078dryaqrninr

Apparently we set the nominal backlog to 1 and the measured backlog was 15, while we assert that it's <= 10.

Edit (2019-02-05): another failure in the backlog test, this time on macOS: https://dev.azure.com/python-trio/trio/_build/results?buildId=150&view=logs

We asked for 11, but got 5.
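For context, a backlog test of this flavor works by filling the accept queue and counting successful connects. Here is a hypothetical sketch (not Trio's actual test) that illustrates why the measured number is so kernel-dependent:

```python
import socket

def measure_backlog(nominal=1, attempts=20):
    # Hypothetical sketch (not Trio's actual test): bind a listener with
    # a small nominal backlog, never call accept(), and count how many
    # client connects succeed before the kernel stops letting them in.
    # Kernels freely round the backlog up or down (Linux adds slack,
    # other platforms clamp it), which is exactly why comparing the
    # measured value against the nominal one is flaky.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(nominal)
    addr = listener.getsockname()
    clients = []
    try:
        for _ in range(attempts):
            c = socket.socket()
            c.settimeout(0.2)
            try:
                c.connect(addr)
            except OSError:
                # Queue full: Linux silently drops the SYN (so we time
                # out), while other platforms may send an RST instead.
                c.close()
                break
            clients.append(c)
        return len(clients)
    finally:
        for c in clients:
            c.close()
        listener.close()
```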

njsmith avatar Jan 07 '19 10:01 njsmith

A failure in test_subprocess.test_signals on macOS: #851

njsmith avatar Jan 11 '19 03:01 njsmith

A one-second sleep took 1.39 seconds on Azure Pipelines' macOS py35 job: https://dev.azure.com/python-trio/trio/_build/results?buildId=598&view=logs

I guess we need to bump up that threshold...

njsmith avatar Jun 04 '19 03:06 njsmith

Azure Pipelines macOS 3.5 strikes again: https://dev.azure.com/python-trio/trio/_build/results?buildId=806&view=logs&jobId=93056758-5bfb-5750-f113-e720ddefdb4c

~~Here we tried sleeping for 1 second, and the process actually slept for >2 seconds.~~ [Edit: Actually we tried sleeping for 0.1 second, and the process actually slept for >0.2 seconds.] So our automatic retry process kicked in... and it happened 4 times in a row, so the test still failed.

njsmith avatar Jul 27 '19 08:07 njsmith

Maybe we can make those sleep tests more robust by not just retrying, but first sleeping a bit before retrying, in order to give any noisy neighbors a chance to settle down?
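The idea could look something like this (a hypothetical helper, not what Trio's test harness actually does): retry the timing check a few times, but sleep between attempts so a transient load spike doesn't fail every retry in a row.

```python
import time

def run_with_settle(check, attempts=4, settle_time=1.0):
    # Hypothetical retry loop for timing-sensitive tests: if the check
    # fails, sleep for a while before retrying, so transient load from
    # noisy CI neighbors has a chance to die down before we measure again.
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(settle_time)
    return False
```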

njsmith avatar Jul 27 '19 08:07 njsmith

And again: https://dev.azure.com/python-trio/trio/_build/results?buildId=841&view=logs&jobId=1448839f-898c-5b46-2db8-b1e2e0090bd8

njsmith avatar Aug 02 '19 05:08 njsmith

I've seen https://github.com/python-trio/trio/issues/1170 again.

pquentin avatar Jan 31 '20 09:01 pquentin

https://dev.azure.com/python-trio/trio/_build/results?buildId=1498&view=logs&jobId=4bc7ef66-8a59-503f-fa7a-b9533d6853d5&j=4bc7ef66-8a59-503f-fa7a-b9533d6853d5&t=541e27d3-d681-5791-1ce2-84a4e949a5a6 failed:

       + mkdir empty
       mkdir: cannot create directory 'empty': File exists

I see this all the time locally when running ci.sh multiple times, but the empty directory is not supposed to exist in our continuous integration. Maybe there's some sort of cache?

pquentin avatar Feb 11 '20 05:02 pquentin

       mkdir: cannot create directory 'empty': File exists

Uh... there isn't supposed to be any kind of cache... azure does have a caching feature, but you have to enable it explicitly and tell it to cache specific things; it shouldn't be caching random files inside the source checkout. This looks super broken.

I guess we could attempt to debug it by adding some ls calls and waiting to see if it happens again. But if this is also causing trouble for you locally, maybe the pragmatic thing would be to make it rm -rf empty; mkdir empty and not worry about it?

njsmith avatar Feb 11 '20 12:02 njsmith

mkdir: cannot create directory 'empty': File exists strikes again: https://github.com/python-trio/trio/pull/1396#issuecomment-588316930

njsmith avatar Feb 19 '20 18:02 njsmith

What if we change it to mkdir -p?

       -p, --parents
              no error if existing, make parent directories as needed
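For comparison, the same semantics expressed in Python (illustrative only; ci.sh itself is a shell script). mkdir -p corresponds to os.makedirs with exist_ok, while the rm -rf variant suggested above additionally guarantees the directory starts out empty:

```python
import os
import shutil

def mkdir_p(path):
    # Like `mkdir -p`: no error if the directory already exists.
    os.makedirs(path, exist_ok=True)

def fresh_empty_dir(path):
    # Like `rm -rf path && mkdir path`: also guarantees the directory
    # starts out empty, not merely that it exists.
    shutil.rmtree(path, ignore_errors=True)
    os.mkdir(path)
```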

wgwz avatar Feb 20 '20 23:02 wgwz

Sure, that might work too. We're just flailing in the dark here, so it's probably too optimistic to hope we can make rational fine-grained choices between similar options :-)

njsmith avatar Feb 21 '20 10:02 njsmith