Report timeouts with a TIMEOUT status instead of FAILED
I am the author of PR qtile/qtile#2331 and want to add pytest-timeout to the qtile test suite.
The change I made to conftest.py was to implement a TIMEOUT status plus inline timings, to make it more obvious which tests fail and which ones time out. Two of the maintainers commented that they would prefer to have this functionality built into pytest-timeout, and I agreed. So here is my attempt at it, and we'd love to get some feedback on whether something like this could be added to pytest-timeout.
Example output, luckily without `TIMEOUT` or `FAILED (...)`:
test/test_bar.py::test_basic PASSED (0.001) [ 2%]
test/test_bar.py::test_basic PASSED (0.116) [ 2%]
test/test_bar.py::test_basic PASSED (0.036) [ 2%]
test/test_bar.py::test_singlespacer PASSED (0.001) [ 2%]
test/test_bar.py::test_singlespacer PASSED (0.114) [ 2%]
test/test_bar.py::test_singlespacer PASSED (0.035) [ 2%]
test/test_bar.py::test_nospacer PASSED (0.001) [ 2%]
test/test_bar.py::test_nospacer PASSED (0.115) [ 2%]
test/test_bar.py::test_nospacer PASSED (0.034) [ 2%]
test/test_bar.py::test_configure_broken_widgets PASSED (0.001) [ 2%]
test/test_bar.py::test_configure_broken_widgets PASSED (0.112) [ 2%]
test/test_bar.py::test_configure_broken_widgets PASSED (0.039) [ 2%]
test/test_bar.py::test_bar_hide_show_with_margin PASSED (0.001) [ 2%]
test/test_bar.py::test_bar_hide_show_with_margin PASSED (0.207) [ 2%]
test/test_bar.py::test_bar_hide_show_with_margin PASSED (0.035) [ 2%]
test/test_check.py::test_check_default_config SKIPPED (needs mypy) [ 3%]
test/test_check.py::test_check_default_config PASSED (0.000) [ 3%]
test/test_check.py::test_check_bad_syntax SKIPPED (needs mypy) [ 3%]
test/test_check.py::test_check_bad_syntax PASSED (0.000) [ 3%]
A custom status could be nice. Probably should have a better mechanism to convey the test status instead of the current grepping for `Timeout` though, which you could do if you did this in timeout itself.
What other implications does introducing a new status have? You mention something about durations? What are other ways we could accidentally break test suites by introducing this? Would the suite still fail if there's a timeout?
A custom status could be nice. Probably should have a better mechanism to convey the test status instead of the current grepping for `Timeout` though, which you could do if you did this in timeout itself.
My up-to-date and original attempts print TIMEOUT and T, respectively.
What other implications does introducing a new status have?
I am not sure whether this could break anything, but I don't think so. The category returned in case of a timeout is still `failed`, so everything should keep working.
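As a rough sketch of the idea (assuming a `timed_out` flag on the report, set by `pytest_runtest_makereport` as described further down), the status hook only changes the printed words while keeping the `failed` category:

```python
import pytest


@pytest.hookimpl(hookwrapper=True)
def pytest_report_teststatus(report, config):
    outcome = yield
    # only the displayed words change; the category stays "failed",
    # so summary counts and the exit code are unaffected
    if getattr(report, "timed_out", False):
        outcome.force_result(("failed", "T", "TIMEOUT"))
```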
You mention something about durations?
Yep, I got confused by the call_only option. We probably want a timeout for the complete test including setup and teardown. But when a timeout occurred and the duration of the test got printed, it was the call duration only, which was smaller than the timeout. That was confusing. I have this right now, but it has increased the complexity a bit. All the timings are now printed as a sum:
test/layouts/test_common.py::test_remove_floating[Stack] PASSED (0.111+1.051+0.041=1.203) [ 77%]
test/layouts/test_common.py::test_remove_floating[Tile] PASSED (0.112+1.052+0.039=1.203) [ 77%]
test/layouts/test_common.py::test_remove_floating[TreeTab] PASSED (0.119+1.136+0.052=1.308) [ 77%]
test/layouts/test_common.py::test_remove_floating[VerticalTile] PASSED (0.114+1.112+0.040=1.266) [ 77%]
test/layouts/test_common.py::test_remove_floating[Zoomy] PASSED (0.110+1.037+0.040=1.187) [ 78%]
test/layouts/test_common.py::test_desktop_notifications[Bsp] PASSED (0.110+1.171+0.039=1.321) [ 78%]
test/layouts/test_common.py::test_desktop_notifications[Columns] PASSED (0.111+1.329+0.041=1.482) [ 78%]
test/layouts/test_common.py::test_desktop_notifications[Floating] PASSED (0.110+1.243+0.040=1.393) [ 78%]
test/layouts/test_common.py::test_desktop_notifications[Matrix] PASSED (0.110+1.286+0.040=1.436) [ 78%]
test/layouts/test_common.py::test_desktop_notifications[Max] PASSED (0.112+1.186+0.038=1.336) [ 79%]
test/layouts/test_common.py::test_desktop_notifications[MonadTall] PASSED (0.111+1.172+0.040=1.323) [ 79%]
It looks a lot nicer in a proper terminal of course; this is the output from our CI setup.
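Just to illustrate the format, the inline sum could be derived from the per-phase report durations along these lines (a hypothetical helper, not the code from my branch):

```python
def format_timings(reports):
    # render the per-phase durations as "(setup+call+teardown=total)"
    durations = [getattr(report, "duration", 0.0) for report in reports]
    parts = "+".join(f"{d:.3f}" for d in durations)
    return f"({parts}={sum(durations):.3f})"
```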
What are other ways we could accidentally break test suites by introducing this?
This could of course break test suites if exceptions occur when running those functions. I tried to make everything as exception-proof as in _pytest.runner, so checking with hasattr etc. should already be enough to avoid errors.
- I am pretty confident that `pytest_runtest_makereport` cannot fail.
- `pytest_report_teststatus` is close to the `_pytest.runner` original and could only fail if
  - pytest overrides the custom `timings` attribute. This could be prevented by choosing a more explicit name, e.g. `_timeout_plugin_timings` or so.
  - `report.duration` is set and is not a number (does not support addition) for at least one report.
  - `report` has no attribute `timed_out` (which should always be set by `pytest_runtest_makereport`). This one can easily be fixed with `getattr`, I'll do this later. The links in this section right here are permalinks, so check out the latest changes to that branch yourself.
  - `report` is missing `passed`, `failed` or `skipped`, e.g. it isn't a `TestReport` instance. That would break the `_pytest.runner` implementation, too.
- `pytest_runtest_protocol` could only fail if
  - `reports = _pytest.runner.runtestprotocol(...)` (internal stuff is bad, I know, but this sadly cannot be avoided when printing the timings after all phases have been run) returns something that is `bool(reports) = True` but not iterable. The return annotation of that internal function is `-> List[TestReport]`.
  - `report` is missing `passed`, `failed` or `skipped`, e.g. it isn't a `TestReport` instance. That would break the `_pytest.runner` implementation, too.
All of those possible issues either break _pytest itself, are highly unlikely (e.g. why would pytest set `report.duration = tuple()` or something else that cannot be added together?), or are easy to spot (`runtestprotocol` changing its return type).
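The `timed_out` flag and the `getattr` fix mentioned above could look roughly like this (a sketch under the assumption that the timeout is detected from the failure text, which is what one currently has to grep for):

```python
import pytest


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    # assumption: recognise the timeout from the failure text, as one would
    # currently do by grepping the output for "Timeout"
    report.timed_out = report.failed and "Timeout" in str(report.longrepr or "")
```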
If printing the timing information is too verbose or too far off from the current pytest behaviour, it can of course be removed. In that case pytest_runtest_protocol would not be required anymore and the dependency on the internal _pytest.runner.runtestprotocol would be gone.
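To make that dependency concrete, the protocol override would look roughly like the following (a simplified sketch using the `timings` attribute name discussed above; the actual code lives in my branch):

```python
from _pytest import runner


def pytest_runtest_protocol(item, nextitem):
    item.ihook.pytest_runtest_logstart(nodeid=item.nodeid, location=item.location)
    # run setup/call/teardown without logging, so the summed timings can be
    # attached to every report before it is reported
    reports = runner.runtestprotocol(item, nextitem=nextitem, log=False)
    durations = [getattr(report, "duration", 0.0) for report in reports]
    for report in reports:
        report.timings = (durations, sum(durations))
        item.ihook.pytest_runtest_logreport(report=report)
    item.ihook.pytest_runtest_logfinish(nodeid=item.nodeid, location=item.location)
    return True
```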
Inline timings are definitely a nice-to-have, because when running pytest -x --durations=0 and a timeout occurs, the duration statistics might not be printed (method=thread). Also, the --durations output is separated by phase, so there can be up to three different lines for the same test function.
Would the suite still fail if there's a timeout?
Yes, it does still fail. At least I haven't seen a test suite that timed out and still got an overall PASS.
I threw this at one of my personal projects and forced some timeouts:
...
test/unit_systems/abstract/test_type_utils.py::test_value_matches_annotation PASSED (0+0+0=0) [ 46%]
test/unit_systems/abstract/test_type_utils.py::test_overload PASSED (0+0.005+0=0.005) [ 53%]
test/unit_systems/abstract/test_type_utils.py::test_typeguard TIMEOUT (0+5.000=5.001) [ 60%]
test/unit_systems/abstract/test_type_utils.py::test_format_a_kw TIMEOUT (0+4.999+0=5.000) [ 66%]
test/unit_systems/si/test_si.py::test_unit_system FAILED (0+0.069+0=0.069) [ 73%]
test/unit_systems/si/test_si.py::test_base_units FAILED (0+0.003+0=0.004) [ 80%]
test/unit_systems/si/test_si.py::test_derived_units FAILED (0+0.005+0=0.006) [ 86%]
test/unit_systems/si/test_si.py::test_prefixes FAILED (0+0.002+0=0.003) [ 93%]
test/unit_systems/si/test_si.py::test_number_prefix_product FAILED (0+0.005+0=0.005) [100%]
The first timeout occurred during the call phase (s+c=total) and everything worked out fine. The second one occurred during the teardown phase (s+c+t=total). Here things start to become wobbly:
traceback (all the other failed tests have the exact same traceback):

```
_________________________________ test_unit_system _________________________________

cls = <...>, func = <... at 0x7f9a6e854ee0>, when = 'teardown'
reraise = (<...>, <...>)

    @classmethod
    def from_call(
        cls,
        func: "Callable[[], TResult]",
        when: "Literal['collect', 'setup', 'call', 'teardown']",
        reraise: Optional[
            Union[Type[BaseException], Tuple[Type[BaseException], ...]]
        ] = None,
    ) -> "CallInfo[TResult]":
        excinfo = None
        start = timing.time()
        precise_start = timing.perf_counter()
        try:
>           result: Optional[TResult] = func()

.tox/py39/lib/python3.9/site-packages/_pytest/runner.py:311:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox/py39/lib/python3.9/site-packages/_pytest/runner.py:255: in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
.tox/py39/lib/python3.9/site-packages/pluggy/hooks.py:286: in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
.tox/py39/lib/python3.9/site-packages/pluggy/manager.py:93: in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
.tox/py39/lib/python3.9/site-packages/pluggy/manager.py:84: in <lambda>
    self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
.tox/py39/lib/python3.9/site-packages/_pytest/runner.py:175: in pytest_runtest_teardown
    item.session._setupstate.teardown_exact(item, nextitem)
.tox/py39/lib/python3.9/site-packages/_pytest/runner.py:419: in teardown_exact
    self._teardown_towards(needed_collectors)
.tox/py39/lib/python3.9/site-packages/_pytest/runner.py:434: in _teardown_towards
    raise exc
.tox/py39/lib/python3.9/site-packages/_pytest/runner.py:427: in _teardown_towards
    self._pop_and_teardown()
.tox/py39/lib/python3.9/site-packages/_pytest/runner.py:387: in _pop_and_teardown
    self._teardown_with_finalization(colitem)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_pytest.runner.SetupState object at ...>, colitem = <...>

    def _teardown_with_finalization(self, colitem) -> None:
        self._callfinalizers(colitem)
        colitem.teardown()
        for colitem in self._finalizers:
>           assert colitem in self.stack
E           AssertionError
```
...
test/unit_systems/abstract/test_type_utils.py::test_value_matches_annotation PASSED (0+0+0=0) [ 46%]
test/unit_systems/abstract/test_type_utils.py::test_overload PASSED (0+0.011+0=0.011) [ 53%]
test/unit_systems/abstract/test_type_utils.py::test_typeguard TIMEOUT (0+5.000=5.000) [ 60%]
test/unit_systems/abstract/test_type_utils.py::test_format_a_kw PASSED (0+0+0=0) [ 66%]
test/unit_systems/si/test_si.py::test_unit_system PASSED (0+0.073+0=0.073) [ 73%]
test/unit_systems/si/test_si.py::test_base_units PASSED (0+0.003+0=0.003) [ 80%]
test/unit_systems/si/test_si.py::test_derived_units PASSED (0+0.006+0=0.006) [ 86%]
test/unit_systems/si/test_si.py::test_prefixes PASSED (0+0.002+0=0.002) [ 93%]
test/unit_systems/si/test_si.py::test_number_prefix_product PASSED (0+0.004+0=0.005) [100%]
With that single timeout, the test suite still fails:
=================================================================================== short test summary info ===================================================================================
TIMEOUT (0+5.000=5.000) test/unit_systems/abstract/test_type_utils.py::test_typeguard - Failed: Timeout >5.0s
================================================================================ 1 failed, 14 passed in 5.43s =================================================================================
ERROR: InvocationError for command ~/git/abstract_unit_systems/.tox/py39/bin/coverage run --source=unit_systems -m pytest -W error:::unit_systems (exited with code 1)
___________________________________________________________________________________________ summary ___________________________________________________________________________________________
ERROR: py39: commands failed
Note that the weird issue with the timeout happening during teardown came from a single lucky run. Other exceptions may or may not be thrown when running the tests repeatedly with sleeps.
I forked this repo to https://www.github.com/ep12/pytest-timeout/tree/timing_enhancements and added some commits that contain everything that would be helpful for qtile/qtile. Example output looks like this:
test/layouts/test_max.py::test_max_window_focus_cycle[MaxConfig-xephyr0] PASSED (899ms+3.92s+664ms=5.48s) [ 36%]
test/layouts/test_ratiotile.py::test_ratiotile_add_windows[RatioTileConfig-xephyr0] PASSED (909ms+6.29s+1.02s=8.23s) [ 36%]
test/layouts/test_ratiotile.py::test_ratiotile_add_windows_golden_ratio[RatioTileConfig-xephyr0] PASSED (916ms+6.24s+841ms=8.00s) [ 36%]
See https://github.com/ep12/qtile/tree/time_ci_runs_pytest_timeout_upstream for the options, and check the GitHub Actions CI logs for more output.
Hello again! I just want to point you to my comment in the corresponding qtile issue thread. It looks like the method `both` I implemented in 99edbf04 works. Would you like to verify that?
@flub or anyone else: has anyone had a look at this? I think at least the first two commits (d8275117ef11001996d391d3ee66cf182d1dbd67 and b7f37f8286f4b114fb9384e834d44545eae218dd) would be nice to have. The second commit adds some functionality such that the third commit could even partially be realised in conftest.py. The first commit adds a new method that is a combination of thread and signal: the signal method is tried first and, if the test does not react to it in time, the thread method kills it. That should theoretically combine the reliability of thread with the forgiveness of signal. It's just a proof of concept and not a complete PR, of course.
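For what it's worth, a purely illustrative sketch of that combined idea (my own simplification, not the code from the commit) could look like this: arm a SIGALRM first, and keep a timer thread as a fallback that terminates the process if the signal could not interrupt the test in time.

```python
import os
import signal
import threading


def arm_combined_timeout(timeout: float, grace: float = 1.0) -> threading.Timer:
    # illustrative only: signal-based interruption first, thread-based hard
    # kill as a fallback; must be called from the main thread
    def hard_kill() -> None:
        # last resort, similar to what the thread method already does
        os._exit(1)

    fallback = threading.Timer(timeout + grace, hard_kill)
    fallback.daemon = True

    def on_alarm(signum, frame):
        # the signal managed to interrupt the test, so no hard kill is needed
        fallback.cancel()
        raise TimeoutError(f"Timeout >{timeout}s")

    signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout)
    fallback.start()
    return fallback
```

The real commit would of course have to integrate this with pytest-timeout's existing setup/teardown of the timeout and its traceback dumping; the sketch only shows the ordering of the two mechanisms.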
What are your opinions regarding this all?
Hey, sorry, I don't have a lot of time to spend on pytest-timeout, so if no one else offers reviews I'm afraid my rather sporadic looks at pytest-timeout are all there is.
Anyway, having a quick look at those two commits, they do indeed look like interesting features. Would you like to open, preferably separate, PRs for them?
Nothing to apologise for, really! We all have a lot of stuff to do... I'll make those PRs, hopefully relatively soon, though I can't promise it. I'll probably have to look a bit more at the project structure to write the docs and tests, so that might take a bit. When everything is ready for review I'll mention you so you get notified.
This looks like what I want: I hope it will print TIMEOUT if a test failed because of a timeout. So how is it going now?
Still in progress at the moment...