simplecov icon indicating copy to clipboard operation
simplecov copied to clipboard

Race Condition in parallel_tests integration: an outdated report may be written

Open PragTob opened this issue 6 years ago • 5 comments

Discovered/verified in #814

Basically apparently it can happen that we generate the correct report but override it with an outdated one. Discovered running the parallel tests part of the features introduced in #841 (currently tagged wip to avoid flakies, remove the wip tag if you want to test this out). It doesn't happen reliably. Might happen 3 times in a row or not at all for 20 runs (for me at least).

I gathered some examples here: https://github.com/colszowka/simplecov/pull/814#issuecomment-573313375

What you can see there is something like this:

Coverage report generated for (2/8) to /home/tobi/github/simplecov/tmp/aruba/project/coverage. 23 / 33 LOC (69.7%) covered.
Coverage report generated for (1/8), (2/8), (3/8) to /home/tobi/github/simplecov/tmp/aruba/project/coverage. 37 / 43 LOC (86.05%) covered.
Coverage report generated for (1/8), (2/8), (3/8), (4/8) to /home/tobi/github/simplecov/tmp/aruba/project/coverage. 42 / 47 LOC (89.36%) covered.
Coverage report generated for (1/8), (2/8) to /home/tobi/github/simplecov/tmp/aruba/project/coverage. 32 / 39 LOC (82.05%) covered.

Notice that the second too last has the right number of loc and coverage expected in the test and also seems to take everything into account (unsure why it says n/8 though, guess: I could run 8 in parallel but parallel tests only spawned 4 as there are only 4 test files) but is seemingly later overridden by an outdated report.

If possible we should of course find and fix that race condition, and then remove the wip tags to always run these tests!

PragTob avatar Jan 11 '20 13:01 PragTob

+1 here

nickshatilo avatar Feb 16 '20 17:02 nickshatilo

Does bumping up merge_timeout mitigate this? It looks like the number of tests involved is small enough that the default timeout of 10 minutes should be more than enough. Just curious whether that's a factor.

I ran into the following case with some long running test suites:

[osm.jruby] Using 'RSpecdevn3' as SimpleCov command name
[osm.jruby] Using 'RSpecdevn1' as SimpleCov command name
[osm.jruby] Using 'RSpecdevn4' as SimpleCov command name
[osm.jruby] Using 'RSpecdevn2' as SimpleCov command name
[osm.jruby] Coverage report generated for RSpecdevn3 to rails/coverage. 44377 / 168731 LOC (26.3%) covered.
[osm.jruby] Coverage report generated for RSpecdevn2, RSpecdevn3 to rails/coverage. 69229 / 156447 LOC (44.25%) covered.
[osm.jruby] Coverage report generated for RSpecdevn1, RSpecdevn2 to rails/coverage. 73551 / 150657 LOC (48.82%) covered.
[osm.jruby] Coverage report generated for RSpecdevn1, RSpecdevn2, RSpecdevn4 to ...rails/coverage. 86913 / 144881 LOC (59.99%) covered.

and the fix, in my case, was to bump up merge_timeout.

That might be a separate issue with similar symptoms. Mentioning it here in case others googling for this fall into the same case.

benissimo avatar Feb 19 '20 17:02 benissimo

@benissimo similar symptoms, separate issue (more user configuration)

In the local test case the tests finish within ~2 seconds so it's way beyond being impacted by the merge timeout.

PragTob avatar Feb 19 '20 18:02 PragTob

This might be related to us not doing correct checking which I ask/discuss in https://github.com/grosser/parallel_tests/issues/772 - after that is resolved and changes are implemented this should be worth re checking.

PragTob avatar Aug 11 '20 15:08 PragTob

It seems to be better now but not fixed. CI now occasionally fails with:

(::) failed steps (::)

Expected `bundle exec parallel_rspec spec` to succeed but got non-zero exit status and the following output:

2 processes for 4 specs, ~ 2 specs per process
...

Finished in 0.02936 seconds (files took 0.67039 seconds to load)
3 examples, 0 failures

...

Finished in 0.02201 seconds (files took 0.6368 seconds to load)
3 examples, 0 failures

Coverage report generated for (2/2) to /home/runner/work/simplecov/simplecov/tmp/aruba/project/coverage. 28 / 37 LOC (75.68%) covered.

which funnily enough happens on JRuby the 2-3 times I've seen it and only that spec. Needs a fix soon-ish or a wip on the cuke as not to be annoying/make people think they did anything wrong.

PragTob avatar Aug 16 '20 18:08 PragTob