UI testing randomly getting stuck in CI (OSOE-464)
See e.g. this build. The NuGetTest UI testing of OSOCE got stuck and thus timed out in 30 minutes, while all the other builds worked fine. Rerunning just that build fixed the issue. This problem started way before these recent examples and the Orchard Core 1.5 upgrade. Reruns resolve the issue every time.
Other examples unrelated to the 1.5 upgrade: here, here, here, here.
Troubleshooting this is made harder by it apparently not being possible to finish the test-dotnet action and upload artifacts when the build is canceled, see https://github.com/Lombiq/GitHub-Actions/issues/77.
It seems this bug didn't solve itself in the end: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/issues/126
See comments for notes.
To be done:
- [x] Check if this only happens if tests fail and are then retired, due to
Task.Delay()potentially having an effect. - No, this may also happen if there are no failing tests, see here. - [x] Test fix for snyc-over-async in
WebApplicationFactory. See:- https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/pull/317
- https://github.com/Lombiq/UI-Testing-Toolbox/pull/243
- https://github.com/Lombiq/Helpful-Libraries/pull/172
- [x] Test input redirection.
- https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/commit/206ff3ebc88bf955f3cc7697f78ac162aa73ad9b
- https://github.com/Lombiq/GitHub-Actions/commit/83f47fd5f25721291bc54dff7e10547df484ff67
- [x] Test disabled node reuse.
- https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/pull/273
- https://github.com/Lombiq/GitHub-Actions/pull/105
- [x] Test
ui-test-parallelism: 0for NuGetTest builds. Note though that this is already the case for root builds using the larger runners, and it still happened there under Ubuntu.- https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/pull/303
- [x] Test xUnit
"maxParallelThreads": -1.- https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/pull/309
- https://github.com/Lombiq/GitHub-Actions/pull/133
- [ ] Investigate hangs with dump file, see this blog post for tips.
- https://github.com/Lombiq/GitHub-Actions/pull/215
- Investigation: https://github.com/Lombiq/UI-Testing-Toolbox/issues/228#issuecomment-1453287080
- [x] Fix sync over async in OC RecipeHarvester
- https://github.com/Lombiq/UI-Testing-Toolbox/pull/263
- Here is an open issue for this in OC: https://github.com/OrchardCMS/OrchardCore/issues/10329
- [ ] Check all processes (eg. smtp4dev) stopped properly. Hunting for zombie processes...
- It looks like the hang timeout does not help
- Maybe we can use Wait-Process or something similar
Stuck tests outside the scope of blame-hang
| Comment | OS | Runner | Build |
|---|---|---|---|
| https://github.com/Lombiq/UI-Testing-Toolbox/issues/228#issuecomment-1507597413 | Windows | standard | NuGet |
| https://github.com/Lombiq/UI-Testing-Toolbox/issues/228#issuecomment-1458685475, https://github.com/Lombiq/UI-Testing-Toolbox/issues/228#issuecomment-1458696394 | Windows | standard | root |
| https://github.com/Lombiq/UI-Testing-Toolbox/issues/228#issuecomment-1561842178 | Windows | standard | NuGet |
I couldn't reproduce this yet with 140 builds, see the https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/workflows/build-and-test.yml?query=branch%3Aissue%2FOSOE-464. However, the currently added logging shows the details of UI test execution, even if the build is canceled due to a timeout, see here.

So, I think we should merge the new logging, and wait for the issue to randomly arise again. At that point, we'll be able to determine the cause better.
Notes:
Until now, This happened exclusively with the NuGetTest build under Windows. This is a new case that might be the same: With the new, streaming test logging, this failed for the root solution and under Ubuntu (i.e. the polar opposite...).
The log here shows that out of the 6 test projects, only tests for 2 were run: Lombiq.TrainingDemo.Tests and Lombiq.Tests.UI.Samples. There, apparently, all 26 tests have run:
2022-11-16T11:34:32.5767718Z Results File: /home/runner/work/Open-Source-Orchard-Core-Extensions/Open-Source-Orchard-Core-Extensions/test/Lombiq.UITestingToolbox/Lombiq.Tests.UI.Samples/TestResults/test-results.trx
2022-11-16T11:34:32.5768672Z
2022-11-16T11:34:32.5816191Z Test Run Successful.
2022-11-16T11:34:32.5816903Z Total tests: 26
2022-11-16T11:34:32.5818566Z Passed: 26
2022-11-16T11:34:32.5820511Z Total time: 4.9606 Minutes
The following tests I checked and indeed have run (and they all passed) before the build timed out. There are actually 28 "Finishing execution of" runs for the supposedly 26 tests!
-
AccessibilityTest.FrontendPagesShouldBeAccessible -
BasicVisualVerificationTests.VerifyHomePagex2, for Chrome and Edge -
BasicTests.LoginShouldWork -
ErrorHandlingTests.ServerSideErrorOnLoadedPageShouldHaltTest -
EmailTests.SendingTestEmailShouldWork -
MonkeyTests.TestAdminPagesAsMonkeyRecursivelyShouldWorkWithAdminUser -
SqlServerTests.TogglingFeaturesShouldWorkWithSqlServer -
TenantTests.CreatingTenantShouldWork -
AzureBlobStorageTests.AnonymousHomePageShouldExistWithAzureBlobStorage -
BasicTests.TogglingFeaturesShouldWork -
ErrorHandlingTests.BrowserLogsShouldPersist -
MonkeyTests.TestAdminBackgroundTasksAsMonkeyRecursivelyShouldWorkWithAdminUser -
BasicOrchardFeaturesTests.BasicOrchardFeaturesShouldWork- Two "Finishing execution" messages since it failed once but was retried. -
SqlServerTests.AnonymousHomePageShouldExistWithSqlServer -
BasicTests.ShortcutsShouldWork -
AzureBlobStorageTests.TogglingFeaturesShouldWorkWithAzureBlobStorage -
ErrorHandlingTests.ErrorDuringSetupShouldHaltTest -
BasicVisualVerificationTests.VerifyNavbarx2, for Chrome and Edge -
BasicTests.AnonymousHomePageShouldExist -
MonkeyTests.TestCurrentPageAsMonkeyRecursivelyShouldWorkWithAnonymousUser -
ErrorHandlingTests.ClientSideErrorOnLoadedPageShouldHaltTest -
DatabaseSnapshotTests.BasicOrchardFeaturesShouldWorkWithExistingDatabase- Two "Finishing execution" messages but not because fails and retries. It was actually started twice, 1:15 mins apart. This is because to test resume from an existing DB, it starts a test within a test. -
MonkeyTests.TestCurrentPageAsMonkeyShouldWorkWithConfiguredRandomSeed -
BasicVisualVerificationTests.VerifyBlogImage
In the end, all tests have run from the project, just a couple of them more than once. The above list contains 24 tests + 2 Edge executions (normal until now) + 2 duplicated "Finishing execution" messages.
Log: 9_Tests.txt
So, nothing strange here. It's just that nothing happens after the "Total time" line. Then we should have a "Test successful" or "Test failed". But, this never arrives, the script is stuck between this line and this one. I guess it doesn't get stuck on if ($?) but somewhere in dotnet test.
dotnet test hanging is not an issue just for us.
- https://github.com/microsoft/vstest/issues/2080 Even James Newton-King himself had this.
- For JNK this was only with failing tests. However, in our case, tests aren't failing. After
Lombiq.Tests.UI.SamplescomesLombiq.HelpfulLibraries.Tests(until the set of test projects, or at least the projects in the solution don't change, then the order should be deterministic), but it's not reached, thus unreported tests can't fail there. Nor can they inLombiq.Tests.UI.Samples, since that project completed without issues. -
noprogresslogging might work, but then we lose visible progress. Redirecting the input from/dev/nullmight solve this. See https://github.com/microsoft/vstest/issues/2080#issuecomment-539879345 for both. For that person this was only an issue under Linux and bash, so again the polar opposite of ours (for the most part). - The input redirect trick didn't help others, but more CPU cores did (the last example of our problem happened on an 4-core Ubuntu machine, but usually happens under Windows with 2 cores). Unlimited threads for xUnit might help too.
- This corresponds to the previous one: Running tests with
WebApplicationFactorycan cause them to deadlock, because there's still a sync-over-async inTestServer. Here's the fix. I experienced hangs when changing the thread count for xUnit too, see OSOE-149. But! Our tests don't hang actually, since they supposedly fully finish. However, maybe it's worth trying out since perhaps this issue started to manifest after https://github.com/Lombiq/UI-Testing-Toolbox/pull/186. Corresponding ASP.NET Core issue: https://github.com/dotnet/aspnetcore/issues/43353. - We have
GetResult()in a couple of places. The only it seems that's not inevitable is inTestReverseProxy. ButDispose(bool disposing)doesn't seem to actually be called.
- For JNK this was only with failing tests. However, in our case, tests aren't failing. After
- https://github.com/dotnet/sdk/issues/9452 This is exactly the same symptom as ours.
- A possible tip is disabling node reuse, also see the env var approach here. This comment hints that this is a bug in
Process, but still, disabling node reuse can help. https://github.com/microsoft/vstest/issues/1503 shows that this has been a known incompatibility between MSBuild anddotnet testsince 2018. - The same thing apparently happens with
dotnet runroo. - Maybe [
--blame-hang-timeout](https://learn.microsoft.com/en-us/dotnet/core/tools/dotnet-test#:~:text=blame%2Dhang.-,%2D%2Dblame%2Dhang%2Dtimeout,-%3CTIMESPAN%3E%20(Available) would be useful to figure out the cause. But then again, it's not the tests that hang. - Adding a
Microsoft.NET.Test.Sdkpackage reference to all tests might help. We already have this (for projects actually containing test classes, not UI tests extension methods).
- A possible tip is disabling node reuse, also see the env var approach here. This comment hints that this is a bug in
- https://github.com/dotnet/sdk/issues/27106 Same issue, but no new info.
- https://github.com/dotnet/aspnetcore/issues/38649 Related.
- We have a couple of
Task.Delay()s. Removing all of them is not an option, however, only inUITestExecutionSessionit could be moved outside of the app's scope.
- We have a couple of
BTW @dministro since this hanging seems to correlate with https://github.com/Lombiq/UI-Testing-Toolbox/pull/186, does something ring a bell?
Yes, i faced deadlock in NuGetTests some times on TC, but newer GHA. I investigated it, and replaced all lock() statements inside async methods using SemaphorSlim and thread safe collections. I can collect all relevant commits.
It was really simple to reproduce locally, I started the NuGetTests like TC does, and usually the deadlock come up after 2-3 runs. After I removed lock() the way as I mentioned above, the issue has gone locally and on TC too.
So you could repro by basically running dotnet test for all test DLLs (what TC did)? Simply repeated runs of the testing script running in GHA locally doesn't do anything strange on my machine.
I'm testing disabling node reuse under https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/pull/273. We'll see if it reliably fixes the issue, but what's for sure, is that for the root solution, build takes about twice as much (~4 minutes vs ~8)... If it helps, then we can check if removing any of the switches makes it faster but still retains the fix (or perhaps only leave the env var for the test-dotnet action.
I just remembered about dotnet build-server shutdown. We don't need no-node reuse throughout, it's enough to just shut down the build servers after dotnet build. So, builds are not slower.
So you could repro by basically running dotnet test for all test DLLs (what TC did)?
Yes exactly.
OK, thanks.
Happened again here.
So, until now, disabled node reuse (or rather, shutting down build servers) and input redirection didn't help.
Perhaps the problem will get some attention from Microsoft now? https://github.com/microsoft/vstest/issues/2080#issuecomment-1331877872
ui-test-parallelism: 0 didn't help, since the issue happened again here. I'll try xUnit "maxParallelThreads": -1 though that needs some restructuring of how our GHA workflows handle configuration.
Apparently, the xUnit "maxParallelThreads": -1 trick didn't make the issue go away, but it did change it. This run and even its rerun timed out, during Tests as usual. However, while formerly the step got stuck after finishing all tests of Lombiq.OSOCE.NuGet.Tests.UI\, now it was in the middle of the test execution:
Timing out after starting executing tests in the project but no output from the actual tests:

Getting stuck between tests:

Interesting, and this makes a case for the "fix for snyc-over-async in WebApplicationFactory" attempt, since this can be related to what the tests do (as opposed to just what dotnet test does).
I'll wait for some more data to come in.
@0liver you told about this one happening, thanks.
Since this wasn't a NuGetTest build, nor did it use the -1 xUnit config (but 0, as before, since that came with ui-test-parallelism before too; see here), it also suggests that we need the WebApplicationFactory fix. However, it didn't start test execution at all, not even reaching this line. Most possibly it got stuck here, which we can do nothing about, since it's not running tests yet at all...
This run got stuck between two tests. This is suggestive of the WebApplicationFactory fix.
Now we wait to see if the WebApplicationFactory fix solves the issue.
And it doesn't, this build failed. It hung between two tests. So, the xUnit "maxParallelThreads": -1 config apparently didn't cause this, i.e. to hang between tests as opposed to between test projects.
I'm out of ideas since I've tried everything that I could think of and that everyone suggested. Perhaps any other suggestions from you @BenedekFarkas @DAud-IcI @dministro @0liver?
This has been beyond me for quite some time. So, unfortunately, no.
- I've started a build in TC, just in case it yields any useful information about timings: https://ci.lombiq.com/buildConfiguration/OrchardExtensions_OSOCE_Developer_PullAndBuild/1143340
- We can build an updated runner image and spin up a VM based on it, so you can monitor the runner machine real-time.
- For local testing, try changing your BIOS settings to match the core/thread count of the runner machines.
- .NET SDK downgrade? We could stress-test against multiple different versions to see which one(s) produce this problem.
The only thing that comes to my mind, running the tests locally in a loop until the error comes, then attaching the debugger to the running process. It looks like the issue affects windows runner which means for me, we can reproduce it on windows(locally) so this simplifies it.
What about the distribution between standard and larger runners? Is there a significant difference? As I see the linked failed builds, most of them are on standard-runner, except the OSOE-84 one.
Thank you for your tips! I'll follow up with these if the issue still surfaces. Because I just noticed that the NuGetTest build, that recently failed, wasn't actually using the new code, since I didn't NuGet-publish and update the UI Testing Toolbox... I just did that now, it's in dev, so waiting for results.
Yes, this is by far the most of an issue with the Windows NuGetTest build, which happens on the 2-core standard runners, It much more rarely happens, but sometimes it does, with the full solution builds, also under Ubuntu (I don't know it every happening for the NuGetTest build under Ubuntu though). So yeah, we should be able to reproduce if constrained to a 2-core CPU.
I’ve had Tests hanging in a root build just last night, on the dev branch (Windows 2core): https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/3700330664/jobs/6268653535#step:10:3968
And another root build timed out just now, on a Ubuntu-22.04-4core runner: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/3704714311/jobs/6277678502#step:10:63
Thank you. This makes it clear that the WebApplicationFactory change didn't fix this. I'll look into this more, then.
This will come handy: https://www.meziantou.net/generating-a-dump-file-when-tests-hang-on-a-ci-machine.htm
Let's hit this:

I think we could become a GitHub sponsor for this guy - we're using his Analyzers in our daily work already, and he provides some really valuable content.
We'll definitely consider him as well during our yearly review of OSS projects/contributors to sponsor.
@meziantou would you be open to investigating and fixing this dotnet test hang as contract work? I don't see a way to get in touch with you privately, so if you would, then please write me an e-mail, it's visible under my profile.
Seems like a good tip: https://github.com/dotnet/sdk/issues/9452#issuecomment-1432516761
This is pretty painful read. :D I mean in the sense that changing your log verbosity fixing (or at least mitigating) this is so random...
Some fresh failed runs from here.
NuGetTest - windows 8 core: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4090829148/jobs/7054512573 root solution - ubuntu 4 core: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4048681485/jobs/6964219088 root solution - windows: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4024330084/jobs/6916236935 root solution - windows 8 core: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4017315201/jobs/6901509104 root solution - ubuntu 4 core: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4017170132/jobs/6901173220 root solution - windows: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/3944558324/jobs/6750596419
https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4254441171/jobs/7400693171 https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4269491317/jobs/7432709938
BTW the binlogs we have enabled for OSOCE now (and available as artifacts under the builds you linked) might help.