UI-Testing-Toolbox UI testing randomly getting stuck in CI (OSOE-464)

See e.g. this build. The NuGetTest UI testing of OSOCE got stuck and thus timed out in 30 minutes, while all the other builds worked fine. Rerunning just that build fixed the issue. This problem started way before these recent examples and the Orchard Core 1.5 upgrade. Reruns resolve the issue every time.

Other examples unrelated to the 1.5 upgrade: here, here, here, here.

Troubleshooting this is made harder by it apparently not being possible to finish the test-dotnet action and upload artifacts when the build is canceled, see https://github.com/Lombiq/GitHub-Actions/issues/77.

It seems this bug didn't solve itself in the end: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/issues/126

Jira issue

See comments for notes.

To be done:

[x] Check if this only happens if tests fail and are then retired, due to Task.Delay() potentially having an effect. - No, this may also happen if there are no failing tests, see here.
[x] Test fix for snyc-over-async in WebApplicationFactory. See:
- https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/pull/317
- https://github.com/Lombiq/UI-Testing-Toolbox/pull/243
- https://github.com/Lombiq/Helpful-Libraries/pull/172
[x] Test input redirection.
- https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/commit/206ff3ebc88bf955f3cc7697f78ac162aa73ad9b
- https://github.com/Lombiq/GitHub-Actions/commit/83f47fd5f25721291bc54dff7e10547df484ff67
[x] Test disabled node reuse.
- https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/pull/273
- https://github.com/Lombiq/GitHub-Actions/pull/105
[x] Test ui-test-parallelism: 0 for NuGetTest builds. Note though that this is already the case for root builds using the larger runners, and it still happened there under Ubuntu.
- https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/pull/303
[x] Test xUnit "maxParallelThreads": -1.
- https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/pull/309
- https://github.com/Lombiq/GitHub-Actions/pull/133
[ ] Investigate hangs with dump file, see this blog post for tips.
- https://github.com/Lombiq/GitHub-Actions/pull/215
- Investigation: https://github.com/Lombiq/UI-Testing-Toolbox/issues/228#issuecomment-1453287080
[x] Fix sync over async in OC RecipeHarvester
- https://github.com/Lombiq/UI-Testing-Toolbox/pull/263
- Here is an open issue for this in OC: https://github.com/OrchardCMS/OrchardCore/issues/10329
[ ] Check all processes (eg. smtp4dev) stopped properly. Hunting for zombie processes...
- It looks like the hang timeout does not help
- Maybe we can use Wait-Process or something similar

Stuck tests outside the scope of `blame-hang`

Comment	OS	Runner	Build
https://github.com/Lombiq/UI-Testing-Toolbox/issues/228#issuecomment-1507597413	Windows	standard	NuGet
https://github.com/Lombiq/UI-Testing-Toolbox/issues/228#issuecomment-1458685475, https://github.com/Lombiq/UI-Testing-Toolbox/issues/228#issuecomment-1458696394	Windows	standard	root
https://github.com/Lombiq/UI-Testing-Toolbox/issues/228#issuecomment-1561842178	Windows	standard	NuGet

Nov 11 '22 03:11 Piedone

I couldn't reproduce this yet with 140 builds, see the https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/workflows/build-and-test.yml?query=branch%3Aissue%2FOSOE-464. However, the currently added logging shows the details of UI test execution, even if the build is canceled due to a timeout, see here.

So, I think we should merge the new logging, and wait for the issue to randomly arise again. At that point, we'll be able to determine the cause better.

Nov 12 '22 22:11 Piedone

Notes:

Until now, This happened exclusively with the NuGetTest build under Windows. This is a new case that might be the same: With the new, streaming test logging, this failed for the root solution and under Ubuntu (i.e. the polar opposite...).

The log here shows that out of the 6 test projects, only tests for 2 were run: Lombiq.TrainingDemo.Tests and Lombiq.Tests.UI.Samples. There, apparently, all 26 tests have run:

2022-11-16T11:34:32.5767718Z Results File: /home/runner/work/Open-Source-Orchard-Core-Extensions/Open-Source-Orchard-Core-Extensions/test/Lombiq.UITestingToolbox/Lombiq.Tests.UI.Samples/TestResults/test-results.trx
2022-11-16T11:34:32.5768672Z 
2022-11-16T11:34:32.5816191Z Test Run Successful.
2022-11-16T11:34:32.5816903Z Total tests: 26
2022-11-16T11:34:32.5818566Z      Passed: 26
2022-11-16T11:34:32.5820511Z  Total time: 4.9606 Minutes

The following tests I checked and indeed have run (and they all passed) before the build timed out. There are actually 28 "Finishing execution of" runs for the supposedly 26 tests!

AccessibilityTest.FrontendPagesShouldBeAccessible
BasicVisualVerificationTests.VerifyHomePage x2, for Chrome and Edge
BasicTests.LoginShouldWork
ErrorHandlingTests.ServerSideErrorOnLoadedPageShouldHaltTest
EmailTests.SendingTestEmailShouldWork
MonkeyTests.TestAdminPagesAsMonkeyRecursivelyShouldWorkWithAdminUser
SqlServerTests.TogglingFeaturesShouldWorkWithSqlServer
TenantTests.CreatingTenantShouldWork
AzureBlobStorageTests.AnonymousHomePageShouldExistWithAzureBlobStorage
BasicTests.TogglingFeaturesShouldWork
ErrorHandlingTests.BrowserLogsShouldPersist
MonkeyTests.TestAdminBackgroundTasksAsMonkeyRecursivelyShouldWorkWithAdminUser
BasicOrchardFeaturesTests.BasicOrchardFeaturesShouldWork - Two "Finishing execution" messages since it failed once but was retried.
SqlServerTests.AnonymousHomePageShouldExistWithSqlServer
BasicTests.ShortcutsShouldWork
AzureBlobStorageTests.TogglingFeaturesShouldWorkWithAzureBlobStorage
ErrorHandlingTests.ErrorDuringSetupShouldHaltTest
BasicVisualVerificationTests.VerifyNavbar x2, for Chrome and Edge
BasicTests.AnonymousHomePageShouldExist
MonkeyTests.TestCurrentPageAsMonkeyRecursivelyShouldWorkWithAnonymousUser
ErrorHandlingTests.ClientSideErrorOnLoadedPageShouldHaltTest
DatabaseSnapshotTests.BasicOrchardFeaturesShouldWorkWithExistingDatabase - Two "Finishing execution" messages but not because fails and retries. It was actually started twice, 1:15 mins apart. This is because to test resume from an existing DB, it starts a test within a test.
MonkeyTests.TestCurrentPageAsMonkeyShouldWorkWithConfiguredRandomSeed
BasicVisualVerificationTests.VerifyBlogImage

In the end, all tests have run from the project, just a couple of them more than once. The above list contains 24 tests + 2 Edge executions (normal until now) + 2 duplicated "Finishing execution" messages.

Log: 9_Tests.txt

So, nothing strange here. It's just that nothing happens after the "Total time" line. Then we should have a "Test successful" or "Test failed". But, this never arrives, the script is stuck between this line and this one. I guess it doesn't get stuck on if ($?) but somewhere in dotnet test.

Nov 16 '22 20:11 Piedone

dotnet test hanging is not an issue just for us.

https://github.com/microsoft/vstest/issues/2080 Even James Newton-King himself had this.
- For JNK this was only with failing tests. However, in our case, tests aren't failing. After Lombiq.Tests.UI.Samples comes Lombiq.HelpfulLibraries.Tests (until the set of test projects, or at least the projects in the solution don't change, then the order should be deterministic), but it's not reached, thus unreported tests can't fail there. Nor can they in Lombiq.Tests.UI.Samples, since that project completed without issues.
- noprogress logging might work, but then we lose visible progress. Redirecting the input from /dev/null might solve this. See https://github.com/microsoft/vstest/issues/2080#issuecomment-539879345 for both. For that person this was only an issue under Linux and bash, so again the polar opposite of ours (for the most part).
- The input redirect trick didn't help others, but more CPU cores did (the last example of our problem happened on an 4-core Ubuntu machine, but usually happens under Windows with 2 cores). Unlimited threads for xUnit might help too.
- This corresponds to the previous one: Running tests with WebApplicationFactory can cause them to deadlock, because there's still a sync-over-async in TestServer. Here's the fix. I experienced hangs when changing the thread count for xUnit too, see OSOE-149. But! Our tests don't hang actually, since they supposedly fully finish. However, maybe it's worth trying out since perhaps this issue started to manifest after https://github.com/Lombiq/UI-Testing-Toolbox/pull/186. Corresponding ASP.NET Core issue: https://github.com/dotnet/aspnetcore/issues/43353.
- We have GetResult() in a couple of places. The only it seems that's not inevitable is in TestReverseProxy. But Dispose(bool disposing) doesn't seem to actually be called.
https://github.com/dotnet/sdk/issues/9452 This is exactly the same symptom as ours.
- A possible tip is disabling node reuse, also see the env var approach here. This comment hints that this is a bug in Process, but still, disabling node reuse can help. https://github.com/microsoft/vstest/issues/1503 shows that this has been a known incompatibility between MSBuild and dotnet test since 2018.
- The same thing apparently happens with dotnet run roo.
- Maybe [--blame-hang-timeout](https://learn.microsoft.com/en-us/dotnet/core/tools/dotnet-test#:~:text=blame%2Dhang.-,%2D%2Dblame%2Dhang%2Dtimeout,-%3CTIMESPAN%3E%20(Available) would be useful to figure out the cause. But then again, it's not the tests that hang.
- Adding a Microsoft.NET.Test.Sdk package reference to all tests might help. We already have this (for projects actually containing test classes, not UI tests extension methods).
https://github.com/dotnet/sdk/issues/27106 Same issue, but no new info.
https://github.com/dotnet/aspnetcore/issues/38649 Related.
- We have a couple of Task.Delay()s. Removing all of them is not an option, however, only in UITestExecutionSession it could be moved outside of the app's scope.

BTW @dministro since this hanging seems to correlate with https://github.com/Lombiq/UI-Testing-Toolbox/pull/186, does something ring a bell?

Nov 16 '22 22:11 Piedone

Yes, i faced deadlock in NuGetTests some times on TC, but newer GHA. I investigated it, and replaced all lock() statements inside async methods using SemaphorSlim and thread safe collections. I can collect all relevant commits. It was really simple to reproduce locally, I started the NuGetTests like TC does, and usually the deadlock come up after 2-3 runs. After I removed lock() the way as I mentioned above, the issue has gone locally and on TC too.

Nov 16 '22 23:11 dministro

So you could repro by basically running dotnet test for all test DLLs (what TC did)? Simply repeated runs of the testing script running in GHA locally doesn't do anything strange on my machine.

I'm testing disabling node reuse under https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/pull/273. We'll see if it reliably fixes the issue, but what's for sure, is that for the root solution, build takes about twice as much (~4 minutes vs ~8)... If it helps, then we can check if removing any of the switches makes it faster but still retains the fix (or perhaps only leave the env var for the test-dotnet action.

Nov 16 '22 23:11 Piedone

I just remembered about dotnet build-server shutdown. We don't need no-node reuse throughout, it's enough to just shut down the build servers after dotnet build. So, builds are not slower.

Nov 17 '22 02:11 Piedone

So you could repro by basically running dotnet test for all test DLLs (what TC did)?

Yes exactly.

Nov 17 '22 08:11 dministro

OK, thanks.

Nov 17 '22 14:11 Piedone

Happened again here.

So, until now, disabled node reuse (or rather, shutting down build servers) and input redirection didn't help.

Nov 28 '22 13:11 Piedone

Perhaps the problem will get some attention from Microsoft now? https://github.com/microsoft/vstest/issues/2080#issuecomment-1331877872

Nov 30 '22 12:11 Piedone

ui-test-parallelism: 0 didn't help, since the issue happened again here. I'll try xUnit "maxParallelThreads": -1 though that needs some restructuring of how our GHA workflows handle configuration.

Dec 05 '22 16:12 Piedone

Apparently, the xUnit "maxParallelThreads": -1 trick didn't make the issue go away, but it did change it. This run and even its rerun timed out, during Tests as usual. However, while formerly the step got stuck after finishing all tests of Lombiq.OSOCE.NuGet.Tests.UI\, now it was in the middle of the test execution:

Timing out after starting executing tests in the project but no output from the actual tests:

Getting stuck between tests:

Interesting, and this makes a case for the "fix for snyc-over-async in WebApplicationFactory" attempt, since this can be related to what the tests do (as opposed to just what dotnet test does).

I'll wait for some more data to come in.

Dec 08 '22 23:12 Piedone

@0liver you told about this one happening, thanks.

Since this wasn't a NuGetTest build, nor did it use the -1 xUnit config (but 0, as before, since that came with ui-test-parallelism before too; see here), it also suggests that we need the WebApplicationFactory fix. However, it didn't start test execution at all, not even reaching this line. Most possibly it got stuck here, which we can do nothing about, since it's not running tests yet at all...

Dec 09 '22 17:12 Piedone

This run got stuck between two tests. This is suggestive of the WebApplicationFactory fix.

Dec 11 '22 23:12 Piedone

Now we wait to see if the WebApplicationFactory fix solves the issue.

Dec 13 '22 21:12 Piedone

And it doesn't, this build failed. It hung between two tests. So, the xUnit "maxParallelThreads": -1 config apparently didn't cause this, i.e. to hang between tests as opposed to between test projects.

I'm out of ideas since I've tried everything that I could think of and that everyone suggested. Perhaps any other suggestions from you @BenedekFarkas @DAud-IcI @dministro @0liver?

Dec 14 '22 00:12 Piedone

This has been beyond me for quite some time. So, unfortunately, no.

Dec 14 '22 00:12 0liver

I've started a build in TC, just in case it yields any useful information about timings: https://ci.lombiq.com/buildConfiguration/OrchardExtensions_OSOCE_Developer_PullAndBuild/1143340
We can build an updated runner image and spin up a VM based on it, so you can monitor the runner machine real-time.
For local testing, try changing your BIOS settings to match the core/thread count of the runner machines.
.NET SDK downgrade? We could stress-test against multiple different versions to see which one(s) produce this problem.

Dec 14 '22 12:12 BenedekFarkas

The only thing that comes to my mind, running the tests locally in a loop until the error comes, then attaching the debugger to the running process. It looks like the issue affects windows runner which means for me, we can reproduce it on windows(locally) so this simplifies it.

What about the distribution between standard and larger runners? Is there a significant difference? As I see the linked failed builds, most of them are on standard-runner, except the OSOE-84 one.

Dec 14 '22 14:12 dministro

Thank you for your tips! I'll follow up with these if the issue still surfaces. Because I just noticed that the NuGetTest build, that recently failed, wasn't actually using the new code, since I didn't NuGet-publish and update the UI Testing Toolbox... I just did that now, it's in dev, so waiting for results.

Yes, this is by far the most of an issue with the Windows NuGetTest build, which happens on the 2-core standard runners, It much more rarely happens, but sometimes it does, with the full solution builds, also under Ubuntu (I don't know it every happening for the NuGetTest build under Ubuntu though). So yeah, we should be able to reproduce if constrained to a 2-core CPU.

Dec 15 '22 01:12 Piedone

I’ve had Tests hanging in a root build just last night, on the dev branch (Windows 2core): https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/3700330664/jobs/6268653535#step:10:3968

And another root build timed out just now, on a Ubuntu-22.04-4core runner: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/3704714311/jobs/6277678502#step:10:63

Dec 15 '22 14:12 0liver

Thank you. This makes it clear that the WebApplicationFactory change didn't fix this. I'll look into this more, then.

Dec 15 '22 23:12 Piedone

This will come handy: https://www.meziantou.net/generating-a-dump-file-when-tests-hang-on-a-ci-machine.htm

Dec 21 '22 11:12 Piedone

Let's hit this:

I think we could become a GitHub sponsor for this guy - we're using his Analyzers in our daily work already, and he provides some really valuable content.

Dec 27 '22 09:12 0liver

We'll definitely consider him as well during our yearly review of OSS projects/contributors to sponsor.

Dec 27 '22 15:12 Piedone

@meziantou would you be open to investigating and fixing this dotnet test hang as contract work? I don't see a way to get in touch with you privately, so if you would, then please write me an e-mail, it's visible under my profile.

Jan 03 '23 22:01 Piedone

Seems like a good tip: https://github.com/dotnet/sdk/issues/9452#issuecomment-1432516761

Feb 16 '23 17:02 Piedone

This is pretty painful read. :D I mean in the sense that changing your log verbosity fixing (or at least mitigating) this is so random...

Feb 16 '23 17:02 BenedekFarkas

Some fresh failed runs from here.

NuGetTest - windows 8 core: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4090829148/jobs/7054512573 root solution - ubuntu 4 core: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4048681485/jobs/6964219088 root solution - windows: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4024330084/jobs/6916236935 root solution - windows 8 core: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4017315201/jobs/6901509104 root solution - ubuntu 4 core: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4017170132/jobs/6901173220 root solution - windows: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/3944558324/jobs/6750596419

https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4254441171/jobs/7400693171 https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/4269491317/jobs/7432709938

Feb 16 '23 17:02 dministro

BTW the binlogs we have enabled for OSOCE now (and available as artifacts under the builds you linked) might help.

Feb 18 '23 11:02 Piedone

UI testing randomly getting stuck in CI (OSOE-464)

Stuck tests outside the scope of blame-hang

Stuck tests outside the scope of `blame-hang`