msbuild [Kitten] - Acquire OptProf build issue diagnostic data

Context

Our OptProf build tests have a nontrivial failure rate, caused by a stalling Solution Unload test step. We need more diagnostic data (ETW traces, dumps) so that the Project System team can investigate further.

Symptoms

Error message: Test 'OpenAndCloseProjectTestSolution' exceeded execution timeout period.

More details:

+ Close VS
Warning: (2:58.513) [Platform:TestLoggerDefault] Unable to retrieve an "IVsResourceManagerCacheControl".
+ Test Cleanup
+ TestCleanup
(19:59.204) [Platform:Testcase] ResultArchive: Copying C:\Windows\Microsoft.NET\Framework\v4.0.30319\ngen.log to C:\Test\Re\Archived Artifacts\NgenLogs\Framework\v4.0.30319
(19:59.209) [Platform:Testcase] ResultArchive: Copying C:\Windows\Microsoft.NET\Framework64\v4.0.30319\ngen.log to C:\Test\Re\Archived Artifacts\NgenLogs\Framework64\v4.0.30319
(20:04.230) [Platform:Testcase] ResultArchive: Copying C:\Test\Results\Deploy_*** 2024-01-13 11_02_42\Out\OmniLog.html to C:\Test\Re\Archived Artifacts
+ Final Test Result Verification

The VS instance seems to be stuck during closing.

This correlates with the screen capture from the test run, where project unloading appears to be stuck.

Steps to collect diagnostics

In the MSBuild-OptProf pipeline, click ‘Run pipeline’ in the top right corner.
Set testMachineCleanUpStrategy to ‘stop’.
Click Run to deploy the test run.
Once the run completes (roughly 3 hours), the test machine should stay available for 3 days. To get onto the test machine, use the DevDivLabConnector (wiki) tool. You’ll need the machine name; Rerun OptProf on a Lab Machine (wiki) shows where to get it and how to re-run the test.
Start ETW collection prior to the repro run:

perfview collect /NoGui /Providers=*Microsoft-Build /BufferSize:8096 /CircularMB:8096 /NoNGenRundown /DataFile:<path-to-trace-file-that-will-be-created>

If the issue reproduces (VS hangs in project unloading for several minutes), create 2 or 3 memory dumps (minidump should suffice) a couple dozen seconds apart. For dump creation, you can use e.g. Process Explorer.
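The dump step can also be scripted instead of done by hand. The following is a minimal sketch, not part of the original instructions: it assumes Sysinternals ProcDump is available on the test machine's PATH (the steps above only mention Process Explorer), it requests full dumps via `-ma`, and the function names, the `devenv_<pid>_<n>.dmp` file names, and the 30-second interval are all hypothetical choices.

```python
import subprocess
import time
from pathlib import Path


def build_procdump_commands(pid, out_dir, count=3):
    """Return one ProcDump command line per dump; nothing is executed here."""
    out = Path(out_dir)
    return [
        # -accepteula: skip the EULA prompt; -ma: write a full memory dump.
        # The dump file naming scheme is a hypothetical convention.
        ["procdump", "-accepteula", "-ma", str(pid),
         str(out / f"devenv_{pid}_{i}.dmp")]
        for i in range(1, count + 1)
    ]


def capture_dumps(pid, out_dir, count=3, interval_s=30,
                  run=subprocess.run, sleep=time.sleep):
    """Run the commands in order, pausing interval_s seconds between dumps.

    ProcDump's exit code is not inspected; check that the files exist afterwards.
    """
    commands = build_procdump_commands(pid, out_dir, count)
    for index, command in enumerate(commands):
        run(command)
        if index < len(commands) - 1:
            sleep(interval_s)  # no pause after the last dump
    return commands
```

On the test machine this might be invoked as `capture_dumps(pid, r"C:\dumps")`, where `pid` is the process ID of the hung devenv.exe and the output folder is a placeholder. The `run` and `sleep` parameters exist only so the schedule can be dry-run without ProcDump installed.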

Notes

It's not clear whether the issue is caused only by timing or by the environment as well, so it's probably best to try to repro on machines that experienced the issue during the initial run.
It might not be easy to reproduce, so running the repro on more machines from the CI pool might help to get it sooner.
The machines are kept only for a certain duration (I believe 48 or 72 hours) and are then force-recycled.
Copy the collected data out to an internal share or, alternatively, SharePoint.

Feb 20 '24 11:02 JanKrivanek

minidump should suffice

A minidump isn't going to contain the information we need. We need a full dump in order to see async task chains.

Feb 20 '24 23:02 drewnoakes

Current strategy: track the build state for a week. If there is no repro, we can close the ticket.

Mar 06 '24 09:03 YuliiaKovalova

It doesn't repro any more, so close it.

Apr 18 '24 10:04 JaynieBai