[Kitten] - Acquire OptProf build issue diagnostic data
Context
Our OptProf build tests have a nontrivial failure rate, caused by the Solution Unload test step stalling. We need more diagnostic data (ETW traces, dumps) so that the Project System team can investigate further.
Symptoms
Error message: Test 'OpenAndCloseProjectTestSolution' exceeded execution timeout period.
More details:
+ Close VS
Warning: (2:58.513) [Platform:TestLoggerDefault] Unable to retrieve an "IVsResourceManagerCacheControl".
+ Test Cleanup
+ TestCleanup
(19:59.204) [Platform:Testcase] ResultArchive: Copying C:\Windows\Microsoft.NET\Framework\v4.0.30319\ngen.log to C:\Test\Re\Archived Artifacts\NgenLogs\Framework\v4.0.30319
(19:59.209) [Platform:Testcase] ResultArchive: Copying C:\Windows\Microsoft.NET\Framework64\v4.0.30319\ngen.log to C:\Test\Re\Archived Artifacts\NgenLogs\Framework64\v4.0.30319
(20:04.230) [Platform:Testcase] ResultArchive: Copying C:\Test\Results\Deploy_*** 2024-01-13 11_02_42\Out\OmniLog.html to C:\Test\Re\Archived Artifacts
+ Final Test Result Verification
The VS instance seems to be stuck during closing.
This correlates with the screen capture from the test run, where unloading projects appears to be stuck.
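The size of the stall can be read off the log timestamps. The following is a minimal sketch, assuming the parenthesised prefix is elapsed time in `minutes:seconds.milliseconds`; the helper names are hypothetical and the message bodies are trimmed from the excerpt above.

```python
import re

# Matches the "(minutes:seconds.milliseconds)" stamp used in the log excerpt.
STAMP = re.compile(r"\((\d+):(\d{2})\.(\d{3})\)")

def stamp_ms(line):
    """Return the stamp on a log line in milliseconds, or None if there is none."""
    match = STAMP.search(line)
    if match is None:
        return None
    minutes, seconds, millis = (int(group) for group in match.groups())
    return minutes * 60_000 + seconds * 1_000 + millis

def largest_gap(lines):
    """Return (gap_ms, earlier_line, later_line) for the biggest jump between stamped lines."""
    stamped = [(stamp_ms(line), line) for line in lines]
    stamped = [(ms, line) for ms, line in stamped if ms is not None]
    best = (0, None, None)
    for (ms0, line0), (ms1, line1) in zip(stamped, stamped[1:]):
        if ms1 - ms0 > best[0]:
            best = (ms1 - ms0, line0, line1)
    return best

# Stamped lines from the excerpt, with the message bodies trimmed.
log = [
    "+ Close VS",
    "Warning: (2:58.513) [Platform:TestLoggerDefault] Unable to retrieve ...",
    "+ Test Cleanup",
    "(19:59.204) [Platform:Testcase] ResultArchive: Copying ...",
    "(19:59.209) [Platform:Testcase] ResultArchive: Copying ...",
    "(20:04.230) [Platform:Testcase] ResultArchive: Copying ...",
]

gap_ms, before, after = largest_gap(log)
print(f"{gap_ms // 60_000} min {gap_ms % 60_000 / 1000:.3f} s")  # prints "17 min 0.691 s"
```

Under that assumption, about 17 minutes pass between the last message logged while closing VS and the first cleanup message.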
Steps to collect diagnostics
- In the pipeline MSBuild-OptProf in the top right corner click ‘Run pipeline’
- Set testMachineCleanUpStrategy to ‘stop’
- Click Run to deploy the test run.
- Once the run completes (which takes roughly 3 hours), the test machine should stay available for 3 days. To get onto the test machine, use the DevDivLabConnector (wiki) tool. You’ll need the machine name; Rerun OptProf on a Lab Machine (wiki) shows where to get it and how to re-run the test.
- Start ETW collection prior to the repro run:
perfview collect /NoGui /Providers=*Microsoft-Build /BufferSize:8096 /CircularMB:8096 /NoNGenRundown /DataFile:<path-to-trace-file-that-will-be-created>
- If the issue reproduces (VS hangs in project unloading for several minutes), create 2 or 3 memory dumps (minidump should suffice) a couple dozen seconds apart. For dump creation, you can use e.g. Process Explorer.
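If the ETW step has to be repeated on several machines, the PerfView command above can be wrapped in a small script. This is only a sketch: the flags are copied verbatim from the command in the steps, while the function names and the trace file path are hypothetical.

```python
import subprocess

def perfview_collect_args(trace_file, buffer_size=8096, circular_mb=8096):
    """Build the PerfView argument list, using the same flags as the command in the steps."""
    return [
        "perfview", "collect", "/NoGui",
        "/Providers=*Microsoft-Build",
        f"/BufferSize:{buffer_size}",
        f"/CircularMB:{circular_mb}",
        "/NoNGenRundown",
        f"/DataFile:{trace_file}",
    ]

def start_collection(trace_file):
    """Start PerfView in the background; assumes perfview is on PATH on the test machine."""
    return subprocess.Popen(perfview_collect_args(trace_file))

# Hypothetical output path; print the command instead of running it.
print(" ".join(perfview_collect_args(r"C:\traces\optprof-repro.etl")))
```

Call `start_collection(...)` before kicking off the repro run, and stop the collection once the dumps have been taken.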
Notes
- It's not clear whether the issue depends only on timing or on the environment as well, but it's probably best to try to repro on machines that experienced the issue during the initial run.
- It might not be easy to reproduce, so running the repro on more machines from the CI pool might help to get it quicker.
- The machines are kept only for a certain duration (I believe 48 or 72 hours) and are then force-recycled.
- Copy the collected data out to an internal share or, alternatively, SharePoint.
minidump should suffice
A minidump isn't going to contain the information we need. We need a full dump in order to see async task chains.
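One way to script full dumps taken some seconds apart is sketched below. It assumes Sysinternals ProcDump is available on the machine (the steps above mention Process Explorer instead); `-ma` writes a dump with all process memory. The function names, PID, output folder, and file names are hypothetical.

```python
import subprocess
import time
from pathlib import PureWindowsPath

def dump_commands(pid, out_dir, count=3):
    """Build one ProcDump command per full dump (-ma = all process memory)."""
    folder = PureWindowsPath(out_dir)
    return [
        ["procdump", "-accepteula", "-ma", str(pid), str(folder / f"devenv_{pid}_{i}.dmp")]
        for i in range(1, count + 1)
    ]

def capture_dumps(pid, out_dir, count=3, interval_s=30):
    """Run the commands one after another, waiting interval_s seconds between dumps."""
    for index, command in enumerate(dump_commands(pid, out_dir, count)):
        if index > 0:
            time.sleep(interval_s)
        # No check=True: this sketch does not assume a particular ProcDump exit code.
        subprocess.run(command)

# Hypothetical PID and folder; print the plan instead of running it.
for command in dump_commands(1234, r"C:\dumps"):
    print(" ".join(command))
```

On the test machine, `capture_dumps(<devenv PID>, <output folder>)` would be called once VS has been stuck in project unloading for a few minutes.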
Current strategy: track the build state for a week. If there is no repro, we can close the ticket.
It no longer reproduces; closing it.