FluidFramework icon indicating copy to clipboard operation
FluidFramework copied to clipboard

Run full GC only for auto recovery scenarios

Open agarwal-navin opened this issue 1 year ago • 0 comments

Currently, GC runs will fullGC a little often which also results in running summarize with fullTree which is expensive. Here is the list of scenarios where full GC runs today and an analysis of whether full GC or fullTree summary is needed. This PR updates the code based on this table.

Scenario Full GC needed Full Tree summary needed
A container's base snapshot doesn't have any GC state and GC is enabled. No. If there is no GC state in base snapshot, all nodes will regenerate the GC data. So, full GC doesn't have additional value. It only makes running summaries expensive since they will run with full tree. No
A container has GC enabled via metadata, but it is disabled via RunGC test config. No. If GC is disabled, GC won't run, and GC state won't be added to the next summary. So full GC is unnecessary. Yes. The previous summary for data store nodes can have unreferenced property set which may need resetting. In this case, the summarizer node with GC will detect this case and trigger re-summarization.
A container has GC disabled via metadata, but it is enabled via RunGC test config. No. The GC state should not exist in base snapshot (like row 1).Maybe we could add a check for this, but it's not necessary. Even if GC state exists, it was not created via older versions or this would fall into the category of row # 2. So, it shouldn't matter. No
Base snapshot's GC version is newer than current GC version No. In this scenario, GC will be disabled and won't run. Also, GC state won't be added to the summary. So, full GC is unnecessary. No. The base snapshot may have marked nodes as unreferenced via the unreferenced property. It's fine to not reset it because the current version is outdated. Also, due to same reason as row # 2, the property will get reset anyway.
Base snapshot's GC version is older than current GC version No. In this scenario, the GC state generated by the older version may be incorrect. We need to regenerate the full GC state. This will happen automatically because we set the base GC state to undefined and as a result, every node will regenerate GC data. No. The previous summary may have marked some nodes incorrectly unreferenced via the unreferenced property. Those need to be reset. However, it will happen automatically since the nodes won't have base GC state and hence the used routes will have changed.
Auto-recovery for tombstone errors Yes. The tombstone error indicates that unreferenced content is being incorrectly used. This could be a bug in how GC collects GC data and full GC should fix it. No. The previous summary may have marked some nodes incorrectly unreferenced via the unreferenced property. Those need to be reset. However, it will happen automatically since the used routes of these nodes should have changed.
 

AB#8037

Todo

  • [ ] Add tests for all the scenarios above. Tests exist for 3 of the scenarios.
  • [ ] Fix the validation in tests to check for full GC rather than for fullTree summary.

agarwal-navin avatar May 16 '24 23:05 agarwal-navin