
feat: Simplify region statistics output in HDF5

arng40 opened this issue 1 year ago • 14 comments

This PR is based on the work of @wrtobin & @untereiner: Memory Space Aware Packing, Export info embedded on regions.

With the current implementation, in order to output region statistics we would have to create:

  • 1 <CompositionalMultiphaseStatistics> component + 1 <PeriodicEvent>
  • 1 <TimeHistory> per region with each desired statistics field (hidden from the user) + 1 <PeriodicEvent> per <TimeHistory>
  • 1 <PackCollection> per field and 1 <PeriodicEvent> per <PackCollection>

With multiple regions this could easily lead to hundreds of components to create and maintain manually:
<Tasks>

    <CompositionalMultiphaseStatistics
      name="compflowStatistics"
      flowSolverName="compflow"
      logLevel="1"
      computeCFLNumbers="1"
      computeRegionStatistics="1"/>

    <PackCollection
      name="pressureRegion1Collection"
      objectPath="Path/to/field/"
      fieldName="pressure"/>
[Omitting around 18 fields X 3 regions ...]
    <PackCollection
      name="minTRegion1Collection"
      objectPath="Path/to/field/"
      fieldName="minT"/>

</Tasks>
 <Outputs>

    <TimeHistory
      name="timeHistoryOutput"
      sources="{ Path/to/regionA/field }"
      filename="outputRegions"/>
[Omitting 2 regions ...]
    <TimeHistory
      name="timeHistoryOutputD"
      sources="{ Path/to/regionD/field }"
      filename="outputRegionsD"/>

  </Outputs>
<Events>

    <PeriodicEvent
      name="statComputeEvent"
      forceDt="1e6"
      target="/Tasks/compflowStatistics"/> 

    <PeriodicEvent
      name="pressureCollectionEvent"
      forceDt="1e6"
      target="/Tasks/pressureRegion1Collection"/>
[Omitting around 18 fields X 3 regions ...]
    <PeriodicEvent
      name="minTCollectionEvent"
      forceDt="1e6"
      target="/Tasks/minTRegion1Collection"/>

    <PeriodicEvent
      name="statOutputEvent"
      forceDt="1e6"
      target="/Outputs/timeHistoryOutput"/>

</Events>

With the controller component proposed in this PR, all <PackCollection> and <TimeHistory> components and their <PeriodicEvent>s would be automatically created / managed:

<Problem>
     
  <Solvers>
    <CompositionalMultiphaseFVM
      name="compflow"
      logLevel="1"
      [...]
      />
  </Solvers>

  <Events
    maxTime="8.64e6">
    <PeriodicEvent
      name="statistics"
      timeFrequency="1e6"
      target="/Tasks/statController"/>
  </Events>

  <Tasks>
    <StatOutputController name="statController">
      <CompositionalMultiphaseStatistics
        name="compflowStatistics"
        flowSolverName="compflow"
        logLevel="1"
        computeCFLNumbers="1"
        computeRegionStatistics="1"
      />
    </StatOutputController>

  </Tasks>
  
</Problem>

arng40 avatar Aug 29 '24 09:08 arng40

<Events>
  <PeriodicEvent
        name="statistics"
        timeFrequency="1e6"
        target="/Tasks/testStats"/>
</Events>

I'm missing something. In this example, what does the target /Tasks/testStats refer to? Should it be /Tasks/statController?
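(For reference, pointing the event at the controller would presumably look like this, assuming the names used in the PR description above:)

```xml
<Events>
  <PeriodicEvent
    name="statistics"
    timeFrequency="1e6"
    target="/Tasks/statController"/>
</Events>
```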

dkachuma avatar Sep 01 '24 02:09 dkachuma

Two more requests:

  • Can we have an optional outputDir attribute (which would be ./ by default)?
  • Do you have a unit test? If not, you should add one.
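For illustration, such an attribute might look like the following hypothetical sketch (outputDir is only the suggested name here, not an existing attribute):

```xml
<Tasks>
  <StatOutputController
    name="statController"
    outputDir="./stats">
    <CompositionalMultiphaseStatistics
      name="compflowStatistics"
      flowSolverName="compflow"
      logLevel="1"/>
  </StatOutputController>
</Tasks>
```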

MelReyCG avatar Sep 26 '24 09:09 MelReyCG

@wrtobin Thanks for your commits! All tests seem to pass now. If @arng40 rebaselines and fixes the docs & code style, is this PR ready from a functional standpoint?

MelReyCG avatar Oct 02 '24 07:10 MelReyCG

The mem-aware packing branch was functionally complete and ready-to-go, the only reason it hasn't been merged has been related to failures of integrated tests (in develop) on our primary GPU test platform. Without the ability to verify develop itself is fully functional, I can't provably guarantee this branch is bug-free on GPU platforms.

Since this is such a low-level piece of infrastructure functionality, without being able to show it doesn't alter our numerical baselines or invalidate our test results in some way, it wouldn't be advisable to merge it using our standard PR procedures.

Essentially I cannot provide the level of guarantee I usually would for a major infrastructure PR, though I am obviously fully invested in fixing bugs if they present themselves.

I only think it should be merged currently if the stakeholders are all comfortable merging it (and we are ready to revert the PR and bugfix in the case a fundamental numerical break is discovered). Otherwise fixing our GPU platform integrated tests in develop is a blocker.

wrtobin avatar Oct 02 '24 15:10 wrtobin

> The mem-aware packing branch was functionally complete and ready-to-go, the only reason it hasn't been merged has been related to failures of integrated tests (in develop) on our primary GPU test platform. […]

What kind and how many failures are we talking about?

CusiniM avatar Oct 03 '24 08:10 CusiniM

> What kind and how many failures are we talking about?

Many. Mostly failures to converge, plus restart-check failures due to linear solver parameter differences. However, there are also out-of-memory errors etc. which are nondeterministic and indicate to me that our LSF launch params or scheduling are faulty in some way. So the set of failing tests (on develop) changes each time you run; combined with the restart-check failures, this means basically none of the test groups succeeds, so -a continue won't really allow repeated invocations of the test suite to eventually succeed despite the launch/scheduling issues (when I ran on develop on Mon/Tue this week, 4 test groups out of hundreds passed on the first run).

wrtobin avatar Oct 03 '24 14:10 wrtobin

None of this is immediately indicative of problems with the code, but rather with the scheduling and with the test suite restart-check baselines, which include the linear solver parameters that often necessarily differ between CPU and GPU platforms. Unfortunately the nondeterministic nature of the failures and the inability to use -a continue preclude meaningfully comparing the sets of failures between develop and this branch.

We can/should add the linear solver params to the whitelist of allowed baseline differences, which might help with the -a continue issue and at least get us to a point where the only failures are nonconvergence issues, which might be comparable between develop and the branch. It'll probably be a slow process; I'll see whether this approach is feasible.

wrtobin avatar Oct 03 '24 15:10 wrtobin

@arng40 You need to alter or add a unitTest / integratedTest to cover your code.

MelReyCG avatar Oct 17 '24 09:10 MelReyCG

@arng40 once you have added the test that @MelReyCG is requesting, please ping @wrtobin, @CusiniM and @rrsettgast to have this PR merged at a time when we could easily revert it if necessary (cf. @wrtobin's comment).

herve-gross avatar Oct 28 '24 22:10 herve-gross

Do the unit tests & staircase_3d run on P3?

MelReyCG avatar Nov 06 '24 15:11 MelReyCG

@arng40 @MelReyCG what is the status of this PR?

paveltomin avatar Aug 08 '25 18:08 paveltomin

Why is StatOutputController needed, rather than the statistics task doing all of this by itself?

paveltomin avatar Aug 19 '25 14:08 paveltomin

@paveltomin the merge will be very hard for me, since #2968 introduces packing for vectors and merging that alongside Bill's work is complicated; alternatively, I can restart from develop with another PR and try to cherry-pick.

arng40 avatar Aug 22 '25 14:08 arng40

@paveltomin The status of this PR is paused. We have a pretty blocking issue on some target platforms which prevents us from packing "1D" arrays (to output them in HDF5) from the GPU. I already spent a pretty long time debugging it, but was not able to precisely pinpoint the source of the issue. I suspect incorrectly aligned data, or something related. I'm on other tasks right now but will come back to it later.

MelReyCG avatar Aug 22 '25 15:08 MelReyCG