feat: Simplify region statistics output in HDF5
This PR is based on the work of @wrtobin and @untereiner: Memory Space Aware Packing, Export info embedded on regions.
With the current implementation, outputting region statistics would require creating:
- 1 <CompositionalMultiphaseStatistics> component and 1 <PeriodicEvent>
- 1 <TimeHistory> per region with each desired statistics field (hidden to the user), and 1 <PeriodicEvent> per <TimeHistory>
- 1 <PackCollection> per field, and 1 <PeriodicEvent> per <PackCollection>

With multiple regions this could easily lead to hundreds of components to create and maintain manually:
<Tasks>
<CompositionalMultiphaseStatistics
name="compflowStatistics"
flowSolverName="compflow"
logLevel="1"
computeCFLNumbers="1"
computeRegionStatistics="1"/>
<PackCollection
name="pressureRegion1Collection"
objectPath="Path/to/field/"
fieldName="pressure"/>
[Omitting around 18 fields X 3 regions ...]
<PackCollection
name="minTRegion1Collection"
objectPath="Path/to/field/"
fieldName="minT"/>
</Tasks>
<Outputs>
<TimeHistory
name="timeHistoryOutputRegionA"
sources="{ Path/to/regionA/field }"
filename="outputRegionsA"/>
[Omitting 2 regions ...]
<TimeHistory
name="timeHistoryOutputRegionD"
sources="{ Path/to/regionD/field }"
filename="outputRegionsD"/>
</Outputs>
<Events>
<PeriodicEvent
name="statComputeEvent"
forceDt="1e6"
target="/Tasks/compflowStatistics"/>
<PeriodicEvent
name="pressureCollectionEvent"
forceDt="1e6"
target="/Tasks/pressureRegion1Collection"/>
[Omitting around 18 fields X 3 regions ...]
<PeriodicEvent
name="pressureCollectionEvent"
forceDt="1e6"
target="/Tasks/minTRegion1Collection"/>
<PeriodicEvent
name="statOutputEvent"
forceDt="1e6"
target="/Output/timeHistoryOutput"/>
</Events>
With the controller component proposed in this PR, all the <PackCollection>, <TimeHistory> and their <PeriodicEvent> components would be automatically created and managed:
<Problem>
<Solvers>
<CompositionalMultiphaseFVM
name="compflow"
logLevel="1"
[...]
/>
</Solvers>
<Events
maxTime="8.64e6">
<PeriodicEvent
name="statistics"
timeFrequency="1e6"
target="/Tasks/statController"/>
</Events>
<Tasks>
<StatOutputController name="statController" >
<CompositionalMultiphaseStatistics
name="compflowStatistics"
flowSolverName="compflow"
logLevel="1"
computeCFLNumbers="1"
computeRegionStatistics="1"
/>
</StatOutputController>
</Tasks>
</Problem>
<Events> <PeriodicEvent name="statistics" timeFrequency="1e6" target="/Tasks/testStats"/> </Events>
I'm missing something. In this example what does the target /Tasks/testStats refer to? Should it be /Tasks/statController?
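If so, a sketch of what the corrected snippet would presumably look like, reusing the statController name from the example above (this only restates the correction the question suggests, not a confirmed fix):
<Events> <PeriodicEvent name="statistics" timeFrequency="1e6" target="/Tasks/statController"/> </Events>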
Two more requests:
- Can we have an optional outputDir attribute (which is ./ by default)? See the sketch below this list.
- Do you have a unit test? If not, you should add one.
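For illustration only, a minimal sketch of how such an attribute might look if added directly on the controller; the outputDir name and its ./ default are just the proposal above, not an existing attribute:
<Tasks>
<StatOutputController
name="statController"
outputDir="./regionStatistics">
<!-- outputDir is hypothetical here: the optional attribute requested above, defaulting to ./ -->
<CompositionalMultiphaseStatistics
name="compflowStatistics"
flowSolverName="compflow"
logLevel="1"
computeRegionStatistics="1"/>
</StatOutputController>
</Tasks>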
@wrtobin Thanks for your commits! All tests seem to pass now. If @arng40 rebaselines and fixes the docs & code style, is this PR ready from a functional standpoint?
The mem-aware packing branch was functionally complete and ready-to-go, the only reason it hasn't been merged has been related to failures of integrated tests (in develop) on our primary GPU test platform. Without the ability to verify develop itself is fully functional, I can't provably guarantee this branch is bug-free on GPU platforms.
Since this is such low-level infrastructure functionality, without being able to show that it doesn't alter our numerical baselines or invalidate our test results in some way, it wouldn't be advisable to merge it using our standard PR procedures.
Essentially, I cannot provide the level of guarantee I usually would for a major infrastructure PR, though I am obviously fully invested in fixing bugs if they present themselves.
I think it should only be merged currently if the stakeholders are all comfortable merging it (and we are ready to revert the PR and bugfix in case a fundamental numerical break is discovered). Otherwise, fixing our GPU-platform integrated tests in develop is a blocker.
What kind and how many failures are we talking about?
Many. Mostly failures to converge, and restart-check failures due to linear solver parameter differences. However, there are also out-of-memory errors and the like, which are nondeterministic and indicate to me that our LSF launch parameters or scheduling are faulty in some way. So the set of failing tests (on develop) changes with each run, and combined with the restart-check failures this means that basically none of the test groups succeeds, so -a continue won't really allow repeated invocations of the test suite to eventually succeed despite the launch/scheduling issues (when I ran on develop on Monday/Tuesday this week, 4 test groups out of hundreds passed on the first run).
None of this is immediately indicative of problems with the code; it points instead at the scheduling and at the test-suite restart-check baselines, which include the linear solver parameters that often necessarily differ between CPU and GPU platforms. Unfortunately, the nondeterministic nature of the failures and the inability to use -a continue preclude meaningfully comparing the sets of failures between develop and this branch.
We can/should add the linear solver parameters to the whitelist of allowed baseline differences, which might help with the -a continue issue and at least get us to a point where the only failures are nonconvergence issues, which might be comparable between develop and the branch. It will probably just be a slow process; I'll see whether this approach is feasible.
@arng40 You need to alter or add a unitTest / integratedTest to cover your code.
@arng40 once you have added the test that @MelReyCG is requesting, please ping @wrtobin, @CusiniM and @rrsettgast to have this PR merged at a time when we could easily revert it if necessary (cf. @wrtobin's comment).
Do the unit tests & staircase_3d run on P3?
@arng40 @MelReyCG what is the status of this PR?
Why is StatOutputController needed, rather than the statistics task doing all of that by itself?
@paveltomin The merge will be very hard for me, since #2968 introduces packing for vectors, and merging that alongside Bill's work is complicated. Alternatively, I can restart from develop with another PR and try to cherry-pick.
@paveltomin This PR is on pause. We have a fairly blocking issue on one of the target platforms that prevents us from packing "1D" arrays (to output them in HDF5) from the GPU. I have already spent quite a long time debugging it, but did not manage to pinpoint the source of the issue precisely. I suspect incorrectly aligned data, or something related. I'm on other tasks right now but will come back to it later.