Overwrite interior and priorities
Hi,
We have a geometry similar to that of NodeData, meaning that some nodes are shared by adjacent patches on borders/corners, and the value on those nodes should be equal across patches.
In the code, these border/corner nodes are assigned values from large summations over floating-point numbers (particle data), and although their final values should be identical, they are not exactly equal, because of accumulated truncation errors.
In serial executions, this is not a problem because overlaps are processed sequentially, and a single value ends up prevailing on all patches.
In parallel, however, if we unconditionally overwrite interior nodes when exchanging data with schedules, the border nodes are essentially swapped between the two PatchDatas involved in the processed overlap. So if they have slightly different values as a result of truncation errors, they still do afterwards. If we unconditionally set overwrite_interior to false, then border nodes are simply not assigned and keep their slightly different values.
Over time, this slight mismatch grows until the shared nodes have completely different values, which crashes the model.
How should we deal with this? We were hoping that setting overwrite_interior to true or false conditionally would help a single value prevail.
The documentation says:
The concept of ``overlap'' or data dependency is more complex for generic box geometry objects than for just cell-centered box indices in the abstract AMR index space. Problems arise in cases where data lies on the outside corners, faces, or edges of a box. For these data types, it is likely that there will exist duplicate data values on different patches.
The solution implemented here introduces the concept of ``priority'' between patches. Data of patches with higher priority can overwrite the interiors (face, node, or edge values associated with cells that constitute the interior of the patch) of patches with lower priorities, but lower priority patches can never overwrite the interiors of higher priority patches. This scheme introduces a total ordering of data and therefore eliminates the duplicate information problem.
In practice, this protocol means two things: (1) the communication routines must always process copies from low priority sources to high priority sources, and (2) patches must be given special permission to overwrite their interior values during a write. All destinations are therefore represented by three quantities: (1) the box geometry of the destination (which encodes the box, ghost cells, and geometry), (2) the box geometry of the source, and (3) a flag indicating whether the source has a higher priority than the destination (that is, whether the source can overwrite the interior of the destination). If the overwrite flag is set, then data will be copied over the specified box domain and may write into the interior of the destination. If the overwrite flag is not set, then data will be copied only into the ghost cell values and not the interior values of the patch.
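As a toy illustration of the overwrite flag described above (a standalone 1D model, not SAMRAI code; the names are only illustrative), the region a copy may write into is either the whole ghost box or the ghost box minus the patch interior:

```cpp
#include <cassert>
#include <vector>

// Toy 1D model (not SAMRAI code): when the overwrite flag is set, the
// copy may write over the whole ghost box of the destination; when it
// is not, the interior is excluded and only the ghost strips remain.
struct Interval { int lo, hi; };  // closed index range [lo, hi]

std::vector<Interval> writableRegion(Interval interior, int ghostWidth,
                                     bool overwriteInterior) {
    Interval ghostBox{interior.lo - ghostWidth, interior.hi + ghostWidth};
    if (overwriteInterior) {
        return {ghostBox};  // whole ghost box, interior included
    }
    // interior removed: only the two ghost strips may be written
    return {Interval{ghostBox.lo, interior.lo - 1},
            Interval{interior.hi + 1, ghostBox.hi}};
}
```

With an interior of [0, 9] and 5 ghost nodes, the flag toggles between writing [-5, 14] and writing only [-5, -1] and [10, 14].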
However, in our override of BoxGeometry, we don't really understand how this condition should be set. In 1D, where only two patches can share the same node, we could say that the lower rank is always overwritten by the higher rank. But in 2D such a condition seems to end up being a race condition, since a node can be shared by 3 or 4 patches, and the final assignment would depend on the order in which overlaps are processed.
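The order dependence in question can be sketched with a standalone toy model (not SAMRAI code; the rank-based rule is only illustrative): a destination patch accepts every copy whose source rank is higher than its own, so when several higher-ranked patches share the node, the last qualifying copy wins.

```cpp
#include <cassert>
#include <vector>

// Toy model (not SAMRAI code): a destination patch applies every
// incoming copy of a shared node whose source rank is higher than its
// own rank. The last qualifying copy prevails, so with 3 or 4 patches
// sharing a node the result depends on the processing order.
struct Copy { int srcRank; double value; };

double applyCopies(int dstRank, double dstValue,
                   const std::vector<Copy>& copies) {
    for (const Copy& c : copies) {
        if (c.srcRank > dstRank) {  // "higher rank overwrites lower"
            dstValue = c.value;     // last qualifying copy wins
        }
    }
    return dstValue;
}
```

Reordering the copies from ranks 2 and 3 changes which value a rank-0 patch ends up with.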
Is there some example or general advice as to how to set the "priority between patches" as the doc refers to?
Your observations here make sense. My suggestion would be to use the method setDeterministicUnpackOrderingFlag(), which exists in both RefineSchedule and CoarsenSchedule. When set to true, this causes the processing of incoming data to happen in a deterministic sequence. Combined with a priority implementation in your BoxGeometry override class that gives priority to the data from the higher rank, this should cause a single value to end up on the shared nodes.
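The effect of this suggestion can be sketched with a standalone toy model (not the actual SAMRAI implementation; the sort-by-rank ordering is only an assumed stand-in for whatever deterministic sequence the schedule uses): unpacking the copies in a fixed order, combined with higher-rank priority, makes the outcome independent of arrival order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model (not SAMRAI code): unpack incoming copies in a fixed,
// deterministic order (here: sorted by source rank) and let a copy win
// only when its source rank is higher than the destination rank. The
// shared node then always ends up with the value from the
// highest-ranked source, whatever order the messages arrived in.
struct Copy { int srcRank; double value; };

double unpackDeterministic(int dstRank, double dstValue,
                           std::vector<Copy> copies) {
    std::sort(copies.begin(), copies.end(),
              [](const Copy& a, const Copy& b) { return a.srcRank < b.srcRank; });
    for (const Copy& c : copies) {
        if (c.srcRank > dstRank) {  // higher rank has priority
            dstValue = c.value;
        }
    }
    return dstValue;
}
```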
Thanks for your quick response. We have tried your suggestion. We may not have understood it fully, because doing this does not solve the problem entirely. See the following image:

This is a simple patch layout for a single-level simulation. Each rectangle represents a patch, it is run on 10 MPI processes, and the global ID (rank#localID) of the patch box is written in each patch.
We have 5 ghost nodes in each direction.
If you look more specifically at patches p0#1 and p1#4: these patches share border nodes that are part of their respective interiors. Applying one schedule that sets overwrite_interior to true only when source globalID > dest globalID makes this border strictly equal on both cores, as you suggested.
However, this border line extends on 5 ghost nodes on the p0#3 / p1#6 border. After the schedule is applied, these 5 ghosts of p0#1 and p1#4 still mismatch slightly. Our interpretation is that they do because they are processed by the schedule at a point where p0#3 / p1#6 have not yet exchanged/overwritten their own domain border nodes.
As a result, we do not see how simply following your suggestion can fix the mismatch for both shared border nodes and ghost nodes overlapping other domain border nodes; in our opinion, a single schedule cannot achieve it.
So what we did is:
- apply a first schedule for which the overwrite_interior flag is set to true only when source globalID > dest globalID, AND remove the interior of the source (border excluded) from the overlap;
- apply a second schedule, where overwrite_interior is unconditionally set to false, so that only the 5 ghost nodes overlapping the interior (border excluded) of the source are set.
The first schedule ensures that, once it is done, there is no domain-border mismatch. The second schedule updates the ghost nodes, which will now get the same value on borders.
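The two-schedule scheme can be sketched with a standalone toy model (not SAMRAI code; the patch layout and the higher-globalID-wins rule are only illustrative). Phase 1 resolves each shared border to a single value; phase 2 then fills ghosts from the already-resolved borders:

```cpp
#include <cassert>

// Toy model (not SAMRAI code) of the two-schedule scheme.
// Patches p and q share a border node; each also keeps a ghost copy of
// a border node shared by the neighbouring pair r and s.
// Phase 1: only patch-border nodes are exchanged, and the higher
// global id wins, so each shared border value becomes unique.
// Phase 2: ghosts are filled from the (now resolved) borders, with the
// interior never overwritten.
struct Patch { int globalId; double border; double ghost; };

void phase1ResolveBorder(Patch& a, Patch& b) {
    // the higher global id overwrites the lower one's border copy
    double winner = (a.globalId > b.globalId) ? a.border : b.border;
    a.border = winner;
    b.border = winner;
}

void phase2FillGhost(Patch& dst, const Patch& src) {
    dst.ghost = src.border;  // ghost copy of the neighbouring border
}
```

Running phase 2 only after phase 1 has finished is what makes the ghost copies of a border agree: they are filled from a border that has already been made unique.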
We were wondering whether you would see a cleaner/simpler way to do this.
Also, we have done this by implementing a custom VariableFillPattern, which is pretty much a copy-paste of BoxGeometryVariableFillPattern except that the calculateOverlap method is our own. We assumed this function is not used to calculate the overlaps needed for refining data between levels, and that this is done exclusively by the computeFillBoxesOverlap method, which we have left untouched from BoxGeometryVariableFillPattern. It seems OK, but we would feel better with your confirmation.
However, this border line extends on 5 ghost nodes on the p0#3 / p1#6 border. After the schedule is applied, these 5 ghosts of p0#1 and p1#4 still mismatch slightly. Our interpretation is that they do because they are processed by the schedule at a point where p0#3 / p1#6 have not yet exchanged/overwritten their own domain border nodes.
I can see how this is possible for the nodes specifically on p0#3 / p1#6 border.
As a result, we do not see how simply following your suggestion can fix the mismatch for both shared border nodes and ghost nodes overlapping other domain border nodes; in our opinion, a single schedule cannot achieve it.
So what we did is:
* apply a first schedule for which the overwrite_interior flag is set to true only when source globalID > dest globalID, AND remove the interior of the source (border excluded) from the overlap;
* apply a second schedule, where overwrite_interior is unconditionally set to false, so that only the 5 ghost nodes overlapping the interior (border excluded) of the source are set.

The first schedule ensures that, once it is done, there is no domain-border mismatch. The second schedule updates the ghost nodes, which will now get the same value on borders.
We were wondering whether you would see a cleaner/simpler way to do this.
I think you have a reasonable approach. Other applications I have worked with have done something like this to separate the operations on patch boundaries from the operations in the ghost regions. One thing I can suggest is that you could use PatchLevelInteriorFillPattern for your first schedule, so that it only exchanges data on the patch boundaries. Since your second schedule writes into all of the ghosts, you don't need the first schedule to duplicate that.
Also, we have done this by implementing a custom VariableFillPattern, which is pretty much a copy-paste of BoxGeometryVariableFillPattern except that the calculateOverlap method is our own. We assumed this function is not used to calculate the overlaps needed for refining data between levels, and that this is done exclusively by the computeFillBoxesOverlap method, which we have left untouched from BoxGeometryVariableFillPattern. It seems OK, but we would feel better with your confirmation.
This is correct; the calculateOverlap method is for overlaps within the same level of resolution.
thanks
We have decided to no longer group multiple components with potentially disparate geometries, because we have seen some discrepancies with refinement schedules.
Notably this interface https://github.com/LLNL/SAMRAI/blob/master/source/SAMRAI/xfer/RefineAlgorithm.C#L327
A bit of debugging of the "equivalence classes" during registration of the schedule shows some "true" comparisons. I can't say whether this does or does not have something to do with the discrepancy we see, but I find it a bit odd that the fill pattern we are using is ignored and replaced with a new "BoxGeometryVariableFillPattern": https://github.com/LLNL/SAMRAI/blob/master/source/SAMRAI/xfer/RefineSchedule.C#L995
The overlaps we receive from these operations are correct for the first registered item, but not for the later ones, as they have unique geometries potentially different from the first.
If you would like to see a reproduction of this issue just let me know and I'll put something together.