UnstructuredMesh::copy_nodes_and_elements() requires matching _n_parts
We have a case where we're hitting the following assert:
https://github.com/libMesh/libmesh/blob/06410d04d134290b911558d1382b559220edfa57/src/mesh/unstructured_mesh.C#L79
in UnstructuredMesh::copy_nodes_and_elements() via a call to ReplicatedMesh::stitch_meshes() while running in parallel (np=4), because the mesh being "attached" in this particular case only has 3 elements total. (The application is constructing a "master" mesh by stitching together several smaller parts, some of which have very few elements...)
The app runs fine in optimized mode, and that makes sense to me in principle: if you are deep copying one mesh over another, should it matter whether the two were initially partitioned across the same number of procs? This also seems like it could be a fairly common situation when stitching meshes. So I'm wondering whether the assert is overzealous for at least some use cases of copy_nodes_and_elements() but required for others?
Hmm... we either need the communicator on the target mesh to have at least as many ranks as the number of partitions on the source mesh, or we need to "wrap around" any source DofObject processor_id() which equals or exceeds n_processors() on the target mesh. But I don't see any reason we need the exact same number of partitions, if either that weaker assertion or that wrap-around code is in place instead.
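To make the second option concrete, here is a minimal sketch of what such wrap-around code could look like. This is a hypothetical helper (`wrap_processor_ids` is not a libMesh function), applied to the target mesh after the copies have been made:

```cpp
#include "libmesh/mesh_base.h"
#include "libmesh/node.h"
#include "libmesh/elem.h"

// Hypothetical helper (not part of libMesh): fold any processor_id() that
// equals or exceeds the target communicator size back into [0, n_processors()).
void wrap_processor_ids (libMesh::MeshBase & target_mesh)
{
  using namespace libMesh;

  const processor_id_type np = target_mesh.n_processors();

  for (auto & node : target_mesh.node_ptr_range())
    if (node->processor_id() != DofObject::invalid_processor_id &&
        node->processor_id() >= np)
      node->processor_id(node->processor_id() % np);

  for (auto & elem : target_mesh.element_ptr_range())
    if (elem->processor_id() != DofObject::invalid_processor_id &&
        elem->processor_id() >= np)
      elem->processor_id(elem->processor_id() % np);
}
```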
Was there any decision made on a path forward on this?
> Was there any decision made on a path forward on this?
I think the simplest approach is to try changing the line above to
libmesh_assert_greater_equal_msg(_n_parts, other_mesh._n_parts);
and seeing if that gets us any farther.
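For completeness, the `_msg` variants of the libMesh assert macros take a message argument, so the relaxed check in copy_nodes_and_elements() would look something like this (the message text here is just a placeholder):

```cpp
// Relaxed check: the target mesh only needs at least as many partitions as
// the source mesh, so every source processor_id() still maps to a valid part.
libmesh_assert_greater_equal_msg
  (this->_n_parts, other_mesh._n_parts,
   "Cannot copy from a mesh with more partitions than the target mesh");
```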
ping
@jwpeterson, @roystgnr - This issue came up on our CASL call today. The Oak Ridge guys have had to turn off debug builds for BISON because of this issue, so we'd like to pin it down and get it fixed. First things first: we need a test case that replicates the issue.
@stimpsonsg - Can you tell us how to reproduce this? If you have BISON input that triggers it under certain conditions, I'll start from there and see if I can distill it down into a MOOSE- or libMesh-only problem, and then we should be able to get it fixed.
Was there any consensus on whether or not this might be an over-zealous assertion?
@gardnerru I believe you were the one looking into this problem. Did you ever have a chance to check whether the proposed change above fixes the issue you were seeing?
> Was there any consensus on whether or not this might be an over-zealous assertion?
Yes, read up a few comments for the proposed fix.
Let's make this two fixes? We should switch to libmesh_assert_greater_equal_msg immediately, then if that's insufficient for users in the short term we can add and test the wraparound code later.
Safe to close this, since #1590 fixed the current use case, or should we just mark it as a feature request until we get around to the wraparound case too?
So it looks like this issue came up again, only this time it's the other way around: the incoming mesh has more partitions than the target mesh, i.e. _n_parts < other_mesh._n_parts. Once again the code seems to run fine for us in optimized mode, but the correct thing to do is probably still the 'wrap around' approach suggested by @roystgnr:
> Hmm... we either need the communicator on the target mesh to have at least as many ranks as the number of partitions on the source mesh, or we need to "wrap around" any source DofObject processor_id() which equals or exceeds n_processors() on the target mesh.
I think one reason this may all just "work" is that after stitching meshes together we still prepare_for_use() the final mesh anyway, which corrects all the partitioning? (See the sketch below.)
@permcody
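For reference, a minimal sketch of the usage pattern being discussed; the file names, boundary ids, and the abbreviated stitch_meshes() argument list are placeholders, not taken from the actual application:

```cpp
#include "libmesh/libmesh.h"
#include "libmesh/replicated_mesh.h"

using namespace libMesh;

int main (int argc, char ** argv)
{
  LibMeshInit init (argc, argv);

  ReplicatedMesh master (init.comm());
  ReplicatedMesh part (init.comm());

  // Placeholder mesh files.
  master.read ("master.e");
  part.read ("part.e");

  // Stitch "part" onto "master" along matching boundary ids; the remaining
  // stitch_meshes() arguments are left at their defaults here.
  master.stitch_meshes (part, /*this_mesh_boundary=*/1, /*other_mesh_boundary=*/2);

  // Re-preparing the combined mesh repartitions it, which is likely why a
  // mismatched _n_parts goes unnoticed in optimized builds.
  master.prepare_for_use ();

  return 0;
}
```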
You're hitting this with stitching of ReplicatedMesh objects? prepare_for_use() might be enough to fix that up, yeah. It almost certainly shouldn't work with a DistributedMesh; processors would try to query other processors which don't exist about the elements they're expected to own.
In this particular case, yes I believe we are using a ReplicatedMesh.
Is there anything left to fix here? It looks to me like #1727 handles the other._n_parts > this->_n_parts case, and #1590 kept us from being overzealous in the reverse case.