UnstructuredMesh::copy_nodes_and_elements() requires matching _n_parts
We have a case where we're hitting the following assert:
https://github.com/libMesh/libmesh/blob/06410d04d134290b911558d1382b559220edfa57/src/mesh/unstructured_mesh.C#L79
in UnstructuredMesh::copy_nodes_and_elements() via a call to ReplicatedMesh::stitch_meshes() while running in parallel (np=4), because the mesh being "attached" in this particular case only has 3 elements total. (The application is constructing a "master" mesh by stitching together several smaller parts, some of which have very few elements...)
The app runs fine in optimized mode, and that makes sense to me in principle: if you are deep copying one mesh over another, should it matter whether the two were initially partitioned across the same number of procs? This also seems like it could be a fairly common situation when stitching meshes. So I'm wondering whether the assert is overzealous for at least some use cases of copy_nodes_and_elements() but required for others?
Hmm... we either need the communicator on the target mesh to have at least as many ranks as the number of partitions on the source mesh, or we need to "wrap around" any source DofObject processor_id() which equals or exceeds n_processors() on the target mesh. But I don't see any reason we need the exact same number of partitions, if either that weaker assertion or that wrap-around code is in place instead.
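To make the second option concrete, here is a minimal sketch of what such wrap-around code could look like. This is a hypothetical helper (`wrap_processor_ids` is not a libMesh function), applied to the target mesh after the copies have been made:

```cpp
#include "libmesh/mesh_base.h"
#include "libmesh/node.h"
#include "libmesh/elem.h"

// Hypothetical helper (not part of libMesh): fold any processor_id() that
// equals or exceeds the target communicator size back into [0, n_processors()).
void wrap_processor_ids (libMesh::MeshBase & target_mesh)
{
  using namespace libMesh;

  const processor_id_type np = target_mesh.n_processors();

  for (auto & node : target_mesh.node_ptr_range())
    if (node->processor_id() != DofObject::invalid_processor_id &&
        node->processor_id() >= np)
      node->processor_id(node->processor_id() % np);

  for (auto & elem : target_mesh.element_ptr_range())
    if (elem->processor_id() != DofObject::invalid_processor_id &&
        elem->processor_id() >= np)
      elem->processor_id(elem->processor_id() % np);
}
```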
Was there any decision made on a path forward on this?
> Was there any decision made on a path forward on this?
I think the simplest approach is to try changing the line above to
libmesh_assert_greater_equal_msg(_n_parts, other_mesh._n_parts);
and seeing if that gets us any farther.
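For completeness, the `_msg` variants of the libMesh assert macros take a message argument, so the relaxed check in copy_nodes_and_elements() would look something like this (the message text here is just a placeholder):

```cpp
// Relaxed check: the target mesh only needs at least as many partitions as
// the source mesh, so every source processor_id() still maps to a valid part.
libmesh_assert_greater_equal_msg
  (this->_n_parts, other_mesh._n_parts,
   "Cannot copy from a mesh with more partitions than the target mesh");
```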
ping
@jwpeterson, @roystgnr - This issue came up on our CASL call today. The Oak Ridge guys have had to turn off debug builds for BISON because of this issue, so we'd like to pin it down and get it fixed. First things first: we need a test case that replicates the issue.
@stimpsonsg - Can you tell us how to reproduce this? If you have BISON input that triggers it under certain conditions, I'll start from there and see if I can distill it down into a MOOSE- or libMesh-only problem, and then we should be able to get it fixed.
Was there any consensus on whether or not this might be an over-zealous assertion?
@gardnerru I believe you were the one looking into this problem. Did you ever have a chance to check whether the proposed change above fixes the issue you were seeing?
> Was there any consensus on whether or not this might be an over-zealous assertion?
Yes, read up a few comments for the proposed fix.
Let's make this two fixes? We should switch to libmesh_assert_greater_equal_msg immediately, then if that's insufficient for users in the short term we can add and test the wraparound code later.
Safe to close this, since #1590 fixed the current use case, or should we just mark it as a feature request until we get around to the wraparound case too?
So it looks like this issue came up again, only this time it's the other way around: the incoming mesh has more partitions than the target mesh, i.e. _n_parts < other_mesh._n_parts. Once again the code seems to run fine for us in optimized mode, but the correct thing to do is probably still the 'wrap around' approach suggested by @roystgnr:
> Hmm... we either need the communicator on the target mesh to have at least as many ranks as the number of partitions on the source mesh, or we need to "wrap around" any source DofObject processor_id() which equals or exceeds n_processors() on the target mesh.
I think one reason this may all just "work" is that after stitching meshes together we still prepare_for_use() the final mesh anyway, which corrects all the partitioning? (See the sketch below.)
@permcody
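For reference, a minimal sketch of the usage pattern being discussed; the file names, boundary ids, and the abbreviated stitch_meshes() argument list are placeholders, not taken from the actual application:

```cpp
#include "libmesh/libmesh.h"
#include "libmesh/replicated_mesh.h"

using namespace libMesh;

int main (int argc, char ** argv)
{
  LibMeshInit init (argc, argv);

  ReplicatedMesh master (init.comm());
  ReplicatedMesh part (init.comm());

  // Placeholder mesh files.
  master.read ("master.e");
  part.read ("part.e");

  // Stitch "part" onto "master" along matching boundary ids; the remaining
  // stitch_meshes() arguments are left at their defaults here.
  master.stitch_meshes (part, /*this_mesh_boundary=*/1, /*other_mesh_boundary=*/2);

  // Re-preparing the combined mesh repartitions it, which is likely why a
  // mismatched _n_parts goes unnoticed in optimized builds.
  master.prepare_for_use ();

  return 0;
}
```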
You're hitting this with stitching of ReplicatedMesh objects? prepare_for_use() might be enough to fix that up, yeah. It almost certainly shouldn't work with a DistributedMesh; processors would try to query other processors which don't exist about the elements they're expected to own.
In this particular case, yes I believe we are using a ReplicatedMesh.
Is there anything left to fix here? It looks to me like #1727 handles the other._n_parts > this->_n_parts case, and #1590 kept us from being overzealous in the reverse case.