
UnstructuredMesh::copy_nodes_and_elements() requires matching _n_parts

jwpeterson opened this issue 8 years ago · 13 comments

We have a case where we're hitting the following assert

https://github.com/libMesh/libmesh/blob/06410d04d134290b911558d1382b559220edfa57/src/mesh/unstructured_mesh.C#L79

in UnstructuredMesh::copy_nodes_and_elements() via a call to ReplicatedMesh::stitch_meshes() while running in parallel (np=4), because the mesh being "attached" in this particular case only has 3 elements total. (The application is constructing a "master" mesh by stitching together several smaller parts, some of which have very few elements...)

The app runs fine in optimized mode, and this makes sense to me in principle: if you are deep copying one mesh over another, should it matter whether they are initially partitioned onto the same number of procs? This also seems like it could be a relatively common situation when stitching meshes. So, I'm wondering if the assert is overzealous in at least some use cases of copy_nodes_and_elements() but required in others?
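For reference, the assertion at that link is (roughly) the following partition-count check; see the pinned commit above for the exact line:

libmesh_assert_equal_to (this->_n_parts, other_mesh._n_parts);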

jwpeterson avatar Dec 01 '17 16:12 jwpeterson

Hmm... we either need the communicator on the target mesh to have at least as many ranks as the number of partitions on the source mesh, or we need to "wrap around" any source DofObject processor_id() which equals or exceeds n_processors() on the target mesh. But I don't see any reason we need the exact same number of partitions, if either that weaker assertion or that wrap-around code is in place instead.
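A minimal sketch of that wrap-around, assuming it is applied to each copied object (old_elem and new_elem are hypothetical names for the source and target DofObjects here, not code from any patch):

const processor_id_type source_pid = old_elem->processor_id();
if (source_pid != DofObject::invalid_processor_id &&
    source_pid >= this->n_processors())
  // fold out-of-range ranks back into [0, n_processors())
  new_elem->processor_id() = source_pid % this->n_processors();
else
  new_elem->processor_id() = source_pid;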

roystgnr avatar Dec 01 '17 16:12 roystgnr

Was there any decision made on a path forward on this?

stimpsonsg avatar Dec 13 '17 21:12 stimpsonsg

Was there any decision made on a path forward on this?

I think the simplest approach is to try changing the line above to

libmesh_assert_greater_equal_msg(_n_parts, other_mesh._n_parts, "Source mesh has more partitions than the target mesh");

and seeing if that gets us any farther.

jwpeterson avatar Dec 13 '17 22:12 jwpeterson

ping

stimpsonsg avatar Feb 14 '18 20:02 stimpsonsg

@jwpeterson, @roystgnr - This issue came up on our CASL call today. The Oak Ridge guys have had to turn off debug builds for BISON because of this issue, so we'd like to pin it down and get it fixed. First things first, we need a test case that replicates the issue.

@stimpsonsg - Can you tell us how to reproduce it? If you have a BISON input that triggers it under certain conditions, I'll start from there and see if I can distill it down into a MOOSE- or libMesh-only problem, and then we should be able to get it fixed.

Was there any consensus on whether or not this might be an overzealous assertion?

permcody avatar Feb 14 '18 22:02 permcody

@gardnerru I believe you were the one looking into this problem. Did you ever have a chance to check whether the change proposed above fixes the issue you were seeing?

jwpeterson avatar Feb 14 '18 22:02 jwpeterson

Was there any consensus on whether or not this might be an overzealous assertion?

Yes, read up a few comments for the proposed fix.

jwpeterson avatar Feb 14 '18 22:02 jwpeterson

Let's make this two fixes? We should switch to libmesh_assert_greater_equal_msg immediately, then if that's insufficient for users in the short term we can add and test the wraparound code later.

roystgnr avatar Feb 15 '18 14:02 roystgnr

Safe to close this, since #1590 fixed the current use case, or should we just mark it as a feature request until we get around to the wraparound case too?

roystgnr avatar Feb 16 '18 14:02 roystgnr

So it looks like this issue came up again, only this time it's the other way around, i.e. the incoming mesh has more partitions than the target mesh: _n_parts < other_mesh._n_parts. Once again the code seems to run fine for us in optimized mode, but the correct thing to do is probably still the 'wrap around' approach suggested by @roystgnr:

Hmm... we either need the communicator on the target mesh to have at least as many ranks as the number of partitions on the source mesh, or we need to "wrap around" any source DofObject processor_id() which equals or exceeds n_processors() on the target mesh.

I think one reason this may all just "work" is that after stitching meshes together we still prepare_for_use() the final mesh anyway, which corrects all the partitioning?
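For concreteness, the pattern in question looks roughly like this (the boundary ids are illustrative, and the stitch_meshes() arguments beyond the first three are left at their defaults):

// assuming `init` is the usual LibMeshInit object
ReplicatedMesh master(init.comm());
ReplicatedMesh part(init.comm());
// ... build or read both meshes, possibly with different partition counts ...
master.stitch_meshes(part, /*this_mesh_boundary=*/1, /*other_mesh_boundary=*/2);
master.prepare_for_use(); // repartitions, overwriting the processor_ids copied from `part`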

@permcody

jwpeterson avatar Jun 01 '18 16:06 jwpeterson

You're hitting this with stitching of ReplicatedMesh objects? prepare_for_use() might be enough to fix that up, yeah. It almost certainly shouldn't work with a DistributedMesh; processors would try to query other processors which don't exist about the elements they're expected to own.

roystgnr avatar Jun 01 '18 19:06 roystgnr

In this particular case, yes I believe we are using a ReplicatedMesh.

jwpeterson avatar Jun 01 '18 19:06 jwpeterson

Is there anything left to fix here? It looks to me like #1727 handles the other._n_parts > this->_n_parts case, and #1590 kept us from being overzealous in the reverse case.

roystgnr avatar Oct 25 '20 20:10 roystgnr