Supporting meshes with the 'no-mesh' attribute on model entities

Open KennethEJansen opened this issue 6 years ago • 5 comments

This is a workflow I have repeated successfully about a dozen times in the past 3 weeks, so it was quite a surprise to get this crash.

kjansen@viz003: /projects/tools/Models/JF_TunnelDasha/FullRoom/ChefFlangeRails/16-1-Chef $ ./runChef.sh  16
PUMI Git hash 2.2.0
PUMI version 2.2.0 Git hash 48804e0344328584417a90ee8366c48c98e21e52
"../simMeshToMdsMesh/geomTranslated.smd" and "../simMeshToMdsMesh/geomTranslated.smd" loaded in 0.248826 seconds
mesh ../simMeshToMdsMesh/mdsMesh/ loaded in 139.709668 seconds
number of tet 135321496 hex 0 prism 0 pyramid 0
mesh entity counts: v 22766253 e 158255604 f 270810837 r 135321496
mesh verified in 360.436327 seconds
planned Zoltan split factor 16 to target imbalance 1.010000 in 574.006320 seconds
mesh expanded from 1 to 16 parts in 89.336629 seconds
mesh migrated from 1 to 16 in 7911.473833 seconds
PARMA_STATUS preRefine disconnected <max avg> 1 0.062
PARMA_STATUS preRefine neighbors <max avg> 12 7.625
PARMA_STATUS preRefine smallest side of max neighbor part 11
PARMA_STATUS preRefine num parts with max neighbors 1
PARMA_STATUS preRefine empty parts 0
PARMA_STATUS preRefine small neighbor counts 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 
PARMA_STATUS preRefine weighted vtx <tot max min avg> 23102856.0 1473765.0 1430371.0 1443928.500
PARMA_STATUS preRefine weighted edge <tot max min avg> 159258429.0 9983876.0 9925488.0 9953651.812
PARMA_STATUS preRefine weighted face <tot max min avg> 271477055.0 16990053.0 16951792.0 16967315.938
PARMA_STATUS preRefine weighted rgn <tot max min avg> 135321496.0 8461184.0 8456661.0 8457593.500
PARMA_STATUS preRefine owned bdry vtx <tot max min avg> 331400 42955 0 20712.500
PARMA_STATUS preRefine shared bdry vtx <tot max min avg> 668003 68133 27813 41750.188
PARMA_STATUS preRefine model bdry vtx <tot max min avg> 318131 43258 13097 19883.188
PARMA_STATUS preRefine sharedSidesToElements <max min avg> 0.016 0.007 0.010
PARMA_STATUS preRefine entity imbalance <v e f r>: 1.02 1.00 1.00 1.00
PARMA_STATUS elm imbalance 1.000 avg 8457593.500
PARMA_STATUS max neighbor slope 1.000000 tolerance 0.012000
PARMA_STATUS no targets found... stopping
PARMA_STATUS gap balanced in 0 steps to 1.000509 in 11.879558 seconds
PARMA_STATUS postGap disconnected <max avg> 1 0.062
PARMA_STATUS postGap neighbors <max avg> 12 7.625
PARMA_STATUS postGap smallest side of max neighbor part 11
PARMA_STATUS postGap num parts with max neighbors 1
PARMA_STATUS postGap empty parts 0
PARMA_STATUS postGap small neighbor counts 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 
PARMA_STATUS postGap weighted vtx <tot max min avg> 23102856.0 1473765.0 1430371.0 1443928.500
PARMA_STATUS postGap weighted edge <tot max min avg> 159258429.0 9983876.0 9925488.0 9953651.812
PARMA_STATUS postGap weighted face <tot max min avg> 271477055.0 16990053.0 16951792.0 16967315.938
PARMA_STATUS postGap weighted rgn <tot max min avg> 135321496.0 8461184.0 8456661.0 8457593.500
PARMA_STATUS postGap owned bdry vtx <tot max min avg> 331400 42955 0 20712.500
PARMA_STATUS postGap shared bdry vtx <tot max min avg> 668003 68133 27813 41750.188
PARMA_STATUS postGap model bdry vtx <tot max min avg> 318131 43258 13097 19883.188
PARMA_STATUS postGap sharedSidesToElements <max min avg> 0.016 0.007 0.010
PARMA_STATUS postGap entity imbalance <v e f r>: 1.02 1.00 1.00 1.00
PARMA_STATUS stepFactor 0.300
PARMA_STATUS sideTol 41750
PARMA_ERROR rank 0 comp 1 iso 0 ... some vertices don't have distance computed
false failed at /projects/tools/SCOREC-core/core/parma/diffMC/parma_graphDist.cc + 376 
signal 6 caught by pcu
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef(reel_trace+0x13)[0x13cdf63]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef[0x13cdfc5]
/lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7f3d22d320e0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f3d22d32067]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f3d22d33448]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef[0x13cddd5]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef(_ZN5parma16measureGraphDistEPN3apf4MeshE+0x7cb)[0x127ac9b]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef(_ZN5parma11VtxSelectorC2EPN3apf4MeshEPNS1_7MeshTagE+0x43)[0x127d273]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef(_ZN5parma15makeVtxSelectorEPN3apf4MeshEPNS0_7MeshTagE+0x25)[0x127d2c5]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef[0x128a0ed]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef(_ZN5parma8Balancer7balanceEPN3apf7MeshTagEd+0x3d)[0x128255d]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef(_ZN14VtxElmBalancer7balanceEPN3apf7MeshTagEd+0x3a)[0x127e83a]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef(_ZN2ph8parmaTetERNS_5InputEPN3apf5Mesh2Eb+0xf8)[0x6ff6a8]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef(_ZN4chef4bakeERP9gmi_modelRPN3apf5Mesh2ERN2ph5InputERNS7_6OutputE+0x196)[0x6e3376]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef(_ZN4chef4cookERP9gmi_modelRPN3apf5Mesh2E+0x103)[0x6e3603]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef(main+0xb5)[0x6de8c5]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f3d22d1eb45]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef[0x6dfe6f]
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 27910 on node viz003 exited on signal 6 (Aborted).

I am sidestepping the issue by setting all the partitioner selector flags to none in a subsequent run. Note that this mesh was generated in Simmetrix with No Mesh applied to two model regions and several model faces, and thus required a convert that did not abort, as noted in issue 231. However, the "fix" to convert silenced verify entirely, so we don't know that mds is completely happy with this mesh (full disclosure). I would have liked to find a way to silence only the non-manifold verification, since that check is not correct for cases with a model region that was made with the No Mesh attribute set.

KennethEJansen avatar Jun 17 '19 17:06 KennethEJansen

Looks like this is more than a PARMA crash. With partition turned off I get this:

kjansen@viz003: /projects/tools/Models/JF_TunnelDasha/FullRoom/ChefFlangeRails/16-1-Chef $ ./runChef.sh  16
PUMI Git hash 2.2.0
PUMI version 2.2.0 Git hash 48804e0344328584417a90ee8366c48c98e21e52
"../simMeshToMdsMesh/geomTranslated.smd" and "../simMeshToMdsMesh/geomTranslated.smd" loaded in 0.256165 seconds
mesh ../simMeshToMdsMesh/mdsMesh/ loaded in 142.409646 seconds
number of tet 135321496 hex 0 prism 0 pyramid 0
mesh entity counts: v 22766253 e 158255604 f 270810837 r 135321496
mesh verified in 371.629586 seconds
planned Zoltan split factor 16 to target imbalance 1.010000 in 591.782465 seconds
mesh expanded from 1 to 16 parts in 93.909752 seconds
mesh migrated from 1 to 16 in 3177.871405 seconds
mesh reordered in 88.657689 seconds
max vertex load imbalance of partitioned mesh = 1.020663
ratio of sum of all vertices to sum of owned vertices = 1.014785
max region (3D) or face (2D) load imbalance of partitioned mesh = 1.000425
i < s.n failed at /projects/tools/SCOREC-core/core/mds/apfMDS.cc + 331 
signal 6 caught by pcu
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef(reel_trace+0x13)[0x13cdf63]
/projects/tools/SCOREC-core/build-14-190604dev_omp110/test/chef[0x13cdfc5]

and that chunk of code is

 326     MeshEntity* getUpward(MeshEntity* e, int i)
 327     {
 328       mds_set s;
 329       mds_id id = fromEnt(e);
 330       mds_get_adjacent(&(mesh->mds),id,mds_dim[mds_type(id)] + 1,&s);
 331       PCU_ALWAYS_ASSERT(i < s.n);
 332       return toEnt(s.e[i]);
 333     }
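
The assertion on line 331 fires when a caller asks for upward adjacency `i` of an entity that has `i` or fewer upward adjacencies. A minimal, self-contained sketch of that condition and of a guarded lookup is given here. `ToyMesh`, `tryGetUpward`, and `makeExample` are hypothetical stand-ins, not the MDS or APF API, and the example mesh is made up. A face bordering an unmeshed region having only one upward region is one plausible way to hit this, but this thread does not establish that it is the cause here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy mesh: each face stores the ids of the regions it bounds.
// This is NOT the MDS data structure, only an illustration of the bounds check.
struct ToyMesh {
  std::vector<std::vector<int> > faceToRegions;

  // Number of upward (region) adjacencies of a face.
  int countUpward(int face) const {
    assert(face >= 0 &&
           static_cast<std::size_t>(face) < faceToRegions.size());
    return static_cast<int>(faceToRegions[face].size());
  }

  // Guarded lookup: returns -1 instead of aborting when index i is out of
  // range, i.e. when the condition `i < s.n` from the listing would fail.
  int tryGetUpward(int face, int i) const {
    if (i < 0 || i >= countUpward(face)) return -1;
    return faceToRegions[face][i];
  }
};

// Two-face example: face 0 is interior (two upward regions); face 1 borders
// an unmeshed region and therefore has only one upward region.
inline ToyMesh makeExample() {
  ToyMesh m;
  m.faceToRegions.push_back(std::vector<int>());
  m.faceToRegions[0].push_back(10);
  m.faceToRegions[0].push_back(11);
  m.faceToRegions.push_back(std::vector<int>());
  m.faceToRegions[1].push_back(10);
  return m;
}
```

With this toy mesh, asking for the second upward region of face 1 returns -1, where an unguarded `getUpward`-style call would trip the assertion. Code that assumes two upward regions per interior face would need a count check like this before indexing.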

KennethEJansen avatar Jun 17 '19 17:06 KennethEJansen

The good news is that setting elementsPerMigration 100000 (up by a factor of 10 from the previous run) reduced the migration step from 7911 seconds to 3177 seconds, so retries are improved (still a long wait).
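
For comparing runs like these, the timings can be pulled straight out of the "mesh migrated ... in N seconds" lines. The sketch below is a small helper written for this comparison, not part of chef or PUMI; `secondsFromLogLine` and `speedup` are hypothetical names. Applied to the two logs above it gives a speedup of roughly 2.5x.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Extract the number that directly precedes " seconds" in a log line.
// Returns -1.0 when the line has no such field or the field is not numeric.
inline double secondsFromLogLine(const std::string& line) {
  const std::string suffix = " seconds";
  const std::size_t end = line.rfind(suffix);
  if (end == std::string::npos || end == 0) return -1.0;
  // The numeric token starts after the last space before the suffix.
  const std::size_t space = line.rfind(' ', end - 1);
  const std::size_t first = (space == std::string::npos) ? 0 : space + 1;
  try {
    return std::stod(line.substr(first, end - first));
  } catch (...) {
    return -1.0;
  }
}

// Ratio of the earlier timing to the later one (>1 means the later run
// was faster).
inline double speedup(double before, double after) {
  assert(before > 0.0 && after > 0.0);
  return before / after;
}
```

Feeding it the two "mesh migrated from 1 to 16 in ... seconds" lines from the logs and passing the results to `speedup` reproduces the comparison made above.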

KennethEJansen avatar Jun 17 '19 17:06 KennethEJansen

@cwsmith, please let me know if this issue is something I should take a look at. I have not run PHASTA before, so your input will be appreciated.

seegyoung avatar Jun 17 '19 17:06 seegyoung

I have another route to try, so this is not yet urgent from my perspective. Heather just told me how to delete the model regions that I applied the No Mesh attribute to, so this may sidestep the problem in the short term. In the long term, I guess SCOREC will have to decide if they want to support meshes on models where one or more regions are present in the model but not meshed.

KennethEJansen avatar Jun 17 '19 17:06 KennethEJansen

With the model regions deleted and abort_on_error set back to true (and recompiled, of course), convert and Chef are happy on a "scout" mesh (no boundary layers dropped the mesh size to 5M). I expect the 135M mesh with BLs will also go through fine. After Cameron gets some fires put out, we should discuss whether SCOREC/core wants to support the No Mesh option or not.

KennethEJansen avatar Jun 17 '19 19:06 KennethEJansen