ma_refine_number leads to crash
This was the most stable phastaChef case of the many we have tried. Then, after more than 200 adapt cycles I get this:
```
rank 5261 badflag xyz -0.001536249417 0.006062729008 0 -0.001519191546 0.006013116779 0
rank 5261 badflag xyz -0.00144037133 0.006048323858 0 -0.001536249417 0.006062729008 0
rank 3061 badflag xyz -0.001519190288 0.006013112085 0.025 -0.001488309744 0.006055524086 0.025
rank 3061 badflag xyz -0.001488309744 0.006055524086 0.025 -0.001584186574 0.006069924541 0.025
expected tag "ma_refine_number" on entity type 2
expected tag "ma_refine_number" on entity type 2
signal 6 caught by pcu
signal 6 caught by pcu
```
The first four lines indicate a new failure in checkFlagConsistency. You might recall we had a problem with tags not matching for coarsening, so Cameron helped me write a hack that, rather than aborting on an unmatched tag, makes the tags match. This allowed us to do the 200 successful adapts. Full disclosure: somewhere along the way I also got failures from non-matching refinement tags, so I expanded the hack to deal with those (it seemed to be a rare event). With this new failure, I decided to fix ANY unmatched tag like this:
```cpp
if(value != getFlag(a,e,flag)){ //detect mismatch
  if(1) { // for COLLAPSE we will clear inconsistency and carry on
  // if(flag==SPLIT || flag==COLLAPSE) { // for COLLAPSE we will clear inconsistency and carry on
    entFlagsToFlip.push_back(e); //buffer the mismatched ents
    if(getFlag(a,e,flag) != 1) clearFlag(a,e,flag);
    Entity* v[2]; // find and print the coordinates to stderr so we know where fixes applied
    m->getDownward(e,0,v);
    Vector x1 = getPosition(m, v[0]);
    Vector x2 = getPosition(m, v[1]);
    fprintf(stderr, "rank %d badflag xyz %15.10g %15.10g %15.10g %15.10g %15.10g %15.10g \n", PCU_Comm_Self(), x1.x(), x1.y(), x1.z(), x2.x(), x2.y(), x2.z());
  } else {
    ok = false; // for other flags maintain the checkAndQuit approach
  }
}
```
Of course, the real fix is below this, but by using the if(1) I think I should now be requesting that any mismatched tag get fixed.
This did not work, though, so it appears this band-aid doesn't extend to the new issue. Can someone familiar with the code clear up what is going on, so that hopefully Cameron and I can figure out if another fix is possible?
@KennethEJansen Can you paste the full code of the hack (including the PCU send/recv portion)?
The print statement could also include the flag name: https://github.com/SCOREC/core/blob/4c4ba244022cce58d192de4834db2e9861fc924d/apf/apfMesh.h#L275 as done here: https://github.com/SCOREC/core/blob/59d03165714ac2c2069f4de6a3d08bd3ba041d20/apf/apfMesh.cc#L1140-L1141
Full code hack as per cws request:
```cpp
bool checkFlagConsistency(Adapt* a, int dimension, int flag)
{
  Mesh* m = a->mesh;
  apf::Sharing* sh = apf::getSharing(m);
  PCU_Comm_Begin();
  Entity* e;
  Iterator* it = m->begin(dimension);
  while ((e = m->iterate(it))) {
    apf::CopyArray others;
    sh->getCopies(e, others);
    if (!others.getSize())
      continue;
    bool value = getFlag(a, e, flag);
    APF_ITERATE(apf::CopyArray, others, rit) {
      PCU_COMM_PACK(rit->peer, rit->entity);
      PCU_COMM_PACK(rit->peer, value);
    }
  }
  m->end(it);
  PCU_Comm_Send();
  bool ok = true;
  std::vector<Entity*> entFlagsToFlip;
  while (PCU_Comm_Receive()) {
    PCU_COMM_UNPACK(e);
    bool value;
    PCU_COMM_UNPACK(value);
    if(value != getFlag(a,e,flag)){ //detect mismatch
      if(1) { // for COLLAPSE we will clear inconsistency and carry on
      // if(flag==SPLIT || flag==COLLAPSE) { // for COLLAPSE we will clear inconsistency and carry on
        entFlagsToFlip.push_back(e); //buffer the mismatched ents
        if(getFlag(a,e,flag) != 1) clearFlag(a,e,flag);
        Entity* v[2]; // find and print the coordinates to stderr so we know where fixes applied
        m->getDownward(e,0,v);
        Vector x1 = getPosition(m, v[0]);
        Vector x2 = getPosition(m, v[1]);
        fprintf(stderr, "rank %d badflag xyz %15.10g %15.10g %15.10g %15.10g %15.10g %15.10g \n", PCU_Comm_Self(), x1.x(), x1.y(), x1.z(), x2.x(), x2.y(), x2.z());
      } else {
        ok = false; // for other flags maintain the checkAndQuit approach
      }
    }
  }

  //begin another communication round
  PCU_Comm_Begin();
  //loop over the ents that need the flags flipped/cleared
  for(int i=0; i<entFlagsToFlip.size(); i++) {
    e = entFlagsToFlip[i]; //get the ent
    apf::CopyArray others;
    sh->getCopies(e, others);
    if (!others.getSize())
      continue;
    //this ent has remote copies
    APF_ITERATE(apf::CopyArray, others, rit) {
      //pack the remote copy
      PCU_COMM_PACK(rit->peer, rit->entity);
    }
  }
  // crashed with this still here so commenting m->end(it);
  //send all the packed messages
  PCU_Comm_Send();
  //listen for incoming messages
  while (PCU_Comm_Receive()) {
    //unpack the entity
    PCU_COMM_UNPACK(e);
    //clear the flag
    // assert(flag==COLLAPSE);
    clearFlag(a,e,flag);
  }

  delete sh;
  return ok;
}
```
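For anyone following along without a parallel build, the two communication rounds in the hack above can be modeled without PCU or a mesh. This is only a sketch under stated assumptions: `Part`, `clearMismatchedFlags`, and the integer entity ids are hypothetical stand-ins rather than SCOREC/core APIs, and message passing is replaced by plain loops over in-memory parts. It tracks a single boolean flag per entity copy, as the hack does.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one mesh part: entity id -> value of one flag.
// An entity is "shared" when the same id appears in more than one part.
typedef std::map<int, bool> Part;

// Model of the two rounds in the hacked checkFlagConsistency.
// Returns the number of mismatches detected in round 1.
int clearMismatchedFlags(std::vector<Part>& parts)
{
  // Round 1 send: each copy reports its flag value to every remote copy.
  std::vector<std::vector<std::pair<int, bool> > > inbox(parts.size());
  for (std::size_t p = 0; p < parts.size(); ++p)
    for (Part::const_iterator it = parts[p].begin(); it != parts[p].end(); ++it)
      for (std::size_t q = 0; q < parts.size(); ++q)
        if (q != p && parts[q].count(it->first))
          inbox[q].push_back(std::make_pair(it->first, it->second));

  // Round 1 receive: buffer every entity whose remote value differs.
  int mismatches = 0;
  std::vector<std::vector<int> > toClear(parts.size());
  for (std::size_t q = 0; q < parts.size(); ++q)
    for (std::size_t i = 0; i < inbox[q].size(); ++i) {
      const int id = inbox[q][i].first;
      if (inbox[q][i].second != parts[q][id]) {
        ++mismatches;
        toClear[q].push_back(id);
      }
    }

  // Round 2: each buffered entity asks all of its remote copies to clear.
  for (std::size_t q = 0; q < parts.size(); ++q)
    for (std::size_t i = 0; i < toClear[q].size(); ++i)
      for (std::size_t r = 0; r < parts.size(); ++r)
        if (r != q && parts[r].count(toClear[q][i]))
          parts[r][toClear[q][i]] = false;

  return mismatches;
}
```

In this model, a shared entity flagged on one part and unflagged on the other is reported as a mismatch by both copies and ends up cleared on both, while matching and unshared entities are left alone. The model only covers the one flag passed in, so it says nothing about whether other data on those entities (such as the `ma_refine_number` tag) stays consistent.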
I don't think fixing/changing the flag values inside checkFlagConsistency is the best way to go.
@KennethEJansen, would it be possible for you to send me the mesh (a .smb file) and the requested size field (this can be added as a field to the mesh) from right before the adapt step where the above failure happens? The reason I am asking is that I have a setup, and with that information I can start the debugging process very quickly.
I just realized that might not be feasible because you have a many-part mesh.
Yes, it is in 32768 parts.
Cameron, regarding your request for the tag, I may be confused. Is the tag you are asking for different from, or the same as, what is in flag? Previously we were checking for flag==SPLIT. I think we can easily print the value of flag, but I am unsure about the tag. If the tag is really what we want, I guess we need to know what to set t to in the code fragment provided.
I looked into this a bit more. The function you pointed to expects t to be a MeshTag. I am not sure how to get that. Guessed a few things but failed....
a->flagsTag should give you the integer tag.
I don't know if this will be helpful or not, but we have two functions:
- `int getFlags(Adapt* a, Entity* e)`, here https://github.com/SCOREC/core/blob/master/ma/maAdapt.cc#L81, which returns an int containing all the flags (i.e., COLLAPSE, SPLIT, etc.). Each bit in the integer corresponds to a flag; see https://github.com/SCOREC/core/blob/master/ma/maAdapt.h#L17 for the bit locations.
- `bool getFlag(Adapt* a, Entity* e, int flag)`, here https://github.com/SCOREC/core/blob/master/ma/maAdapt.cc#L96, which checks for a specific flag.
There are also the corresponding set functions for each of the above.
So, it might be helpful to use the int value (obtained by the first function) instead of the bool (obtained by the second function) when you force the flags to match on the shared copies of an entity in the hacked code. That way you can make sure all the flags are matched for a given entity.
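To illustrate the difference between comparing one flag and comparing the whole flag integer, here is a small standalone sketch. It does not use MeshAdapt: `FlagStore`, `allFlags`, `hasFlag`, `setAllFlags`, and `differingFlags` are hypothetical stand-ins for the real functions. The values SPLIT = 1 and COLLAPSE = 4 are the ones that show up as `badflag 1` and `badflag 4` in this thread; the other two bit positions are assumptions and should be checked against maAdapt.h.

```cpp
#include <cassert>
#include <map>

// Flag bits. SPLIT and COLLAPSE match the values printed in this thread;
// DONT_SPLIT and DONT_COLLAPSE positions are assumed for illustration.
enum {
  SPLIT         = 1 << 0,
  DONT_SPLIT    = 1 << 1,
  COLLAPSE      = 1 << 2,
  DONT_COLLAPSE = 1 << 3
};

// Hypothetical stand-in for the per-entity integer tag: entity id -> bits.
typedef std::map<int, int> FlagStore;

// Analogue of getFlags: every flag bit stored on the entity (0 if none).
int allFlags(const FlagStore& s, int e)
{
  FlagStore::const_iterator it = s.find(e);
  return it == s.end() ? 0 : it->second;
}

// Analogue of getFlag: is this one bit set?
bool hasFlag(const FlagStore& s, int e, int flag)
{
  return (allFlags(s, e) & flag) != 0;
}

// Analogue of the corresponding set function: overwrite all bits at once.
void setAllFlags(FlagStore& s, int e, int flags)
{
  s[e] = flags;
}

// Bits on which two copies of the same entity disagree.
int differingFlags(int mine, int theirs)
{
  return mine ^ theirs;
}
```

Two copies can agree on SPLIT and still disagree on other bits, which a per-flag bool comparison would not see. Comparing the full integers, or looking at `differingFlags`, exposes every mismatched bit in one pass.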
I added the writing of flag after "badflag" and this is what I get.
```
MeshAdapt: version 2.0 !
MeshAdapt: iteration 0
rank 5589 badflag 4 xyz -0.002586218218 0.009651897791 0.0008164286753 -0.002586218218 0.009651897791 0.0008114288737
rank 4734 badflag 4 xyz 0.002586218218 0.009651897791 0.000816432012 0.002586218218 0.009651897791 0.000811424012
rank 6745 badflag 4 xyz 0.002586218218 0.009651897791 0.000816432012 0.002586218218 0.009651897791 0.000811424012
rank 5589 badflag 4 xyz -0.002586218218 0.009651897791 0.0008164286753 -0.002586218218 0.009651897791 0.0008114288737
MeshAdapt: coarsened 1508689 edges in 644.529662 seconds
rank 5261 badflag 1 xyz -0.001536249417 0.006062729008 0 -0.001519191546 0.006013116779 0
rank 5261 badflag 1 xyz -0.00144037133 0.006048323858 0 -0.001536249417 0.006062729008 0
rank 3061 badflag 1 xyz -0.001519190288 0.006013112085 0.025 -0.001488309744 0.006055524086 0.025
rank 3061 badflag 1 xyz -0.001488309744 0.006055524086 0.025 -0.001584186574 0.006069924541 0.025
expected tag "ma_refine_number" on entity type 2
expected tag "ma_refine_number" on entity type 2
signal 6 caught by pcu
signal 6 caught by pcu
```
From looking at the flag definitions that Morteza pointed me to, the later messages, the ones just before the crash, seem to be related to SPLIT, which I think I had successfully survived before. So perhaps this is not the issue, and it is just a coincidence that these flags are getting triggered just before the crash?
Or maybe I was just lucky that earlier runs survived after I added the SPLIT case to the COLLAPSE case for making tags match?