Zero-sized message for barrier triggers CODES bug?

Open ptaffet opened this issue 4 years ago • 0 comments

TraceR seems to implement MPI_Barrier as a zero-byte allreduce (https://github.com/hpcgroup/TraceR/blob/develop/tracer/reader/otf2_reader.C#L583), which seems like a reasonable implementation. However, at least the fat tree model of CODES doesn't handle zero-byte messages very well.

For example, consider this snippet from https://github.com/codes-org/codes/blob/master/src/networks/model-net/fattree.c#L1811

  if((cur_entry->msg.packet_size % s->params->chunk_size) && (cur_entry->msg.chunk_id == num_chunks - 1)) {
    ts += s->params->head_delay * (cur_entry->msg.packet_size % s->params->chunk_size);
  } else {
    bf->c12 = 1;
    ts += s->params->head_delay * s->params->chunk_size;
  }

If packet_size == 0, the first mod expression evaluates to zero, i.e. false, so the else branch runs and a zero-byte message is treated like a message of chunk_size bytes. This is not so bad, but it means that sending e.g. a 10-byte message is substantially faster than sending a 0-byte message, which is counterintuitive and probably not intended.
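To make the asymmetry concrete, here is a minimal standalone sketch of the branch quoted above, reduced to the last chunk of a packet. The function name `head_delay_term` and the example values are my own assumptions for illustration. This is not the CODES code itself.

```c
#include <stdint.h>

/* Hypothetical stand-in for the head-delay branch in fattree.c, evaluated
 * for the last chunk of a packet. Names and units are illustrative only. */
static double head_delay_term(uint64_t packet_size, uint64_t chunk_size,
                              double head_delay)
{
    uint64_t remainder = packet_size % chunk_size;
    if (remainder != 0) {
        /* Partial last chunk: charged for the remainder only. */
        return head_delay * (double)remainder;
    }
    /* Remainder is zero: charged for a full chunk. This is also the
     * path taken when packet_size == 0. */
    return head_delay * (double)chunk_size;
}
```

With an assumed chunk_size of 32 and head_delay of 1.0, `head_delay_term(0, 32, 1.0)` gives 32.0 while `head_delay_term(10, 32, 1.0)` gives 10.0, so the empty message is charged more than the 10-byte one.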

I think the easiest way to fix this is to change the line in otf2_reader to implement MPI_Barrier as a small message, maybe 128 bytes. I don't have a good sense for what is realistic.

ptaffet, Apr 28 '21 14:04