Study performance impact of not routing array broadcasts through PE 0
Currently, essentially all array broadcasts go through PE 0: the message is sent there first, and PE 0 then sends it out to the array. (The zero copy post pathway is an exception; it already uses the scheme proposed here.) Instead, the originator could send a small message to PE 0 requesting a serialization number and, once that number comes back, do the broadcast itself from the originating PE/object. This adds an extra hop, albeit with less of a payload (two tiny messages vs. one message carrying the data to be broadcast), but it has several benefits: less reliance on PE 0 and less risk of it becoming a bottleneck; less network load, since different spanning trees will be used; and the ability for PEs/Nodes to process local broadcasts while waiting on remote ones (right now only PE/Node 0 gets this benefit; the change would extend it to the originator, which could be any PE/Node).
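To make the proposed flow concrete, here is a minimal sketch in plain C++ (not Charm++ code). It assumes the serialization number is used to deliver broadcasts in the same order on every PE; `Sequencer`, `Broadcast`, `PeInbox`, and `demo_out_of_order` are hypothetical names for illustration only. PE 0 only hands out numbers, while the payload travels directly from the originators.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for PE 0's reduced role: it only hands out
// serialization numbers. One call models the request/reply message pair.
struct Sequencer {
    std::uint64_t next = 0;
    std::uint64_t request() { return next++; }
};

// A broadcast as sent by the originating PE, tagged with its number.
struct Broadcast {
    int origin_pe;
    std::uint64_t seq;
    int payload;
};

// Assumption: each PE delivers broadcasts strictly in sequence order and
// buffers any that arrive before their predecessors.
class PeInbox {
public:
    void receive(const Broadcast& b) {
        assert(b.seq >= expected_);  // a number is never delivered twice
        pending_.push_back(b);
        drain();
    }
    const std::vector<int>& delivered() const { return delivered_; }

private:
    // Deliver every buffered broadcast whose turn has come.
    void drain() {
        bool progressed = true;
        while (progressed) {
            progressed = false;
            for (std::size_t i = 0; i < pending_.size(); ++i) {
                if (pending_[i].seq == expected_) {
                    delivered_.push_back(pending_[i].payload);
                    pending_.erase(pending_.begin() +
                                   static_cast<std::ptrdiff_t>(i));
                    ++expected_;
                    progressed = true;
                    break;
                }
            }
        }
    }

    std::uint64_t expected_ = 0;
    std::vector<Broadcast> pending_;
    std::vector<int> delivered_;
};

// Two originators (PEs 3 and 5) broadcast; a third PE sees the messages
// arrive in the opposite order from their serialization numbers.
std::vector<int> demo_out_of_order() {
    Sequencer pe0;
    Broadcast a{3, pe0.request(), 100};  // gets number 0
    Broadcast b{5, pe0.request(), 200};  // gets number 1
    PeInbox inbox;
    inbox.receive(b);  // early arrival, buffered
    inbox.receive(a);  // unblocks both
    return inbox.delivered();
}
```

In this model the receiving PE still delivers payload 100 before payload 200 even though they arrived in the other order, which is the ordering guarantee PE 0 provides today by being the single root.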
One potential wrinkle is memory footprint, since each PE/Node would now have to store its portion of multiple spanning trees, each rooted at a different PE/Node.
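As a rough illustration of what "its portion" of a tree means, the sketch below computes one PE's parent and children in a k-ary tree rooted at an arbitrary PE. This is an assumed tree shape (ranks rotated so the root has relative rank 0), not the spanning tree construction Charm++ actually uses, and `TreeSlice` and `tree_slice` are hypothetical names. Caching one such slice per root is where the extra memory would come from; whether the real trees are cheap enough to recompute on demand instead is an open question.

```cpp
#include <cassert>
#include <vector>

// One PE's view of a single spanning tree: its parent and its children.
struct TreeSlice {
    int parent;                 // -1 for the root
    std::vector<int> children;  // at most `fanout` entries
};

// Hypothetical k-ary tree over PEs 0..num_pes-1, rooted at `root`.
// Ranks are rotated so that the root has relative rank 0.
TreeSlice tree_slice(int my_pe, int root, int num_pes, int fanout) {
    assert(num_pes > 0 && fanout > 0);
    assert(my_pe >= 0 && my_pe < num_pes && root >= 0 && root < num_pes);

    const int rel = (my_pe - root + num_pes) % num_pes;

    TreeSlice s;
    s.parent = (rel == 0) ? -1 : ((rel - 1) / fanout + root) % num_pes;
    for (int c = 1; c <= fanout; ++c) {
        const int child_rel = rel * fanout + c;
        if (child_rel < num_pes) {
            s.children.push_back((child_rel + root) % num_pes);
        }
    }
    return s;
}
```

For example, with 8 PEs, a fanout of 2, and the tree rooted at PE 3, PE 5 has parent 3 and children 0 and 1, while PE 0 is a leaf whose parent is PE 5.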
Another benefit would be that if the serialization number is made atomic, the comm thread can update the number without having to go through the PE 0 scheduler. Currently, if PE 0 is heavily loaded, the broadcast can be delayed by that.
It actually is an atomic right now, but I don't think the code exploits that (i.e. it's not an [immediate], and it's also not housed in a NodeGroup). So that's also something that would be worth investigating and fixing.
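For reference, this is the property an atomic counter gives: any thread can claim the next number with a single `fetch_add`, with no scheduler or lock involved. The sketch below is standalone C++ using only `std::atomic` and `std::thread`; it does not use the Charm++ comm thread or the actual counter, and `claim_concurrently` is a hypothetical helper.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Several threads claim serialization numbers from one atomic counter.
// Returns every claimed number, sorted, so the caller can check that the
// numbers are unique and gap-free.
std::vector<std::uint64_t> claim_concurrently(std::size_t threads,
                                              std::size_t per_thread) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::vector<std::uint64_t>> local(threads);
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, &local, t, per_thread] {
            for (std::size_t i = 0; i < per_thread; ++i) {
                // fetch_add returns the previous value, so each caller
                // gets a distinct number even under contention.
                local[t].push_back(
                    counter.fetch_add(1, std::memory_order_relaxed));
            }
        });
    }
    for (auto& w : workers) w.join();

    std::vector<std::uint64_t> all;
    for (const auto& v : local) all.insert(all.end(), v.begin(), v.end());
    std::sort(all.begin(), all.end());
    return all;
}
```

Relaxed ordering is enough here because only the uniqueness of the numbers matters in this sketch; a real implementation might need stronger ordering if other state is published alongside the number.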