Overlap communication with computation in multiply_module
Fixes #265
From @davidbowler: the compute kernels are only being called on kpart [2 : end]. To fix this, call the compute kernels on kpart - 1 inside the loop, then call them once more after the loop on the last kpart.
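The fix described above amounts to shifting the compute call back by one iteration. Here is a minimal Python sketch of that loop structure, assuming hypothetical `fetch` and `compute` callables that stand in for the communication and the compute kernels. These names are not taken from multiply_module, and `fetch` is synchronous here, so the sketch shows only the ordering of the calls, not real overlap.

```python
def pipelined_multiply(parts, fetch, compute):
    """Fetch partition k, then compute on partition k - 1.

    `fetch` and `compute` are hypothetical stand-ins for the
    communication step and the compute kernels.
    """
    results = []
    previous = None
    have_previous = False
    for part in parts:
        # In a real overlap this would be a non-blocking receive.
        current = fetch(part)
        if have_previous:
            # Compute on the partition fetched in the previous iteration.
            results.append(compute(previous))
        previous = current
        have_previous = True
    if have_previous:
        # One extra call after the loop, so the last partition is not skipped.
        results.append(compute(previous))
    return results
```

With this shape every partition is computed exactly once: the loop body handles all but the last one, and the call after the loop handles the last.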
I think this can be reviewed now. I'll produce some profiles to see whether we gained anything when I'm back from the holidays; I don't think I'll have time tomorrow.
No performance improvement is seen; if anything, there's a small degradation (see below). I think I understand why: the problem is the order in which communications are received, not the time they take.
I will not merge this for now, and will instead investigate optimising the order in https://github.com/OrderN/CONQUEST-release/tree/ic-mm-comms-optimise-order . If that works, we can revisit overlapping comms with computation for further improvement.