Implements a Loop Fusion Transformation
Adds a Loopy-flavored loop-fusion transformation following the approach described in https://doi.org/10.1007/3-540-57659-2_18.
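For reviewers less familiar with the idea, here is a minimal plain-Python sketch of what loop fusion does in general. It does not use Loopy's API, and the function names `unfused` and `fused` are made up for illustration: two loops over the same index range are merged into one, so the intermediate array no longer has to be materialized.

```python
def unfused(a):
    """Two separate loops; the intermediate list b is stored in full."""
    n = len(a)
    b = [0] * n
    c = [0] * n
    for i in range(n):
        b[i] = 2 * a[i]
    for i in range(n):
        c[i] = b[i] + 1
    return c


def fused(a):
    """The same computation with both loop bodies in a single loop."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        b_i = 2 * a[i]  # intermediate value stays local to the iteration
        c[i] = b_i + 1
    return c
```

Fusing is only legal here because iteration `i` of the second loop depends solely on iteration `i` of the first; deciding which loops can be fused without violating such dependencies is the hard part of the transformation.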
This could be rebased now that the prerequisite generate_loop_schedule_v2 is in.
FYI @kaushikcfd, while I was browsing through this code the other day trying to understand a warning that was being emitted (which turned into inducer/meshmode#453), I spotted a few opportunities to avoid recomputation and speed things up a fair amount in get_kennedy_unweighted_fusion_candidates. Specifically, the repeated calls I noticed were to _get_partial_loop_nest_tree_for_fusion, _get_ldg_nodes_from_loopy_insn, and (I think, I need to revisit and confirm) get_insn_access_map. If I can find some time this week, I'll finish my changes and push them for you to take a look at.
@majosm: Thanks for pointing out the potential bottlenecks. I've memoized those routines.
Pushed some cosmetic fixes. This was complex to review, but I think I've got a decent understanding of it now. LGTM, in it goes!