[flang][OpenMP] Implement more robust loop-nest detection logic
The previous loop-nest detection algorithm fell short, in some cases, to detect whether a pair of do concurrent loops are perfectly nested or not. This is a re-implementation using forward and backward slice extraction algorithms to compare the set of ops required to setup the inner loop bounds vs. the set of ops nested in the outer loop other thatn the nested loop itself.
Ping! Please 🙏 take a look when you have time.
I do not understand the argument about mem alloc. When does this happen? Shouldn't mem2reg have removed such allocations?
This happens if cases like the following (see complete sample here):
do concurrent(i=1:n, j=1:bar(n*m, n/m))
a(i) = n
end do
If you look the IR, you will see:
fir.do_loop %arg1 = %42 to %44 step %c1 unordered {
...
%53:3 = hlfir.associate %49 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
%54:3 = hlfir.associate %52 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
%55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
hlfir.end_associate %53#1, %53#2 : !fir.ref<i32>, i1
hlfir.end_associate %54#1, %54#2 : !fir.ref<i32>, i1
%56 = fir.convert %55 : (i32) -> index
...
fir.do_loop %arg2 = %46 to %56 step %c1_4 unordered {
...
}
}
The problem here are the hlfir.end_associate ops. Even though the "effectively 2" loops are perfectly nested, we have these hlfir.end_associate ops that are not part of the slice responsible for computing the upper bound (in this case) of the inner loop even though they are in-practice exist only for the purpose of that computation.
The algorithm checks whether only "expected" instruction are present in-between code, but I think it is more relevant whether they have side-effects. That's because: ...
These are very good points that I did not consider to be honest. But my conclusion from what you mentioned about side-effects is that flang potentially emits wrong IR!
If you take again the same sample mentioned the previous comment, shouldn't flang have emitted:
%55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
before the outermost loop (the i loop)?
The algorithm checks whether only "expected" instruction are present in-between code, but I think it is more relevant whether they have side-effects. That's because: ...
These are very good points that I did not consider to be honest. But my conclusion from what you mentioned about side-effects is that flang potentially emits wrong IR!
How would that happen? From a Fortran perspective there should not be any side effects in the code inside the DO CONCURRENT.
How would that happen? From a Fortran perspective there should not be any side effects in the code inside the
DO CONCURRENT.
flang is potentially emitting wrong IR atm (as I mentioned but for different reasons in my previous reply). Nothing is preventing bar from mutating global state (see Fortran code and corresponding HLFIR IR in this comment).
Just as a follow up, if you modify the loop I posted earlier to be:
do concurrent(i=1:n, j=1:bar(n*m, n/m))
a(i) = bar(n,m)
end do
You get:
/tmp/test.f90:15:5: error: Impure procedure 'bar' may not be referenced in DO CONCURRENT
a(i) = bar(n,m)
^^^^^^^^^^^^^^^
So, side-effects are checked only for the body of the loop and not for the bounds calcualations (which are still emitted inside the loop body).
From the spec:
C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct.
which is expected, but what I cannot put my hands on is what constitues "within a DO CONCURRENT"? Does it include the concurrent-header?
From the spec:
C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct.which is expected, but what I cannot put my hands on is what constitues "within a DO CONCURRENT"? Does it include the
concurrent-header?
I think they mean only the body. The header expressions can be evaluated once before entering any concurrent execution, so should be fine. Unfortunately flang currently doesn't seem to do so atm. Note that is make the way flang emits the loop violate this:
do concurrent(i=1:n)
ub = bar(n*m, n/m)
do concurrent(j=1:ub)
a(i) = n
end do
end do
since the non-pure bar is in the outer loop's body.
Update: I had a call with Kiran and Harish, and they will be working on making sure we emit code that is more consistent with the spec.
@Meinersbur @mjklemm Even though we still need to resolve the code-gen issue of do concurrent (as I mentioned above Kiran and Harish are looking into it), I want to move this forward for the non-controversial parts. Therefore, I removed the parts related to collecting mem free ops. This results in being a bit more conservative in detecting loop-nests (see the test file loop_nest_test.f90 for cases where we currently fail to detect perfect nests).
But at the same time, we now have better detection logic than before. And the algorithm to do so is cleaner and easier to understand.
@Meinersbur @mjklemm Even though we still need to resolve the code-gen issue of
do concurrent(as I mentioned above Kiran and Harish are looking into it), I want to move this forward for the non-controversial parts. Therefore, I removed the parts related to collecting mem free ops. This results in being a bit more conservative in detecting loop-nests (see the test fileloop_nest_test.f90for cases where we currently fail to detect perfect nests).But at the same time, we now have better detection logic than before. And the algorithm to do so is cleaner and easier to understand.
Ping Ping! 🔔 Please take a look when you have time 🙏
Thanks all for the review and discussions on this PR. I just merged https://github.com/llvm/llvm-project/pull/114020 which means that do concurrent nests are now more conformant to the spec. This is changes loop nest detection logic in the do concurrent mapping pass; actually it should make it much simpler. However, this means I probably have to abandon this PR. Time and effort was not wasted though, https://github.com/llvm/llvm-project/pull/114020 wouldn't have been introduced without this discussion.
Will close this in favor of https://github.com/ROCm/llvm-project/pull/199. @Meinersbur please take a look, I still cannot add you as a reviewer for some reason.