llvm-project [flang][OpenMP] Implement more robust loop-nest detection logic

The previous loop-nest detection algorithm fell short, in some cases, to detect whether a pair of do concurrent loops are perfectly nested or not. This is a re-implementation using forward and backward slice extraction algorithms to compare the set of ops required to setup the inner loop bounds vs. the set of ops nested in the outer loop other thatn the nested loop itself.

Jul 30 '24 04:07 ergawy

Ping! Please 🙏 take a look when you have time.

Aug 01 '24 05:08 ergawy

I do not understand the argument about mem alloc. When does this happen? Shouldn't mem2reg have removed such allocations?

This happens if cases like the following (see complete sample here):

  do concurrent(i=1:n, j=1:bar(n*m, n/m))
    a(i) = n
  end do

If you look the IR, you will see:

    fir.do_loop %arg1 = %42 to %44 step %c1 unordered {
      ...
      %53:3 = hlfir.associate %49 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
      %54:3 = hlfir.associate %52 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
      %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
      hlfir.end_associate %53#1, %53#2 : !fir.ref<i32>, i1
      hlfir.end_associate %54#1, %54#2 : !fir.ref<i32>, i1
      %56 = fir.convert %55 : (i32) -> index
      ...
      fir.do_loop %arg2 = %46 to %56 step %c1_4 unordered {
        ...
      }
    }

The problem here are the hlfir.end_associate ops. Even though the "effectively 2" loops are perfectly nested, we have these hlfir.end_associate ops that are not part of the slice responsible for computing the upper bound (in this case) of the inner loop even though they are in-practice exist only for the purpose of that computation.

Aug 02 '24 07:08 ergawy

The algorithm checks whether only "expected" instruction are present in-between code, but I think it is more relevant whether they have side-effects. That's because: ...

These are very good points that I did not consider to be honest. But my conclusion from what you mentioned about side-effects is that flang potentially emits wrong IR!

If you take again the same sample mentioned the previous comment, shouldn't flang have emitted:

      %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32

before the outermost loop (the i loop)?

Aug 02 '24 07:08 ergawy

The algorithm checks whether only "expected" instruction are present in-between code, but I think it is more relevant whether they have side-effects. That's because: ...

These are very good points that I did not consider to be honest. But my conclusion from what you mentioned about side-effects is that flang potentially emits wrong IR!

How would that happen? From a Fortran perspective there should not be any side effects in the code inside the DO CONCURRENT.

Aug 02 '24 08:08 mjklemm

How would that happen? From a Fortran perspective there should not be any side effects in the code inside the DO CONCURRENT.

flang is potentially emitting wrong IR atm (as I mentioned but for different reasons in my previous reply). Nothing is preventing bar from mutating global state (see Fortran code and corresponding HLFIR IR in this comment).

Aug 02 '24 08:08 ergawy

Just as a follow up, if you modify the loop I posted earlier to be:

  do concurrent(i=1:n, j=1:bar(n*m, n/m))
    a(i) = bar(n,m)
  end do

You get:

/tmp/test.f90:15:5: error: Impure procedure 'bar' may not be referenced in DO CONCURRENT
      a(i) = bar(n,m)
      ^^^^^^^^^^^^^^^

So, side-effects are checked only for the body of the loop and not for the bounds calcualations (which are still emitted inside the loop body).

Aug 02 '24 08:08 ergawy

From the spec:

C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct.

which is expected, but what I cannot put my hands on is what constitues "within a DO CONCURRENT"? Does it include the concurrent-header?

Aug 02 '24 13:08 ergawy

From the spec:
C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct.
which is expected, but what I cannot put my hands on is what constitues "within a DO CONCURRENT"? Does it include the concurrent-header?

I think they mean only the body. The header expressions can be evaluated once before entering any concurrent execution, so should be fine. Unfortunately flang currently doesn't seem to do so atm. Note that is make the way flang emits the loop violate this:

  do concurrent(i=1:n)
    ub = bar(n*m, n/m)
    do concurrent(j=1:ub)
      a(i) = n
    end do
  end do

since the non-pure bar is in the outer loop's body.

Aug 06 '24 12:08 Meinersbur

Update: I had a call with Kiran and Harish, and they will be working on making sure we emit code that is more consistent with the spec.

Aug 16 '24 08:08 ergawy

@Meinersbur @mjklemm Even though we still need to resolve the code-gen issue of do concurrent (as I mentioned above Kiran and Harish are looking into it), I want to move this forward for the non-controversial parts. Therefore, I removed the parts related to collecting mem free ops. This results in being a bit more conservative in detecting loop-nests (see the test file loop_nest_test.f90 for cases where we currently fail to detect perfect nests).

But at the same time, we now have better detection logic than before. And the algorithm to do so is cleaner and easier to understand.

Aug 22 '24 05:08 ergawy

@Meinersbur @mjklemm Even though we still need to resolve the code-gen issue of do concurrent (as I mentioned above Kiran and Harish are looking into it), I want to move this forward for the non-controversial parts. Therefore, I removed the parts related to collecting mem free ops. This results in being a bit more conservative in detecting loop-nests (see the test file loop_nest_test.f90 for cases where we currently fail to detect perfect nests).

But at the same time, we now have better detection logic than before. And the algorithm to do so is cleaner and easier to understand.

Ping Ping! 🔔 Please take a look when you have time 🙏

Sep 05 '24 03:09 ergawy

Thanks all for the review and discussions on this PR. I just merged https://github.com/llvm/llvm-project/pull/114020 which means that do concurrent nests are now more conformant to the spec. This is changes loop nest detection logic in the do concurrent mapping pass; actually it should make it much simpler. However, this means I probably have to abandon this PR. Time and effort was not wasted though, https://github.com/llvm/llvm-project/pull/114020 wouldn't have been introduced without this discussion.

Oct 31 '24 09:10 ergawy

Will close this in favor of https://github.com/ROCm/llvm-project/pull/199. @Meinersbur please take a look, I still cannot add you as a reviewer for some reason.

Nov 01 '24 05:11 ergawy