PSyclone (Closes #1744) Otter tracing linking

This is the first commit towards the ability to use Otter.

So far I added the otter nodes & tests for them. At the moment they do no validation - its in theory tricky to do since they don't have to be in the same function call as the OtterTraceSetupNode - which in of itself is a bit tricky to have PSyclone apply as it probably should be done in the algorithm layer (@rupertford maybe we have to do this manually for now).

The other things that are needed are:

[x] Otter Transformations to create the otter Nodes & corresponding tests. NB some transformations need to apply multiple, e.g. OtterTaskloopTrans needs to both chunk & then apply corresponding otter nodes insidee the chunked loop.
[ ] Ensure use otter_serial is added to any place where an otter function is used.
[ ] Node validation
[ ] Documentation
[ ] pylint & pycodestyle.
[x] NemoLite2D implementation with them.

Jun 07 '22 14:06 LonelyCat124

Codecov Report

Attention: 90 lines in your changes are missing coverage. Please review.

Comparison is base (1d82fdc) 99.38% compared to head (2fd7801) 99.16%. Report is 5026 commits behind head on master.

Files	Patch %	Lines
src/psyclone/psyir/transformations/otter_trans.py	59.27%	90 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1746      +/-   ##
==========================================
- Coverage   99.38%   99.16%   -0.23%     
==========================================
  Files         250      252       +2     
  Lines       37854    38237     +383     
==========================================
+ Hits        37623    37916     +293     
- Misses        231      321      +90

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Jun 07 '22 15:06 codecov[bot]

Would the psydata api be an option for linking in otter, rather than creating specific nodes and transformations? Regarding adding things manually - yes, if things need to be set up once, or a region specified and we can't do (if not setup then setup) then we would have to add them in manually in the algorithm layer for the time being. Note, being able to add things to the algorithm layer is coming - actually I think it is already possible for the gocean api, so you might want to explore that if needed.

Jun 08 '22 20:06 rupertford

Would the psydata api be an option for linking in otter, rather than creating specific nodes and transformations?

I wasn't sure - since otter is mimicking a parallel system, it seemed like it would need to be able to do similar things, e.g. validation in nodes (for example check that a LoopIteration call is inside a Loop). As well since Otter is likely going to evolve over time its unclear what else it may require in the future so I shied away from it for now. If the Psydata API can handle those kinda things then probably it would be better in that style.

Edit: They might not be in to start with because I need a prototype running by start of next month with relatively little time, but I'd like this to be clean enough they can be added in the future.

Jun 08 '22 21:06 LonelyCat124

Would the psydata api be an option for linking in otter, rather than creating specific nodes and transformations?

I wasn't sure - since otter is mimicking a parallel system, it seemed like it would need to be able to do similar things, e.g. validation in nodes (for example check that a LoopIteration call is inside a Loop).

The validation is done in the transformation (I assume), that's independent of the node that is being inserted. So that should work out just fine. You basically just inherit from from one of the PSyData transformations (I am not sure what parameters your API would need, e.g. if you need all arguments used in the kernel, the extraction transformation might(!) provide that already. If you don't need an argument, you could use profile (or psydata directly) - just overwrite the validate() method.

My feeling is that the Profile one would be a good start, assuming that you don't need any of the variables in the otter interface. You can of course query all arguments in the transform, but Profile will create a minimum set of calls: one at the beginning, one at the end of the instrumented region:

      CALL profile_psy_data % PreStart("invoke_1_update_field", "r0", 0, 0)
      DO j = a_fld%whole%ystart, a_fld%whole%ystop, 1
        DO i = a_fld%whole%xstart, a_fld%whole%xstop, 1
          CALL update_field_code(i, j, a_fld%data, b_fld%data, c_fld%data, d_fld%data)
        END DO
      END DO
      CALL profile_psy_data % PostEnd

You would then just need to implement a wrapper library that maps from the psy_data methods to otter API calls ... if that is all you need. It might be a bit confusing to have a profile node in the tree, but on the other hand, no one really sees the tree (and you could always just have a trivial OtterNode class, which just sets appropriate names etc. The underlying psydata transformation takes a class as parameter to indicate which node to insert. The profile transformation is basically empty, it just calls the psydata constructor with the class of the node to insert into the tree.

But obviously, the PSyData API was never developed with any kind of parallelization in mind. E.g. if you need to insert different calls (something like parallel begin, omp do), it might not work (easily), since it is assumed that each psydata method does one and the same thing,

Still, then the implementation of the psydata node might be an easy template on how to create the calls :)

Jun 08 '22 23:06 hiker

The validation is done in the transformation (I assume) My design idea had been to mirror OpenMP, which all does validation at codegen time (avoiding an ordering of transformations). I could probably find a way to do this inside the OtterNode class though maybe though I'm not sure how nice it'd be (since i'd have to search for OtterNode ancestors with the appropriate function call).

I think Otter does plan to support parallelisation, but i'm not sure what you mean - this might be an issue either way for PSyclone as I'm not sure if Calls are allowed inside parallel code.

Another possible issue with PSyData might be this - I had based my implementation off psydata node, but the way Psydata node creates calls wasn't working for me as my calls require non-Fortran standard arguments (or at least Sergi and Andy believed this was the cause). This section of code:

            ParserFactory().create(std="f2008")
            reader = FortranStringReader(
                f"CALL {typename}%{methodname}{argument_str}")
            # Tell the reader that the source is free format
            reader.set_format(FortranFormat(True, False))
            fp2_node = Fortran2003.Call_Stmt(reader)
            return CodeBlock([fp2_node], CodeBlock.Structure.STATEMENT,
                             annotations=annotations)

was not working when I used arguments __FILE__ and __LINE__ as fparser doesn't accept variable names that start with _ as they're not Fortran standard compliant (in this case __FILE__ and __LINE__ are both compiler extensions to do with the preprocessor).

I think PSyData is probably the correct way to want to implement this, but as it is now I think there are a few issues and I don't have time to look into them before the demo next month. I'll also try to have a talk with the Otter devs and see what their plans are moving forward.

Jun 09 '22 07:06 LonelyCat124

Make progress in whatever way works best for you, given the timeline. It can be revisited later. I must admit I'm not sure whether psydata will give you fine-grain enough control over placement of calls if you need to be specific.

Jun 09 '22 11:06 rupertford

One thing I did forget is I need a taskwait-like transformation for this as well.

Edit: This implementation is definitely wrong, working on it in NemoLite2D for now.

Jun 09 '22 18:06 LonelyCat124

Ok, so since Otter uses pre-chunked loops there were complications, some of which may be an issue actually in OpenMP as well.

The first is loop variables need to be ignore in dependency analysis, since their side effects are not allowed to be carried outside of the loop in PSyclone (is this a reasonable thing to say @arporter ? e.g. a loop is always over dimensions in an area of the code? So if we have two loops over some variable contained within other loops (e.g. multidimensional loops) there is never a dependency due to the loop variable). So these are skipped when doing dependency analysis here, and maybe should be for omp_taskwait as well #1338 - perhaps multidimensional loop tests would catch this bug there...

                    elif child is not node and isinstance(child, Loop):
                        loop_ignore_vars.append(str(child.variable.name))

...

                if str(sig1) in loop_ignore_vars:
                    continue

The other thing is the chunked loop bound is also going to potentially appear in multiple chunked loops, despite not actually being a dependency (but is written to each time). I know that the first child of a chunked loop should be this (though I don't think any code actually prevents reordering of this in the PSyIR tree...), so I also skip that when searching for dependencies:

            # This doesn't seem great...we skip the chunk loop var
            loop_ignore_vars.append(node.loop_body.children[0].children[0].symbol.name)

Jun 10 '22 14:06 LonelyCat124

I'm unsure of the details of your use case but @hiker (aided by @sergisiso) have been enhancing the dependence analysis very recently so I'm bringing them in. I've had a quick look at https://github.com/Otter-Taskification/otter to try to better understand your use case. Is it basically adding tasking-aware profiling (i.e. profiling that recognises tasking mechanisms)? Having asked that though, that doesn't seem consistent with your statement that "Otter uses pre-chunked loops" - as in, Otter assumes the loops are chunked or it chunks the loops?

Jun 10 '22 15:06 arporter

Not quite - its actually being treated akin to a parallel system (though it runs serially) to (in the future) simulate the behaviour of a task-based runtime - so while it gathers profiling information, the calls need to end up in a structure as would be valid OpenMP code (approximately, Otter has no concept of OpenMP single) if I replaced all the calls to Otter with the corresponding OpenMP calls.

Jun 10 '22 15:06 LonelyCat124

Adam has let me know of some changes in otter I need to mirror in PSyclone at some point:

I just wanted to let you know about some changes to the Otter API which I'll publish in the next few days, and which will break existing user code. These are the particular changes which will require updates to user code:

Rename otter-serial API entrypoints:
- otterParallel[Begin|End] to otterThreads[Begin|End]
- otterSynchroniseChildTasks to otterSynchroniseTasks
otterSynchroniseTasks now requires an argument to indicate whether to synchronise descendant tasks or only child tasks.

For the argument to otterSynchroniseTasks you should pass the integer value 0 to declare that a barrier synchronises immediate child tasks only, like an OpenMP taskwait barrier, or a value of 1 to synchronise all descendant tasks (like a taskgroup). I've already made (but not yet pushed) these changes to Otter's Fortran bindings.

Jun 28 '22 08:06 LonelyCat124

We're changing how this is done, closing the PR and deleting the branch as we will have a replacement.

Apr 03 '24 15:04 LonelyCat124