Summary

This PR adds alternative types and specializations to structs and templates that involve the description of tensor bounds and strides. These alternatives and specializations represent these offsets, sizes, and strides statically though template arguments and static const values. This representation allows the calculation of these values to be inlined by the compiler when indexing tensors within a view.

The Problem being Addressed

When indexing tensors from a view, even if the layout and indexes are predetermined at compile time, the compiler uses the member arrays of TensorRef and TensorTileRef to calculate the offsets of the indexed data within the view. This is the case even with aggressive optimizations, and so leaves performance on the table by forcing the values in the TensorRef's arrays to be loaded to calculate each index.

To eliminate this additional loading, the information contained within these arrays (strides,offsets,sizes) would need to be represented statically not only within TensorRef, but within every function, type, and template used to get from an index operation all the way to its return. The types included in this PR serve this purpose by providing alternatives along each link in this chain of abstractions, allowing for both indexes and layouts to be represented statically or non-statically in any combination (including within expression templates) while preserving functionality.

Added Types

StaticTensorIndex : Allows users to provide a compile-time representation of tensor indexes
[helper type] StaticTensorIndexInner: Used by StaticTensorIndex so that templates which specialize based off of StaticTensorIndex but not the internal types/values can simply insert a typename parameter when referring to its specialization
StaticTensorRef: The return type when indexing tensors in a view with a StaticLayout using StaticTensorIndexs
[helper type] RefBridge: Prior to this PR, specializations of TensorRegister only provide implementations of load_ref and store_ref to TensorRefs of the appropriate dimensionality. To extend these methods to include StaticTensorRef while maintaining this constraint, this inner type is required.
StaticTensorTileRef: Used internally by StaticTensorRef and behaves as if it was a TensorTileRef, apart from not having modifiable begin and size arrays
[helper type] StaticIndexArray: Used internally by StaticTensorRef and StaticTensorTileRef it is a wrapper type around a camp::int_seq that pretends to be an array containing its argument list as values. By using this type, functions originally meant to work with TensorRef and TensorTileRef (and which don't need to modify the underlying data) can treat their static alternatives identically. In cases where array data does need to be altered, helper types such as SetStaticIndexArray allow for the creation of new specializations of StaticIndexArray with certain elements set to different values.
[specialization] TensorIndexTraits<StaticTensorIndex<...>>: Required for other code to treat StaticTensorIndex as though it were TensorIndex
[specialization] MergeRefTile<StaticTensorRef<...>,StaticTensorTile<...>,camp::int_seq<...>>: Implements compile-time merging of refs and tiles only if both are static
StaticTensorTileExec: Unrolls the iteration that comes with executing a body upon a tile, used when the tile is a StaticTensorTile
[specialization] ViewReturnHelper: Peforms indexing of views with StaticLayoutBase_impl as a layout type and StaticTensorIndex as the index type at compile time

Testing

These changes have been tested - and completely passed - for the following platforms:

toss3(quartz) : gcc, clang, icpc
blueos(lassen): nvcc_gcc, nvcc_xl

Performance

Preliminary testing has shown performance approaching 0.70 to 0.85 times the performance achieved with intrinsics (as opposed to the ~0.5 times performance seen without the use of these static facilities).

It seems likely that further development could produce additional benefits. This is because the 0.85-times performance corresponds to the use of static types for all layouts and indexes except for the layout of the assigned view in an expression template. Making this last layout static decreases the performance to 0.70 times the intrinsic performance. However, making any other value non-static provides even worse performance - though nothing worse than the base non-static performance.

This seems to indicate some optimization is yet required. Nonetheless - even without this optimization - a 1.4-to-1.7 times speedup is an improvement.

Design review

Most of the types added are for internal use, as they mirror the functionality of internal types themselves. However, the StaticTensorIndex template struct should be scrutinized for usability. Currently, to aid in the construction of StaticTensorIndex specializations, the static template methods static_range and static_all have been added to the TensorIndex type. These template methods mirror the usage of the range and all methods, but their inputs are provided as template arguments.

An alternative implementation would be to make range and all template methods, give them normal behavior when template arguments are not provided, and have them act like static_range and static_all if they are. While this is feasible with the use of arrow notation to change the return type based off of the specialization of the method, this change may make debugging more difficult for developers if the function is not used properly. Particularly, making these methods templates increases the likelihood developers would run into unfriendly mountains of template error messages if they do not use such methods properly.

Jun 29 '22 03:06 braxtoncuneo

I'll try to give this a more in-depth review, but as a start we definitely need some tests for this code. You mention that it has been tested, but it doesn't seem to be hooked into any of the raja tests yet.

Jun 29 '22 22:06 trws

sigh sorry about that, misclick 🤦

Jun 29 '22 22:06 trws

I'll try to give this a more in-depth review, but as a start we definitely need some tests for this code. You mention that it has been tested, but it doesn't seem to be hooked into any of the raja tests yet.

@braxtoncuneo and I will go over these changes at the next RAJA meeting 7/5. Hopefully you and @ajkunen can make it, and give us some feedback.

We still need to add tests, but Braxton had checked that the changes do not break existing tests.

Jun 29 '22 23:06 rchen20

That sounds great, I intend to be around for that meeting. The main question I’ll be thinking about when looking at this is how much of this we can fold out so users get this functionality automatically from the existing types when specifying static bounds. I don’t think we’ll get all of it, so the added types are probably a good idea, but I think some of the duplication might be able to fold out with some judicious use of C++14-extended constexpr.

Jun 29 '22 23:06 trws

This latest push rebases off of the most recent version of develop and folds some minor commits into larger ones.

Aug 26 '22 23:08 braxtoncuneo

This latest push rebases off of the most recent version of develop and folds some minor commits into larger ones.

@braxtoncuneo Is it safe for me to make more test modifications to this branch now?

Aug 26 '22 23:08 rchen20

@ajkunen @trws Existing tests have been updated to use the new StaticLayout with vectorization. Please take a look at this PR when you have a chance.

Sep 10 '22 00:09 rchen20

@ajkunen @trws @davidbeckingsale Would you mind taking another look at this? Thanks!

Sep 20 '22 18:09 rchen20

This looks OK to me. I would like to get one or two more approvals before merging. Please take a look and approve or comment on changes you think should be made. Thank you.

Oct 12 '22 16:10 rhornung67

@MrBurmark @ajkunen @trws do you want to take another look before merging this?

Oct 12 '22 17:10 rhornung67

Added Static Versions of TensorRef, TensorTileRef, VectorIndex, etc

Summary

The Problem being Addressed

Added Types

Testing

Performance

Design review