Added Static Versions of TensorRef, TensorTileRef, VectorIndex, etc
Summary
This PR adds alternative types and specializations to structs and templates that involve the description of tensor bounds and strides. These alternatives and specializations represent these offsets, sizes, and strides statically though template arguments and static const values. This representation allows the calculation of these values to be inlined by the compiler when indexing tensors within a view.
The Problem being Addressed
When indexing tensors from a view, even if the layout and indexes are predetermined at compile time, the compiler uses the member arrays of TensorRef and TensorTileRef to calculate the offsets of the indexed data within the view. This is the case even with aggressive optimizations, and so leaves performance on the table by forcing the values in the TensorRef's arrays to be loaded to calculate each index.
To eliminate this additional loading, the information contained within these arrays (strides,offsets,sizes) would need to be represented statically not only within TensorRef, but within every function, type, and template used to get from an index operation all the way to its return. The types included in this PR serve this purpose by providing alternatives along each link in this chain of abstractions, allowing for both indexes and layouts to be represented statically or non-statically in any combination (including within expression templates) while preserving functionality.
Added Types
-
StaticTensorIndex: Allows users to provide a compile-time representation of tensor indexes - [helper type]
StaticTensorIndexInner: Used byStaticTensorIndexso that templates which specialize based off ofStaticTensorIndexbut not the internal types/values can simply insert atypenameparameter when referring to its specialization -
StaticTensorRef: The return type when indexing tensors in a view with aStaticLayoutusingStaticTensorIndexs - [helper type]
RefBridge: Prior to this PR, specializations ofTensorRegisteronly provide implementations ofload_refandstore_reftoTensorRefsof the appropriate dimensionality. To extend these methods to includeStaticTensorRefwhile maintaining this constraint, this inner type is required. -
StaticTensorTileRef: Used internally byStaticTensorRefand behaves as if it was aTensorTileRef, apart from not having modifiable begin and size arrays - [helper type]
StaticIndexArray: Used internally byStaticTensorRefandStaticTensorTileRefit is a wrapper type around acamp::int_seqthat pretends to be an array containing its argument list as values. By using this type, functions originally meant to work withTensorRefandTensorTileRef(and which don't need to modify the underlying data) can treat their static alternatives identically. In cases where array data does need to be altered, helper types such asSetStaticIndexArrayallow for the creation of new specializations ofStaticIndexArraywith certain elements set to different values. - [specialization]
TensorIndexTraits<StaticTensorIndex<...>>: Required for other code to treatStaticTensorIndexas though it wereTensorIndex - [specialization]
MergeRefTile<StaticTensorRef<...>,StaticTensorTile<...>,camp::int_seq<...>>: Implements compile-time merging of refs and tiles only if both are static -
StaticTensorTileExec: Unrolls the iteration that comes with executing a body upon a tile, used when the tile is aStaticTensorTile - [specialization]
ViewReturnHelper: Peforms indexing of views withStaticLayoutBase_implas a layout type andStaticTensorIndexas the index type at compile time
Testing
These changes have been tested - and completely passed - for the following platforms:
- toss3(quartz) : gcc, clang, icpc
- blueos(lassen): nvcc_gcc, nvcc_xl
Performance
Preliminary testing has shown performance approaching 0.70 to 0.85 times the performance achieved with intrinsics (as opposed to the ~0.5 times performance seen without the use of these static facilities).
It seems likely that further development could produce additional benefits. This is because the 0.85-times performance corresponds to the use of static types for all layouts and indexes except for the layout of the assigned view in an expression template. Making this last layout static decreases the performance to 0.70 times the intrinsic performance. However, making any other value non-static provides even worse performance - though nothing worse than the base non-static performance.
This seems to indicate some optimization is yet required. Nonetheless - even without this optimization - a 1.4-to-1.7 times speedup is an improvement.
Design review
Most of the types added are for internal use, as they mirror the functionality of internal types themselves. However, the StaticTensorIndex template struct should be scrutinized for usability. Currently, to aid in the construction of StaticTensorIndex specializations, the static template methods static_range and static_all have been added to the TensorIndex type. These template methods mirror the usage of the range and all methods, but their inputs are provided as template arguments.
An alternative implementation would be to make range and all template methods, give them normal behavior when template arguments are not provided, and have them act like static_range and static_all if they are. While this is feasible with the use of arrow notation to change the return type based off of the specialization of the method, this change may make debugging more difficult for developers if the function is not used properly. Particularly, making these methods templates increases the likelihood developers would run into unfriendly mountains of template error messages if they do not use such methods properly.
I'll try to give this a more in-depth review, but as a start we definitely need some tests for this code. You mention that it has been tested, but it doesn't seem to be hooked into any of the raja tests yet.
sigh sorry about that, misclick 🤦
I'll try to give this a more in-depth review, but as a start we definitely need some tests for this code. You mention that it has been tested, but it doesn't seem to be hooked into any of the raja tests yet.
@braxtoncuneo and I will go over these changes at the next RAJA meeting 7/5. Hopefully you and @ajkunen can make it, and give us some feedback.
We still need to add tests, but Braxton had checked that the changes do not break existing tests.
That sounds great, I intend to be around for that meeting. The main question I’ll be thinking about when looking at this is how much of this we can fold out so users get this functionality automatically from the existing types when specifying static bounds. I don’t think we’ll get all of it, so the added types are probably a good idea, but I think some of the duplication might be able to fold out with some judicious use of C++14-extended constexpr.
This latest push rebases off of the most recent version of develop and folds some minor commits into larger ones.
This latest push rebases off of the most recent version of
developand folds some minor commits into larger ones.
@braxtoncuneo Is it safe for me to make more test modifications to this branch now?
@ajkunen @trws Existing tests have been updated to use the new StaticLayout with vectorization. Please take a look at this PR when you have a chance.
@ajkunen @trws @davidbeckingsale Would you mind taking another look at this? Thanks!
This looks OK to me. I would like to get one or two more approvals before merging. Please take a look and approve or comment on changes you think should be made. Thank you.
@MrBurmark @ajkunen @trws do you want to take another look before merging this?