Notation for tensor and matrix dimensions is inconsistent
In NCHW tensor notation, the last dimension is the contiguous dimension. In column-major matrix notation, the first dimension is the contiguous dimension. We haven't needed to think much about this since our data samples are usually 1D or 3D, but with transformers we need to do batched matrix multiplication. We should settle this question and commit to a consistent scheme to avoid confusion.
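To make the difference concrete, here is a minimal sketch (not DiHydrogen code; the helper names are just for illustration) of how a flat offset is computed under the two conventions for a 4D `(N, C, H, W)` shape:

```cpp
#include <array>
#include <cstddef>
#include <iostream>

// Row-major / NCHW: the last dimension (W) has stride 1.
std::size_t row_major_offset(const std::array<std::size_t, 4>& shape,
                             const std::array<std::size_t, 4>& idx) {
  std::size_t off = 0;
  for (std::size_t d = 0; d < 4; ++d)
    off = off * shape[d] + idx[d];
  return off;
}

// Column-major: the first dimension (N) has stride 1.
std::size_t col_major_offset(const std::array<std::size_t, 4>& shape,
                             const std::array<std::size_t, 4>& idx) {
  std::size_t off = 0;
  for (std::size_t d = 4; d-- > 0;)
    off = off * shape[d] + idx[d];
  return off;
}

int main() {
  const std::array<std::size_t, 4> shape{2, 3, 4, 5};  // N, C, H, W
  // Incrementing the W index moves 1 element in row-major storage...
  std::cout << row_major_offset(shape, {0, 0, 0, 1}) << '\n';  // prints 1
  // ...but N*C*H = 24 elements in column-major storage.
  std::cout << col_major_offset(shape, {0, 0, 0, 1}) << '\n';  // prints 24
}
```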
As much as it pains me coming from an applied math background, I think we should switch to C/row-major/NCHW notation. It matches PyTorch, TensorFlow, and NumPy, and seems more natural for practitioners.
Whatever we decide, DiHydrogen should use the same scheme as LBANN. Pinging @benson31, @naoyam, @ndryden.
We should, ideally, be flexible with regard to data layouts.
For example, the NHWC layout is preferred for convolutions that use Tensor Cores, so it may make sense to transpose layouts between layers for optimal performance. And so on.
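As a rough sketch of what such a layout change involves (hypothetical code, not an actual DiHydrogen or LBANN API), an NCHW-to-NHWC transpose of a row-major buffer could look like this:

```cpp
#include <cstddef>
#include <vector>

// Transpose a row-major NCHW buffer into row-major NHWC storage,
// e.g. to feed a Tensor-Core-friendly convolution. Only the physical
// dimension order changes; the logical tensor is the same.
std::vector<float> nchw_to_nhwc(const std::vector<float>& src,
                                std::size_t N, std::size_t C,
                                std::size_t H, std::size_t W) {
  std::vector<float> dst(src.size());
  for (std::size_t n = 0; n < N; ++n)
    for (std::size_t c = 0; c < C; ++c)
      for (std::size_t h = 0; h < H; ++h)
        for (std::size_t w = 0; w < W; ++w)
          dst[((n * H + h) * W + w) * C + c] =
              src[((n * C + c) * H + h) * W + w];
  return dst;
}
```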
NCHW vs. NHWC is somewhat orthogonal to my concern, since both use C-style tensor notation. I'm thinking about our internal representation for tensors and our API.
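To illustrate the distinction: the API can commit to C-style index notation while the physical layout is carried entirely by strides. This is only a sketch under that assumption; the type and field names are hypothetical, not a proposed interface:

```cpp
#include <array>
#include <cstddef>

// The API always indexes in logical (N, C, H, W) order; the memory
// layout lives in the strides, so NCHW and NHWC storage share one API.
struct Tensor4D {
  std::array<std::size_t, 4> shape;    // logical N, C, H, W extents
  std::array<std::size_t, 4> strides;  // element strides per dimension
  float* data;

  float& at(std::size_t n, std::size_t c, std::size_t h, std::size_t w) {
    return data[n * strides[0] + c * strides[1] +
                h * strides[2] + w * strides[3]];
  }
};

// NCHW storage (W contiguous): strides = {C*H*W, H*W, W, 1}
// NHWC storage (C contiguous): strides = {H*W*C, 1, W*C, C}
```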