
Notation for tensor and matrix dimensions is inconsistent

timmoon10 opened this issue 6 years ago • 2 comments

In NCHW tensor notation, the last dimension is the contiguous dimension. In column-major matrix notation, the first dimension is the contiguous dimension. We haven't needed to think much about this since our data samples are usually 1D or 3D, but with transformers we need to do batched matrix multiplication. We should settle this question and commit to a consistent scheme to avoid confusion.
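As a quick illustration of the mismatch (using NumPy, which is mentioned below; this sketch is not LBANN code), here is how the contiguous dimension differs between C/row-major tensors and Fortran/column-major matrices:

```python
import numpy as np

# Hypothetical NCHW tensor: N=2, C=3, H=4, W=5, float32 (4 bytes/element).
# NumPy defaults to C (row-major) order, so the LAST dimension (W) is contiguous.
x = np.zeros((2, 3, 4, 5), dtype=np.float32)
print(x.strides)  # (240, 80, 20, 4): the W stride is one element (4 bytes)

# Column-major (Fortran-order) matrix: the FIRST dimension is contiguous.
a = np.zeros((4, 5), dtype=np.float32, order="F")
print(a.strides)  # (4, 16): the first-dimension stride is one element
```

The same shape therefore maps to memory in opposite ways depending on which convention the API assumes, which is exactly the ambiguity at issue.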

As much as it pains me coming from an applied math background, I think we should switch to C/row-major/NCHW notation. It matches PyTorch, TensorFlow, and NumPy and seems to be more natural for practitioners.

Whatever we decide, DiHydrogen should use the same scheme as LBANN. Pinging @benson31, @naoyam, @ndryden.

timmoon10 avatar Nov 08 '19 19:11 timmoon10

We should, ideally, be flexible with regard to data layouts.

For example, convolution in NHWC layout is preferred for using Tensor Cores. It may make sense to transpose layouts between different layers for optimal performance. And so on.
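To make the layout-transposition idea concrete, here is a minimal NumPy sketch (not LBANN code) of converting a tensor from NCHW to NHWC by permuting axes and re-materializing it contiguously:

```python
import numpy as np

# Hypothetical NCHW activation tensor: N=2, C=3, H=4, W=5.
x = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)

# Permute axes to NHWC; this only changes strides, not the underlying buffer.
y_view = x.transpose(0, 2, 3, 1)

# Physically repack into contiguous NHWC memory, as a layer boundary might.
y = np.ascontiguousarray(y_view)
print(y.shape)  # (2, 4, 5, 3)
```

Whether such a repack pays off depends on whether the downstream kernel (e.g. a Tensor Core convolution) prefers the new layout enough to amortize the copy.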

ndryden avatar Nov 08 '19 19:11 ndryden

NCHW vs. NHWC is somewhat orthogonal to my concern, since both are using C-style tensor notation. I'm thinking about our internal representation for tensors and our API.

timmoon10 avatar Nov 08 '19 20:11 timmoon10