Extending LBANN Distconv Interface

Open szaman19 opened this issue 3 years ago • 0 comments

The LBANN Distconv adapter for layers mandates that only the first input tensor to a distconv-enabled layer can be a non-DiHydrogen tensor. An error is raised if any other tensor requires a copy to a DiHydrogen tensor. The checks are done at the following locations:

https://github.com/LLNL/lbann/blob/3b0ea84e2e0b86d14f466d9abe7c60e8b026e84a/src/layers/data_type_distconv_adapter.cpp#L329

https://github.com/LLNL/lbann/blob/3b0ea84e2e0b86d14f466d9abe7c60e8b026e84a/src/layers/data_type_distconv_adapter.cpp#L646

https://github.com/LLNL/lbann/blob/3b0ea84e2e0b86d14f466d9abe7c60e8b026e84a/src/layers/data_type_distconv_adapter.cpp#L787

https://github.com/LLNL/lbann/blob/3b0ea84e2e0b86d14f466d9abe7c60e8b026e84a/src/layers/data_type_distconv_adapter.cpp#L812

https://github.com/LLNL/lbann/blob/3b0ea84e2e0b86d14f466d9abe7c60e8b026e84a/src/layers/data_type_distconv_adapter.cpp#L836

https://github.com/LLNL/lbann/blob/3b0ea84e2e0b86d14f466d9abe7c60e8b026e84a/src/layers/data_type_distconv_adapter.cpp#L861

While these checks worked for the original DC layers (Convolution, MSE, ReLU), newer DC layers such as Scatter, Gather, and MatMul generally have more than one input that may need to be copied to a DiHydrogen tensor, so ideally we should support multiple parent tensors requiring a copy. Simply removing the checks resulted in failing CI tests.
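To make the restriction concrete, here is a minimal standalone sketch of the current rule and of the relaxed behaviour proposed here. It does not use the LBANN or DiHydrogen API: `ParentTensor`, `check_first_parent_only`, and `parents_requiring_copy` are hypothetical names invented for illustration, and the real checks in `data_type_distconv_adapter.cpp` may be structured differently.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a parent (input) tensor; not an LBANN type.
struct ParentTensor {
  std::string name;
  bool needs_copy;  // true if the tensor is not already a DiHydrogen tensor
};

// Rule as described in this issue: only parent 0 may require a copy.
// Throws if any later parent would need to be copied.
inline void check_first_parent_only(const std::vector<ParentTensor>& parents) {
  for (std::size_t i = 1; i < parents.size(); ++i) {
    if (parents[i].needs_copy) {
      throw std::runtime_error("parent " + std::to_string(i) + " (" +
                               parents[i].name +
                               ") requires a copy, but only parent 0 may");
    }
  }
}

// Relaxed behaviour (assumption): instead of rejecting, collect the index
// of every parent that needs a copy so the caller can set up one copy each.
inline std::vector<std::size_t> parents_requiring_copy(
    const std::vector<ParentTensor>& parents) {
  std::vector<std::size_t> indices;
  for (std::size_t i = 0; i < parents.size(); ++i) {
    if (parents[i].needs_copy) {
      indices.push_back(i);
    }
  }
  return indices;
}
```

With a two-input layer such as Gather where both inputs need a copy, `check_first_parent_only` throws, while `parents_requiring_copy` returns both indices. This only illustrates the intended contract; it does not explain why removing the checks breaks the CI tests.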

A possible workaround that uses an Identity layer as a copy layer also has issues: #2126

szaman19 • Aug 09 '22 21:08