Padding as fusion
Many CNN layers use padding (typically of 1) on their inputs, usually by adding 0s in a one-pixel halo around the image. Another way to pad (better in certain situations, but not always) is to replicate the nearest data on the boundary.
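To make the two styles concrete, here is a minimal sketch of a standalone one-pixel padding pass (the fused scheme described next avoids running this as a separate step); `pad2d` and `PadMode` are illustrative names of ours, not any library's API:

```cuda
// Standalone 1-pixel padding pass, for reference. `pad2d`/`PadMode`
// are illustrative names. Zero mode fills the halo with 0s; replicate
// mode clamps the read coordinates to the nearest boundary pixel.
enum PadMode { PAD_ZERO, PAD_REPLICATE };

void pad2d(const float* src, float* dst, int H, int W, PadMode mode) {
    int Wp = W + 2;  // padded row stride
    for (int y = -1; y <= H; ++y) {
        for (int x = -1; x <= W; ++x) {
            float v;
            if (y >= 0 && y < H && x >= 0 && x < W) {
                v = src[y * W + x];                      // interior pixel
            } else if (mode == PAD_ZERO) {
                v = 0.0f;                                // zero halo
            } else {
                int cy = y < 0 ? 0 : (y >= H ? H - 1 : y);
                int cx = x < 0 ? 0 : (x >= W ? W - 1 : x);
                v = src[cy * W + cx];                    // nearest boundary pixel
            }
            dst[(y + 1) * Wp + (x + 1)] = v;
        }
    }
}
```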
A good way to implement either pad style is to pad the data explicitly at the output of the layer that produces it (typically the previous convolution). Explicitly here means allocating memory that corresponds to an (H+2)x(W+2) image, as opposed to an HxW image.
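A minimal CUDA sketch of this fused scheme, under assumptions of ours (the kernel name `producer_padded` and a `compute_pixel` stub standing in for the producing layer's real math): each thread writes its result at an offset of (1,1) into a buffer with row stride W+2, and for replicate padding the boundary threads additionally copy their value into the adjacent halo cells. For zero padding, the halo can instead be cleared once with cudaMemset and the halo writes skipped.

```cuda
#include <cuda_runtime.h>

// Stub for the producing layer's actual math (e.g. a convolution output).
__device__ float compute_pixel(int y, int x) {
    return (float)(y * 1000 + x);
}

// Producer that writes its HxW result directly into an (H+2)x(W+2)
// buffer at offset (1,1), fusing the padding into the output store.
__global__ void producer_padded(float* out, int H, int W, bool replicate) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    int Wp = W + 2;
    float v = compute_pixel(y, x);
    out[(y + 1) * Wp + (x + 1)] = v;  // interior write, shifted by the halo

    if (replicate) {  // boundary threads also fill the neighboring halo cells
        if (x == 0)         out[(y + 1) * Wp]           = v;  // left edge
        if (x == W - 1)     out[(y + 1) * Wp + (W + 1)] = v;  // right edge
        if (y == 0)         out[x + 1]                  = v;  // top edge
        if (y == H - 1)     out[(H + 1) * Wp + (x + 1)] = v;  // bottom edge
        // corner threads also fill the four halo corners
        if (x == 0 && y == 0)         out[0]                      = v;
        if (x == W - 1 && y == 0)     out[W + 1]                  = v;
        if (x == 0 && y == H - 1)     out[(H + 1) * Wp]           = v;
        if (x == W - 1 && y == H - 1) out[(H + 1) * Wp + (W + 1)] = v;
    }
}
```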
A no-brainer situation in which this scheme beats a separate explicit padding pass is when H and W are not already magic numbers, i.e. when incrementing them will neither increase the number of workgroups nor reduce cache-line utilization. For example, the 224x224 image size of ImageNet would not suffer from growing to 226x226!
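A hedged launch sketch for the zero-padding case, using the ImageNet sizes above: the grid covers only the HxW interior, so storing into a 226x226 buffer changes the row stride but not the workgroup count.

```cuda
int H = 224, W = 224;  // ImageNet-sized feature map
float* out;
cudaMalloc(&out, (H + 2) * (W + 2) * sizeof(float));    // 226x226 buffer
cudaMemset(out, 0, (H + 2) * (W + 2) * sizeof(float));  // zero halo, once

dim3 block(16, 16);
dim3 grid((W + block.x - 1) / block.x,   // 14x14 workgroups for 224x224,
          (H + block.y - 1) / block.y);  // unchanged by the padded stride
producer_padded<<<grid, block>>>(out, H, W, /*replicate=*/false);
```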