Sync batch norm consumes more GPU memory than normal batch norm
First of all, thank you for the implementation. It's very helpful.
I have one question: after sync batch norm is applied, it consumes more GPU memory than normal batch norm. Is that expected?
Hi @shachoi, thank you for your interest!
I haven't tested the memory usage carefully, but the quick answer is yes, mainly on the master GPU, because it needs to collect the statistics from the other cards.
That said, I wouldn't expect a big difference. Could you share exactly how much extra memory is used (as a percentage, for example)?
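For reference, one way to measure the per-GPU usage from within PyTorch is with the standard `torch.cuda` utilities. This is just a sketch, not part of this repo; note that `nvidia-smi` reports higher numbers because it also counts the CUDA context and the caching allocator:

```python
import torch

def report_gpu_memory(tag=""):
    # Print the peak memory allocated by PyTorch tensors on each GPU.
    for i in range(torch.cuda.device_count()):
        peak_mb = torch.cuda.max_memory_allocated(i) / (1024 ** 2)
        print(f"{tag} GPU{i}: {peak_mb:.0f} MB peak allocated")

# Example: call after a few training iterations with each BN variant,
# once with normal BN and once with SyncBN, then compare.
# report_gpu_memory("sync-bn")
```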
Hi @vacancy, thanks a lot for your reply :)
I have tested sync batch norm on a DeepLab-ResNet based segmentation task. When I apply it, it consumes about 30-40% more GPU memory. Detailed memory consumption is as follows:
| | GPU1 | GPU2 | GPU3 |
| --- | --- | --- | --- |
| sync batch norm | 8769 | 7125 | 7125 |
| PyTorch typical batch norm | 6687 | 5039 | 5039 |
Hi,
I currently have little idea about the exact cause of the extra memory consumption. I will probably revisit this issue next week.
Just for your reference, here is another project using this SyncBN: https://github.com/CSAILVision/semantic-segmentation-pytorch
@Tete-Xiao, do you have any comment on this?
@vacancy I did notice that the segmentation framework consumes more GPU memory than the normal one.
@shachoi Thank you for posting this issue! I think the memory consumption issue is confirmed. I will get back to this next week.
Hi @vacancy, thanks for your great work! Do you have any solution to the memory consumption issue now?
@Tete-Xiao If you have spare time recently, can you help me with this issue?
@Hellomodo Here is my quick reply. There are two major reasons (see the sketch after this list):
- We use the NCCL backend provided by PyTorch to sync the feature statistics across GPUs. This requires a certain amount of extra memory. Although it shouldn't be this much in theory, in practice, PyTorch/NCCL might allocate more memory than required, depending on the implementation.
- We implement batch norm using primitive PyTorch ops, which requires extra memory to store intermediate variables. One way to reduce this cost is to optimize the code in
https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/blob/master/sync_batchnorm/batchnorm.py.
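To make both points concrete, here is a minimal sketch of the general SyncBN forward pattern, written against `torch.distributed` with the NCCL backend. The function name and the exact reduction scheme are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.distributed as dist

def sync_bn_forward(x, eps=1e-5):
    """Illustrative SyncBN forward for an (N, C, H, W) input.

    Assumes torch.distributed is already initialized with the NCCL
    backend. A sketch of the pattern, not this repo's implementation.
    """
    # Per-GPU partial statistics, one value per channel.
    sum_ = x.sum(dim=(0, 2, 3))
    ssum = (x * x).sum(dim=(0, 2, 3))  # x * x is a full-size intermediate

    # Cross-GPU reduction; NCCL allocates communication buffers here,
    # which is one source of the extra memory.
    dist.all_reduce(sum_)
    dist.all_reduce(ssum)

    count = x.numel() / x.size(1) * dist.get_world_size()
    mean = sum_ / count
    var = ssum / count - mean * mean

    # Composing the normalization from primitive ops makes autograd keep
    # several full-size intermediates (x - mean, the rsqrt, ...) for the
    # backward pass -- the second source of extra memory.
    mean = mean[None, :, None, None]
    var = var[None, :, None, None]
    return (x - mean) * torch.rsqrt(var + eps)
```

A fused kernel, like the one behind the stock `nn.BatchNorm2d`, computes the same result without materializing all of these intermediates, which is roughly where a 30-40% gap can come from.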
I have faced the same issue. Any progress so far?
I have faced the same issue too. Before using `convert_model` to replace the typical batch norm with `SynchronizedBatchNorm2d`: GPU1 7520, GPU2 6756. After that: GPU1 9760, GPU2 8796. Can you help me? @vacancy
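For anyone reproducing this comparison, the conversion step mentioned above presumably looks something like the following. This is a sketch based on the repo's `convert_model` helper; the model and device IDs are placeholders:

```python
import torch.nn as nn
from torchvision import models
from sync_batchnorm import convert_model

# Wrap a model in DataParallel, then swap every nn.BatchNorm2d
# for SynchronizedBatchNorm2d via convert_model.
model = nn.DataParallel(models.resnet50().cuda(), device_ids=[0, 1])
model = convert_model(model)  # BN layers are now synchronized across GPUs
```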