Sync batch norm consumes more GPU memory than normal batch norm
First of all, thank you for the implementation. It's very helpful.
I have one question: after sync batch norm is applied, it consumes more GPU memory than normal batch norm. Is that expected?
Hi @shachoi, thank you for your interest!
I haven't tested the memory usage carefully, but the quick answer is yes, mainly on the master GPU, because it needs to collect the statistics from the other cards.
That said, I wouldn't expect a big difference. Could you share exactly how much extra memory is used (as a percentage, for example)?
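For reference, one way to measure the per-GPU usage from within PyTorch is with the standard `torch.cuda` utilities. This is just a sketch, not part of this repo; note that `nvidia-smi` reports higher numbers because it also counts the CUDA context and the caching allocator:

```python
import torch

def report_gpu_memory(tag=""):
    # Print the peak memory allocated by PyTorch tensors on each GPU.
    for i in range(torch.cuda.device_count()):
        peak_mb = torch.cuda.max_memory_allocated(i) / (1024 ** 2)
        print(f"{tag} GPU{i}: {peak_mb:.0f} MB peak allocated")

# Example: call after a few training iterations with each BN variant,
# once with normal BN and once with SyncBN, then compare.
# report_gpu_memory("sync-bn")
```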
Hi @vacancy, thanks a lot for your reply :)
I have tested sync batch norm on a DeepLab-ResNet based segmentation task. When I apply it, it consumes about 30-40% more GPU memory. Detailed memory consumption is as follows:
| | GPU1 | GPU2 | GPU3 |
| --- | --- | --- | --- |
| sync batch norm | 8769 | 7125 | 7125 |
| PyTorch typical batch norm | 6687 | 5039 | 5039 |
Hi,
I currently have little idea about the exact cause of the extra memory consumption. I will probably revisit this issue next week.
Just for your reference, here is another project using this SyncBN: https://github.com/CSAILVision/semantic-segmentation-pytorch
@Tete-Xiao, do you have any comment on this?
@vacancy I did notice that the segmentation framework consumes more GPU memory than the normal one.
@shachoi Thank you for posting this issue! I think the memory consumption issue is confirmed. I will get back to this next week.
Hi @vacancy, thanks for your great work! Do you have any solution to the memory consumption issue now?
@Tete-Xiao If you have spare time recently, can you help me with this issue?
@Hellomodo Here is my quick reply. There are two major reasons (see the sketch after this list):
- We use the NCCL backend provided by PyTorch to sync the feature statistics across GPUs. This requires a certain amount of extra memory. Although it shouldn't be this much in theory, in practice, PyTorch/NCCL might allocate more memory than required, depending on the implementation.
- We implement batch norm using primitive PyTorch ops, which requires extra memory to store intermediate variables. One way to reduce this cost is to optimize the code in
https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/blob/master/sync_batchnorm/batchnorm.py.
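To make both points concrete, here is a minimal sketch of the general SyncBN forward pattern, written against `torch.distributed` with the NCCL backend. The function name and the exact reduction scheme are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.distributed as dist

def sync_bn_forward(x, eps=1e-5):
    """Illustrative SyncBN forward for an (N, C, H, W) input.

    Assumes torch.distributed is already initialized with the NCCL
    backend. A sketch of the pattern, not this repo's implementation.
    """
    # Per-GPU partial statistics, one value per channel.
    sum_ = x.sum(dim=(0, 2, 3))
    ssum = (x * x).sum(dim=(0, 2, 3))  # x * x is a full-size intermediate

    # Cross-GPU reduction; NCCL allocates communication buffers here,
    # which is one source of the extra memory.
    dist.all_reduce(sum_)
    dist.all_reduce(ssum)

    count = x.numel() / x.size(1) * dist.get_world_size()
    mean = sum_ / count
    var = ssum / count - mean * mean

    # Composing the normalization from primitive ops makes autograd keep
    # several full-size intermediates (x - mean, the rsqrt, ...) for the
    # backward pass -- the second source of extra memory.
    mean = mean[None, :, None, None]
    var = var[None, :, None, None]
    return (x - mean) * torch.rsqrt(var + eps)
```

A fused kernel, like the one behind the stock `nn.BatchNorm2d`, computes the same result without materializing all of these intermediates, which is roughly where a 30-40% gap can come from.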
I have faced the same issue. Any progress so far?
I have faced the same issue too. Before using `convert_model` to replace the typical batch norm with `SynchronizedBatchNorm2d`: GPU1 7520, GPU2 6756. After that: GPU1 9760, GPU2 8796. Can you help me? @vacancy
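For anyone reproducing this comparison, the conversion step mentioned above presumably looks something like the following. This is a sketch based on the repo's `convert_model` helper; the model and device IDs are placeholders:

```python
import torch.nn as nn
from torchvision import models
from sync_batchnorm import convert_model

# Wrap a model in DataParallel, then swap every nn.BatchNorm2d
# for SynchronizedBatchNorm2d via convert_model.
model = nn.DataParallel(models.resnet50().cuda(), device_ids=[0, 1])
model = convert_model(model)  # BN layers are now synchronized across GPUs
```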