Network performance is getting worse
Hi,
Thank you for your work and for sharing it.
I tried to use the convert_model function in my own code, for example:
cudnn.benchmark = True
net = Network()
net.cuda()
net = nn.DataParallel(net, device_ids=args.gpus)
net = convert_model(net)
However, after training I found that the results fall far short of my expectations, and are even worse than with PyTorch's built-in nn.BatchNorm2d. Am I using the convert_model function incorrectly, or are there any points I should pay attention to? Thank you very much!
Hi,
Could you please add more detail on this? For example,
What do you mean by "much worse" than the original BatchNorm? Also, please check the hyperparameters: are you using the same hyperparameters for all implementations? This includes the batch size, the learning rate, and especially the parameters of the BN layers (momentum, eps, etc.). If you are using a larger batch size due to multi-GPU training, please scale up the learning rate accordingly.
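The usual recipe for "scale up the lr" is the linear scaling rule: multiply the base learning rate by the ratio of the effective (total) batch size to the base batch size. A minimal sketch, where the base values and GPU count are assumptions for illustration, not values from this thread:

```python
# Linear learning-rate scaling (a sketch; all numbers are illustrative).
base_lr = 0.01        # learning rate tuned for the single-GPU setting
base_batch = 16       # batch size that base_lr was tuned for
per_gpu_batch = 16
num_gpus = 4          # hypothetical multi-GPU setup
effective_batch = per_gpu_batch * num_gpus

# Scale the learning rate proportionally to the effective batch size.
lr = base_lr * effective_batch / base_batch
print(lr)  # → 0.04
```

A short warm-up phase is often used together with this rule when the scaled learning rate is much larger than the base one.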
Hi,
I use the same hyperparameters for all implementations. To achieve this, I replace the BatchNorm function with a single assignment:
BatchNorm2d = xxx, where xxx is the batchnorm implementation I want to test.
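The swap described above can be sketched with a module-level alias, so every norm layer in the model is built through one name and all implementations see identical hyperparameters. The class names below are stand-ins (the real ones would be nn.BatchNorm2d, InPlace-ABN, or the converted SyncBN layer), used here only to keep the sketch self-contained:

```python
# A sketch of the "BatchNorm2d = xxx" swap pattern with dummy classes
# standing in for the real normalization layers (assumptions, not the
# actual library classes).

class _DummyBatchNorm2d:
    """Stand-in for nn.BatchNorm2d: records the shared hyperparameters."""
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.num_features = num_features
        self.eps = eps
        self.momentum = momentum

class _DummySyncBatchNorm2d(_DummyBatchNorm2d):
    """Stand-in for a synchronized BatchNorm implementation."""
    pass

# Pick the implementation once; the model code below only refers to
# the alias, so eps/momentum stay identical across experiments.
BatchNorm2d = _DummySyncBatchNorm2d

def make_block(channels):
    # Every norm layer in the network is constructed through the alias.
    return BatchNorm2d(channels, eps=1e-5, momentum=0.1)

layer = make_block(64)
print(type(layer).__name__)  # → _DummySyncBatchNorm2d
```

One caveat with this pattern: the alias must be rebound before the model is constructed, otherwise already-built layers keep the old implementation.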
I tested three BatchNorm functions on the NYU Depth V2 dataset (semantic segmentation task); here is their performance:
- Inplace ABN: Mean IoU ≈ 0.49
- nn.BatchNorm2d: Mean IoU < 0.46
- Your convert_model function: Mean IoU < 0.46, slightly lower than nn.BatchNorm2d