about fp16
When I use fp16 (16-bit float) and multi-GPU training, the code hangs in SyncBN (comm.py).

I haven't tried fp16 in PyTorch. Do you think it's due to a type mismatch (fp32 vs. fp16)? It would be great if you could help by adding a try-catch in the forward method of the batch norm class; we should first check whether any exceptions are being thrown there.
Thanks for your help. First, I am using two GPUs. Second, I added a try-catch in the forward method of the _SynchronizedBatchNorm class (batchnorm.py) and then located the error step by step.
1. batchnorm.py:
   if self._parallel_id == 0:
       mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))
2. comm.py:
   results = self._master_callback(intermediates)
The error printed is 'An error occurred.'
My try-catch looks like this:
except IOError: print('An error occurred trying to read the file.')
except ValueError: print('Non-numeric data found in the file.')
except ImportError: print('No module found.')
except EOFError: print('Why did you do an EOF on me?')
except KeyboardInterrupt: print('You cancelled the operation.')
except: print('An error occurred.')
Can you give detailed information about the "error"?
For example, you can directly wrap the whole function body of forward() in a try-except statement:
try:
    # original code
except:
    import traceback
    traceback.print_exc()
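Concretely, something like this (just a sketch reusing the lines you quoted above, not the full batchnorm.py source):

try:
    if self._parallel_id == 0:
        mean, inv_std = self._sync_master.run_master(
            _ChildMessage(input_sum, input_ssum, sum_size))
    # ... rest of the original forward body ...
except Exception:
    import traceback
    traceback.print_exc()  # print the full stack trace instead of a generic message
    raise                  # re-raise so the failure is not silently swallowed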
The detailed information:

Traceback (most recent call last):
  File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 68, in forward
    mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))
  File "/mnt/data-2/data/cnn_multi/cnn_multi/sync_batchnorm/comm.py", line 125, in run_master
    results = self._master_callback(intermediates)
  File "/mnt/data-2/data/cnn_multi/cnn_multi/sync_batchnorm/batchnorm.py", line 108, in _data_parallel_master
    mean, inv_std = self._compute_mean_std(sum_, ssum, sum_size)
  File "/mnt/data-2/data/cnn_multi/cnn_multi/sync_batchnorm/batchnorm.py", line 122, in _compute_mean_std
    mean = sum_ / size
RuntimeError: value cannot be converted to type at::Half without overflow: 528392
It seems that some values in the tensors exceed the max value of fp16 ... I guess it's the size? Can you double-check?
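For reference (a quick sanity check, not part of the original code): the largest finite fp16 value is 65504, so 528392 cannot be represented in half precision.

import torch

print(torch.finfo(torch.float16).max)  # 65504.0, the largest finite fp16 value
print(torch.tensor(528392.0).half())   # tensor(inf, dtype=torch.float16): 528392 overflows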
I am not an expert on this: is there any solution? I think this must be a general problem for fp16 training.
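One common workaround (only a sketch of a general approach, not code from this repo) is to do the statistics math in fp32 and cast the results back to the input dtype afterwards, so the per-batch sums never need to fit into fp16. The helper below is hypothetical; the repo's own _compute_mean_std differs in detail.

import torch

def compute_mean_inv_std_fp32(sum_, ssum, size, eps=1e-5):
    # Hypothetical helper: reduce in float32, return results in the input (half) dtype.
    sum32, ssum32 = sum_.float(), ssum.float()
    mean = sum32 / size                    # e.g. 528392 / size is safe in fp32
    var = ssum32 / size - mean * mean      # biased variance
    inv_std = torch.rsqrt(var.clamp(min=eps))
    return mean.to(sum_.dtype), inv_std.to(sum_.dtype)

Another option many mixed-precision setups use is simply to keep the batch norm layers themselves in fp32 while the rest of the network runs in fp16.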