Synchronized-BatchNorm-PyTorch

about fp16

Open 666zz666 opened this issue 6 years ago • 5 comments

When I use fp16 (16-bit float) and multi-GPU training, the code hangs waiting in SyncBN (comm.py).

666zz666 avatar Feb 20 '19 13:02 666zz666

I haven't tried fp16 in PyTorch. Do you think it's due to some type mismatch: fp32 vs. fp16? It would be great if you could help me by adding a try-except in the forward method of the batch norm class. We should first check whether any exceptions are being thrown there.

vacancy avatar Feb 20 '19 13:02 vacancy

Thanks for your help. First, I am using two GPUs. Second, I added a try-except in the forward method of the _SynchronizedBatchNorm class (batchnorm.py). Then I located the error step by step.

1. batchnorm.py:

    if self._parallel_id == 0:
        mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))

2. comm.py:

    results = self._master_callback(intermediates)

The error message printed is 'An error occured.'

My try-except looks like this:

    try:
        # original forward() body goes here
        ...
    except IOError: print('An error occured trying to read the file.')
    except ValueError: print('Non-numeric data found in the file.')
    except ImportError: print('No module found')
    except EOFError: print('Why did you do an EOF on me?')
    except KeyboardInterrupt: print('You cancelled the operation.')
    except: print('An error occured.')

666zz666 avatar Feb 20 '19 14:02 666zz666

Can you give detailed information about the "error"?

For example, you may directly wrap the whole function body of forward() with a try-except statement:

    try:
        # original code
    except:
        import traceback
        traceback.print_exc()

vacancy avatar Feb 20 '19 15:02 vacancy

The detailed information:

    Traceback (most recent call last):
      File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 68, in forward
        mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))
      File "/mnt/data-2/data/cnn_multi/cnn_multi/sync_batchnorm/comm.py", line 125, in run_master
        results = self._master_callback(intermediates)
      File "/mnt/data-2/data/cnn_multi/cnn_multi/sync_batchnorm/batchnorm.py", line 108, in _data_parallel_master
        mean, inv_std = self._compute_mean_std(sum_, ssum, sum_size)
      File "/mnt/data-2/data/cnn_multi/cnn_multi/sync_batchnorm/batchnorm.py", line 122, in _compute_mean_std
        mean = sum_ / size
    RuntimeError: value cannot be converted to type at::Half without overflow: 528392

666zz666 avatar Feb 21 '19 02:02 666zz666

It seems that some values in the tensors exceed the maximum value of fp16... I guess it's the size? Can you double-check?
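For context, fp16 can only represent finite values up to 65504, so a per-batch element count like 528392 overflows as soon as it is converted to at::Half. A minimal snippet reproducing the same error (assuming a PyTorch version around 1.0, as in this thread; newer releases may handle the Python scalar differently):

    import torch

    # Largest finite value representable in fp16.
    print(torch.finfo(torch.float16).max)  # 65504.0

    # Dividing a half tensor by a Python int larger than 65504 forces the scalar
    # to be converted to at::Half, which raises the overflow RuntimeError above.
    x = torch.zeros(4, dtype=torch.float16)
    _ = x / 528392  # RuntimeError: value cannot be converted to type at::Half without overflow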

I am not an expert on this: is there any solution? I think this should be a general problem for fp16 training.
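One possible direction, sketched below and untested, is to run the cross-GPU reduction in fp32 and cast the result back to the input dtype afterwards. The names here (input_sum, input_ssum, sum_size, _ChildMessage, _slave_pipe) are taken from this repo's batchnorm.py; treat this as an assumption-laden sketch rather than a verified fix:

    # Hypothetical patch inside _SynchronizedBatchNorm.forward(): accumulate the
    # synchronized statistics in fp32 so the summed activations and the element
    # count fit without overflowing at::Half.
    if self._parallel_id == 0:
        mean, inv_std = self._sync_master.run_master(
            _ChildMessage(input_sum.float(), input_ssum.float(), sum_size))
    else:
        mean, inv_std = self._slave_pipe.run_slave(
            _ChildMessage(input_sum.float(), input_ssum.float(), sum_size))
    # Cast the fp32 statistics back to the input dtype (fp16 here) before normalizing.
    mean, inv_std = mean.to(input.dtype), inv_std.to(input.dtype)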

vacancy avatar Feb 21 '19 14:02 vacancy