
test_loss_scale_decrease fails with some random seeds

Open hartb opened this issue 6 years ago • 2 comments

Test test_loss_scale_decrease in run_amp/test_checkpointing.py fails consistently with certain random seeds:

$ diff -u test_checkpointing.py.orig test_checkpointing.py
--- test_checkpointing.py.orig  2020-01-29 23:28:10.266063356 +0000
+++ test_checkpointing.py       2020-01-29 23:28:33.493162829 +0000
@@ -162,6 +162,7 @@
                             continue
 
     def test_loss_scale_decrease(self):
+        torch.manual_seed(2)
         num_losses = 3
         nb_decrease_loss_scales = [0, 1, 2]
         for opt_level in self.test_opt_levels:

$ python -m pytest -v test_checkpointing.py::TestCheckpointing::test_loss_scale_decrease
...
>               self.assertEqual(update_ls, init_ls / 2**factor)
E               AssertionError: 32768.0 != 16384.0

test_checkpointing.py:213: AssertionError

The failure always seems to occur when opt_level = O1, and always with the same values failing the assertion:

update_ls = 32768.0
init_ls   = 65536.0
factor    = 2
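
For reference, my own arithmetic (not apex code): the initial dynamic loss scale here is 65536 (2**16), and the test forces `factor` overflows, each of which should halve the scale. With factor = 2 the expected result is 16384, while the observed 32768 corresponds to only a single halving:

    init_ls = 65536.0
    factor = 2                        # overflows forced by the test

    expected = init_ls / 2 ** factor  # 16384.0 -- scale halved twice
    observed = init_ls / 2 ** 1       # 32768.0 -- scale halved only once

    assert expected == 16384.0 and observed == 32768.0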

The failure is consistent for me (same failing seeds, opt_level, and values) across:

  • x86_64 and ppc64le
  • CUDA 10.1 and 10.2
  • PyTorch 1.2.0 and 1.3.1
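
For context only (this is not apex, and not a reproduction of the failure): with a newer PyTorch (1.6+) and a CUDA device, the halving-on-overflow behavior the test checks can be sketched with PyTorch's own torch.cuda.amp.GradScaler:

    import torch

    device = "cuda"
    model = torch.nn.Linear(4, 4).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler(init_scale=65536.0, backoff_factor=0.5)

    for step in range(2):  # force two consecutive overflows
        opt.zero_grad()
        loss = model(torch.randn(4, 4, device=device)).sum()
        scaler.scale(loss).backward()
        # Inject an inf gradient so the scaler skips the step and backs off.
        next(model.parameters()).grad.fill_(float("inf"))
        scaler.step(opt)
        scaler.update()
        print(step, scaler.get_scale())  # expect 32768.0, then 16384.0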

hartb avatar Jan 29 '20 23:01 hartb

I also see this error. Could you tell me how you solved it?

HangJie720 avatar Dec 31 '21 04:12 HangJie720

I'm afraid we just noted the failure and moved on, hoping the NVIDIA team would be able to reproduce and resolve it.

hartb avatar Jan 04 '22 14:01 hartb