apex
test_loss_scale_decrease fails with some random seeds
The test test_loss_scale_decrease in run_amp/test_checkpointing.py fails consistently with certain random seeds:
$ diff -u test_checkpointing.py.orig test_checkpointing.py
--- test_checkpointing.py.orig 2020-01-29 23:28:10.266063356 +0000
+++ test_checkpointing.py 2020-01-29 23:28:33.493162829 +0000
@@ -162,6 +162,7 @@
                 continue
 
     def test_loss_scale_decrease(self):
+        torch.manual_seed(2)
         num_losses = 3
         nb_decrease_loss_scales = [0, 1, 2]
         for opt_level in self.test_opt_levels:
$ python -m pytest -v test_checkpointing.py::TestCheckpointing::test_loss_scale_decrease
...
> self.assertEqual(update_ls, init_ls / 2**factor)
E AssertionError: 32768.0 != 16384.0
test_checkpointing.py:213: AssertionError
The failure always seems to occur when opt_level = O1, and always with the same values failing the assertion:
update_ls = 32768.0
init_ls = 65536.0
factor = 2
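For context, the assertion encodes amp's dynamic loss-scaling rule: every time an overflowing (inf/NaN) gradient is detected, the loss scale is halved, so after `factor` forced overflows it should equal init_ls / 2**factor. The failing case therefore looks like one expected overflow was never registered. Below is a minimal sketch of that rule for illustration only; it is not apex's actual implementation, and all names in it are made up:

# Sketch of the halving policy the assertion checks (illustrative, not apex code).
init_ls = 65536.0  # apex's default initial dynamic loss scale (2**16)

def scale_after_overflows(init_scale, num_overflows):
    """Each detected inf/NaN gradient halves the loss scale."""
    scale = init_scale
    for _ in range(num_overflows):
        scale /= 2.0
    return scale

# The test expects: after `factor` forced overflows, scale == init_ls / 2**factor.
for factor in [0, 1, 2]:
    assert scale_after_overflows(init_ls, factor) == init_ls / 2 ** factor

# The reported failure corresponds to one missing halving:
# observed 32768.0 (one halving) where 16384.0 (two halvings) was expected.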
The failure is consistent for me (same failing seeds, opt_level, and values) across:
- x86_64 and ppc64le
- CUDA 10.1 and 10.2
- PyTorch 1.2.0 and 1.3.1
I'm also seeing this error. Could you tell me how you solved it?
I'm afraid we just noted the failure and moved on, hoping the NVIDIA team would be able to reproduce and resolve it.