mxnet
[v1.x][CI] test_laop failed on CI
Description
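On the v1.x CI, test_operator_gpu.test_laop fails the numeric gradient check for potri (check_fw_grad(test_potri, ...)) by a large margin. In the run below, at least 68.75% of the compared elements mismatch at rtol=1e-05, atol=1e-05, and the numerical and backward gradients differ by orders of magnitude.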
[2022-05-25T18:08:54.036Z] ======================================================================
[2022-05-25T18:08:54.036Z] FAIL: test_operator_gpu.test_laop
[2022-05-25T18:08:54.036Z] ----------------------------------------------------------------------
[2022-05-25T18:08:54.036Z] Traceback (most recent call last):
[2022-05-25T18:08:54.036Z] File "/usr/local/lib/python3.7/dist-packages/nose/case.py", line 198, in runTest
[2022-05-25T18:08:54.037Z] self.test(*self.arg)
[2022-05-25T18:08:54.037Z] File "/usr/local/lib/python3.7/dist-packages/nose/util.py", line 620, in newfunc
[2022-05-25T18:08:54.037Z] return func(*arg, **kw)
[2022-05-25T18:08:54.037Z] File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 218, in test_new
[2022-05-25T18:08:54.037Z] orig_test(*args, **kwargs)
[2022-05-25T18:08:54.037Z] File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 6593, in test_laop
[2022-05-25T18:08:54.037Z] check_fw_grad(test_potri, [data_in], [res_potri])
[2022-05-25T18:08:54.037Z] File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 6558, in check_fw_grad
[2022-05-25T18:08:54.037Z] atol=atol_bw, dtype=dtype)
[2022-05-25T18:08:54.037Z] File "/work/mxnet/python/mxnet/test_utils.py", line 1238, in check_numeric_gradient
[2022-05-25T18:08:54.037Z] ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
[2022-05-25T18:08:54.037Z] File "/work/mxnet/python/mxnet/test_utils.py", line 749, in assert_almost_equal
[2022-05-25T18:08:54.037Z] raise AssertionError(msg)
[2022-05-25T18:08:54.037Z] AssertionError:
[2022-05-25T18:08:54.037Z] Items are not equal:
[2022-05-25T18:08:54.037Z] Error 62764.305585 exceeds tolerance rtol=1.000000e-05, atol=1.000000e-05 (mismatch at least 68.750000%).
[2022-05-25T18:08:54.037Z] Location of maximum error: (1, 2, 0, 0), NUMERICAL_data1=-0.00285029, BACKWARD_data1=-1.69324987
[2022-05-25T18:08:54.037Z] ACTUAL: array([[[[-0.00179968]],
[2022-05-25T18:08:54.037Z]
[2022-05-25T18:08:54.037Z] [[-0.00871502]],...
[2022-05-25T18:08:54.037Z] DESIRED: array([[[[-0.63388058]],
[2022-05-25T18:08:54.037Z]
[2022-05-25T18:08:54.037Z] [[-1.51251872]],...
[2022-05-25T18:08:54.037Z] -------------------- >> begin captured stdout << ---------------------
[2022-05-25T18:08:54.037Z]
[2022-05-25T18:08:54.037Z] *** Maximum errors for vector of size 16: rtol=1e-05, atol=1e-05
[2022-05-25T18:08:54.037Z]
[2022-05-25T18:08:54.037Z] 1: Error 62764.305585 Location of error: (1, 2, 0, 0), NUMERICAL_data1=-0.00285029, BACKWARD_data1=-1.69324987
[2022-05-25T18:08:54.037Z] 2: Error 61551.914711 Location of error: (2, 3, 0, 0), NUMERICAL_data1=-0.00620331, BACKWARD_data1=-1.61704400
[2022-05-25T18:08:54.037Z] 3: Error 60891.178802 Location of error: (2, 0, 0, 0), NUMERICAL_data1=-0.02516404, BACKWARD_data1=-1.62131153
[2022-05-25T18:08:54.037Z] 4: Error 59852.437601 Location of error: (0, 1, 0, 0), NUMERICAL_data1=-0.00871502, BACKWARD_data1=-1.51251872
[2022-05-25T18:08:54.037Z] 5: Error 59245.766913 Location of error: (0, 3, 0, 0), NUMERICAL_data1=-0.29727407, BACKWARD_data1=-2.18316398
[2022-05-25T18:08:54.037Z] 6: Error 57879.102497 Location of error: (0, 2, 0, 0), NUMERICAL_data1=-0.00213730, BACKWARD_data1=-1.37919265
[2022-05-25T18:08:54.037Z] 7: Error 57522.249004 Location of error: (3, 1, 0, 0), NUMERICAL_data1=-0.00394569, BACKWARD_data1=-1.36346243
[2022-05-25T18:08:54.037Z] 8: Error 54722.891910 Location of error: (3, 0, 0, 0), NUMERICAL_data1=-0.00189854, BACKWARD_data1=-1.21281479
[2022-05-25T18:08:54.037Z] 9: Error 53389.739167 Location of error: (1, 1, 0, 0), NUMERICAL_data1=-0.11656958, BACKWARD_data1=-1.39554458
[2022-05-25T18:08:54.037Z] 10: Error 51433.366492 Location of error: (1, 0, 0, 0), NUMERICAL_data1=-0.00173236, BACKWARD_data1=-1.06259378
[2022-05-25T18:08:54.037Z]
[2022-05-25T18:08:54.037Z] --------------------- >> end captured stdout << ----------------------
[2022-05-25T18:08:54.037Z] -------------------- >> begin captured logging << --------------------
[2022-05-25T18:08:54.037Z] common: WARNING: Error seen with seeded test, use MXNET_TEST_SEED=563683812 to reproduce.
[2022-05-25T18:08:54.037Z] --------------------- >> end captured logging << ---------------------
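For reading the "Error" column above: the reported values are consistent with a scaled violation of the form |actual - desired| / (atol + rtol * |desired|), where actual is the NUMERICAL gradient and desired is the BACKWARD gradient. The sketch below is inferred from the numbers in the log, not copied from test_utils.py, and the helper name `violation` is hypothetical. Under that assumption, any value above 1 is outside tolerance.

```python
import numpy as np

def violation(actual, desired, rtol=1e-5, atol=1e-5):
    """Scaled error: |actual - desired| / (atol + rtol * |desired|).

    Assumed form of the metric, inferred from the log values.
    """
    actual = np.asarray(actual, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    return np.abs(actual - desired) / (atol + rtol * np.abs(desired))

# Values copied from the first three entries of the captured stdout.
numerical = [-0.00285029, -0.00620331, -0.02516404]
backward = [-1.69324987, -1.61704400, -1.62131153]

for err in violation(numerical, backward):
    print(f"{err:.2f}")
```

The three printed values land close to the 62764.3, 61551.9 and 60891.2 reported in the log, so the failure is a real disagreement between the two gradients rather than a tolerance that is slightly too tight.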
Occurrences
Re-ran the test multiple times (more than 10); only a few runs passed: https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/activity?branch=PR-21039
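A possible way to re-run the failing case locally with the seed from the log is sketched below. The test path is an assumption derived from the traceback (tests/python/gpu/ plus the module name test_operator_gpu), the nose runner is taken from the traceback as well, and `build_repro` is a hypothetical helper. It only builds the command and environment; the actual run needs an MXNet checkout with a GPU build.

```python
import os

def build_repro(seed, test="tests/python/gpu/test_operator_gpu.py:test_laop"):
    """Return (argv, env) for re-running one test with a fixed seed.

    The default test path is an assumption based on the traceback.
    """
    # Copy the current environment and pin the seed reported by the CI log.
    env = dict(os.environ, MXNET_TEST_SEED=str(seed))
    # -s disables output capture so the seed and error table are printed.
    argv = ["nosetests", "--verbose", "-s", test]
    return argv, env

argv, env = build_repro(563683812)
print("MXNET_TEST_SEED=" + env["MXNET_TEST_SEED"], " ".join(argv))

# To actually run it from the root of an MXNet checkout:
# import subprocess
# subprocess.run(argv, env=env, check=False)
```

Since only a few of the CI re-runs passed, repeating this with and without the fixed seed may help tell a seed-dependent failure from one tied to the CI environment.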