[v1.x][CI] test_laop failed on CI

Open waytrue17 opened this issue 3 years ago • 0 comments

Description

On v1.x CI, test_operator_gpu.test_laop failed the numeric gradient check for potri by a large margin: at least 68.75% of the elements mismatched at rtol=1e-05, atol=1e-05.

[2022-05-25T18:08:54.036Z] ======================================================================

[2022-05-25T18:08:54.036Z] FAIL: test_operator_gpu.test_laop

[2022-05-25T18:08:54.036Z] ----------------------------------------------------------------------

[2022-05-25T18:08:54.036Z] Traceback (most recent call last):

[2022-05-25T18:08:54.036Z]   File "/usr/local/lib/python3.7/dist-packages/nose/case.py", line 198, in runTest

[2022-05-25T18:08:54.037Z]     self.test(*self.arg)

[2022-05-25T18:08:54.037Z]   File "/usr/local/lib/python3.7/dist-packages/nose/util.py", line 620, in newfunc

[2022-05-25T18:08:54.037Z]     return func(*arg, **kw)

[2022-05-25T18:08:54.037Z]   File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 218, in test_new

[2022-05-25T18:08:54.037Z]     orig_test(*args, **kwargs)

[2022-05-25T18:08:54.037Z]   File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 6593, in test_laop

[2022-05-25T18:08:54.037Z]     check_fw_grad(test_potri, [data_in], [res_potri])

[2022-05-25T18:08:54.037Z]   File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 6558, in check_fw_grad

[2022-05-25T18:08:54.037Z]     atol=atol_bw, dtype=dtype)

[2022-05-25T18:08:54.037Z]   File "/work/mxnet/python/mxnet/test_utils.py", line 1238, in check_numeric_gradient

[2022-05-25T18:08:54.037Z]     ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))

[2022-05-25T18:08:54.037Z]   File "/work/mxnet/python/mxnet/test_utils.py", line 749, in assert_almost_equal

[2022-05-25T18:08:54.037Z]     raise AssertionError(msg)

[2022-05-25T18:08:54.037Z] AssertionError: 

[2022-05-25T18:08:54.037Z] Items are not equal:

[2022-05-25T18:08:54.037Z] Error 62764.305585 exceeds tolerance rtol=1.000000e-05, atol=1.000000e-05 (mismatch at least 68.750000%).

[2022-05-25T18:08:54.037Z] Location of maximum error: (1, 2, 0, 0), NUMERICAL_data1=-0.00285029, BACKWARD_data1=-1.69324987

[2022-05-25T18:08:54.037Z]  ACTUAL: array([[[[-0.00179968]],

[2022-05-25T18:08:54.037Z] 

[2022-05-25T18:08:54.037Z]         [[-0.00871502]],...

[2022-05-25T18:08:54.037Z]  DESIRED: array([[[[-0.63388058]],

[2022-05-25T18:08:54.037Z] 

[2022-05-25T18:08:54.037Z]         [[-1.51251872]],...

[2022-05-25T18:08:54.037Z] -------------------- >> begin captured stdout << ---------------------

[2022-05-25T18:08:54.037Z] 

[2022-05-25T18:08:54.037Z] *** Maximum errors for vector of size 16:  rtol=1e-05, atol=1e-05

[2022-05-25T18:08:54.037Z] 

[2022-05-25T18:08:54.037Z]   1: Error 62764.305585  Location of error: (1, 2, 0, 0), NUMERICAL_data1=-0.00285029, BACKWARD_data1=-1.69324987

[2022-05-25T18:08:54.037Z]   2: Error 61551.914711  Location of error: (2, 3, 0, 0), NUMERICAL_data1=-0.00620331, BACKWARD_data1=-1.61704400

[2022-05-25T18:08:54.037Z]   3: Error 60891.178802  Location of error: (2, 0, 0, 0), NUMERICAL_data1=-0.02516404, BACKWARD_data1=-1.62131153

[2022-05-25T18:08:54.037Z]   4: Error 59852.437601  Location of error: (0, 1, 0, 0), NUMERICAL_data1=-0.00871502, BACKWARD_data1=-1.51251872

[2022-05-25T18:08:54.037Z]   5: Error 59245.766913  Location of error: (0, 3, 0, 0), NUMERICAL_data1=-0.29727407, BACKWARD_data1=-2.18316398

[2022-05-25T18:08:54.037Z]   6: Error 57879.102497  Location of error: (0, 2, 0, 0), NUMERICAL_data1=-0.00213730, BACKWARD_data1=-1.37919265

[2022-05-25T18:08:54.037Z]   7: Error 57522.249004  Location of error: (3, 1, 0, 0), NUMERICAL_data1=-0.00394569, BACKWARD_data1=-1.36346243

[2022-05-25T18:08:54.037Z]   8: Error 54722.891910  Location of error: (3, 0, 0, 0), NUMERICAL_data1=-0.00189854, BACKWARD_data1=-1.21281479

[2022-05-25T18:08:54.037Z]   9: Error 53389.739167  Location of error: (1, 1, 0, 0), NUMERICAL_data1=-0.11656958, BACKWARD_data1=-1.39554458

[2022-05-25T18:08:54.037Z]  10: Error 51433.366492  Location of error: (1, 0, 0, 0), NUMERICAL_data1=-0.00173236, BACKWARD_data1=-1.06259378

[2022-05-25T18:08:54.037Z] 

[2022-05-25T18:08:54.037Z] --------------------- >> end captured stdout << ----------------------

[2022-05-25T18:08:54.037Z] -------------------- >> begin captured logging << --------------------

[2022-05-25T18:08:54.037Z] common: WARNING: Error seen with seeded test, use MXNET_TEST_SEED=563683812 to reproduce.

[2022-05-25T18:08:54.037Z] --------------------- >> end captured logging << ---------------------
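
For anyone reading the "Error" column: the values are consistent with the absolute difference divided by the allowed tolerance, i.e. |numerical - backward| / (atol + rtol * |backward|). This formula is inferred from the numbers in the log, not quoted from mxnet.test_utils, and the helper name `violation` is made up for illustration. A minimal sketch that recomputes the top three entries:

```python
# Recompute the "Error" values from the log under the ASSUMED formula
#   |numerical - backward| / (atol + rtol * |backward|)
# `violation` is a hypothetical helper, not an MXNet API.

RTOL = 1e-5  # rtol reported in the log
ATOL = 1e-5  # atol reported in the log


def violation(numerical, backward, rtol=RTOL, atol=ATOL):
    """Return how many times the difference exceeds the allowed tolerance."""
    return abs(numerical - backward) / (atol + rtol * abs(backward))


# (location, NUMERICAL_data1, BACKWARD_data1, Error) copied from the log.
LOG_ROWS = [
    ((1, 2, 0, 0), -0.00285029, -1.69324987, 62764.305585),
    ((2, 3, 0, 0), -0.00620331, -1.61704400, 61551.914711),
    ((2, 0, 0, 0), -0.02516404, -1.62131153, 60891.178802),
]

errors = [violation(num, bwd) for _, num, bwd, _ in LOG_ROWS]

for (loc, num, bwd, logged), err in zip(LOG_ROWS, errors):
    print("%s recomputed=%.2f logged=%.2f" % (loc, err, logged))
```

An error of roughly 60000 therefore means the two gradients differ by about 60000 times the allowed tolerance. The backward values are around -1.5 while the numerical ones are mostly near zero, so this is not a borderline tolerance problem.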

Occurrences

Re-ran the test more than 10 times; only a few runs passed: https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/activity?branch=PR-21039
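
A possible way to re-run only this test locally with the seed from the captured logging. This is a sketch under assumptions: the test id `tests/python/gpu/test_operator_gpu.py:test_laop` is inferred from the traceback, the script is expected to be run from the root of an MXNet v1.x checkout with nose installed, and `build_repro` is a hypothetical helper name.

```python
import os
import subprocess
import sys

# Seed taken from the "use MXNET_TEST_SEED=... to reproduce" line in the log.
SEED = "563683812"

# ASSUMPTION: test id inferred from the traceback (test_operator_gpu.test_laop);
# adjust the path if your checkout is laid out differently.
TEST_ID = "tests/python/gpu/test_operator_gpu.py:test_laop"


def build_repro(seed=SEED, test_id=TEST_ID):
    """Return (command, environment) for a single seeded nose run."""
    env = dict(os.environ, MXNET_TEST_SEED=seed)
    cmd = [sys.executable, "-m", "nose", "--verbose", test_id]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_repro()
    # Print the equivalent shell command; pass --run to actually execute it.
    print("MXNET_TEST_SEED=%s %s" % (env["MXNET_TEST_SEED"], " ".join(cmd)))
    if "--run" in sys.argv[1:]:
        sys.exit(subprocess.call(cmd, env=env))
```

Since the test only failed on some CI runs, a fixed seed may not be enough to trigger the failure on different hardware or library versions.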

waytrue17, May 25 '22 20:05