
[Tests][fp16] test_bn_3d_spatial_test fails CI test frequently

Open junliume opened this issue 3 years ago • 20 comments

[Keywords]: test; vega20; batchnorm

[Urgency]: HIGH, since it is already in the develop branch

[Env]: mainly vega20, sometimes gfx908

[Description]:

This test fails frequently (and usually passes on rerun):
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/cpu_conv_acc_tpye_fix/1/pipeline
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/fix_issue1564/1/pipeline/
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/ocl_miopengemm/3/pipeline
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/api_nchwc/14/pipeline

[2022-05-28T05:05:32.463Z]   9/100 Test   #3: test_bn_3d_spatial_test ...............................***Failed  Error regular expression found in output. Regex=[FAILED] 51.42 sec

[2022-05-28T05:05:32.463Z] (n=1) is not supported for BN operation.

[2022-05-28T05:05:32.463Z] (n=1) is not supported for BN operation.

[2022-05-28T05:05:32.463Z] (n=1) is not supported for BN operation.

[2022-05-28T05:05:32.463Z] (n=1) is not supported for BN operation.

[2022-05-28T05:05:32.463Z] (n=1) is not supported for BN operation.

[2022-05-28T05:05:32.463Z] (n=1) is not supported for BN operation.

[2022-05-28T05:05:32.463Z] /var/jenkins/workspace/MLLibs_MIOpen_develop/build/bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 

[2022-05-28T05:05:32.463Z] FAILED: -nan

[2022-05-28T05:05:32.463Z] Iteration: 26

[2022-05-28T05:05:32.463Z] Backward Batch Spatial Normalization Use Saved Mean and Variance: 

[2022-05-28T05:05:32.463Z] X Input tensor: 128, 32, 6, 6, 6

[2022-05-28T05:05:32.463Z] Delta Y Input tensor: 128, 32, 6, 6, 6

[2022-05-28T05:05:32.463Z] Delta X output tensor output failed verification.

[2022-05-28T05:05:32.463Z] Max diff: inf

[2022-05-28T05:05:32.463Z] Mismatch at 0: -0.000190616 != inf

[2022-05-28T05:05:32.463Z] Non finite number found in gpu at 0: inf

[2022-05-28T05:05:32.463Z] /var/jenkins/workspace/MLLibs_MIOpen_develop/build/bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 

[2022-05-28T05:05:32.463Z] FAILED: inf

[2022-05-28T05:05:32.463Z] Iteration: 26

[2022-05-28T05:05:32.463Z] Backward Batch Spatial Normalization Use Saved Mean and Variance: 

[2022-05-28T05:05:32.463Z] X Input tensor: 128, 32, 6, 6, 6

[2022-05-28T05:05:32.463Z] Delta Y Input tensor: 128, 32, 6, 6, 6

[2022-05-28T05:05:32.463Z] Delta shift output tensor failed verification.

[2022-05-28T05:05:32.463Z] Max diff: 1.31269e+23

[2022-05-28T05:05:32.463Z] Mismatch at 0: -4.14615 != 3.61019e+18

Currently hard to reproduce since it is never a problem on my local Navi21 node.

junliume avatar Feb 11 '22 20:02 junliume

@muralinr and @DrizztDoUrden could you please take a look with me too?

junliume avatar Feb 11 '22 20:02 junliume

It looks like this issue is specific to Vega20. I tried it on Navi21 and did not see this issue on the latest develop.

muralinr avatar Feb 11 '22 23:02 muralinr

Right, I tried on another gfx900 and cannot reproduce this issue either.

junliume avatar Feb 12 '22 03:02 junliume

It seems that it only fails for gfx906

junliume avatar Feb 12 '22 06:02 junliume

Tried on a gfx906 and still cannot reproduce this issue.

junliume avatar Feb 14 '22 20:02 junliume

@DrizztDoUrden @shurale-nkn could you reproduce this issue?

junliume avatar May 29 '22 07:05 junliume

@carlushuang @shaojiewang do you have vega to test if the issue is reproducible?

junliume avatar May 30 '22 07:05 junliume

The problem is not reproducible on a gfx900 (with ROCm 5.0 base and ROCm 5.2 docker), so I am lowering the urgency level. However, it is still a "high" issue since it impacts MIOpen CI stability.

junliume avatar May 31 '22 07:05 junliume

Now this issue is happening on gfx908 again: http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/issue_1576_bwdfp16gpuref/5/pipeline

@JehandadKhan could we assign one host/API engineer on this issue?

junliume avatar Jun 06 '22 07:06 junliume

After some discussion: @muralinr could you try running this test multiple times on an MI100 development node and see if we can reproduce it? I would also suggest some static code checks to see if anything looks suspicious.

junliume avatar Jun 06 '22 17:06 junliume

@muralinr

Sure Jun.

muralinr avatar Jun 06 '22 17:06 muralinr

I am unable to reproduce this issue with --half --input 128, 32, 6, 6, 6 on my MI100 machine. I will continue to analyze it to see if anything looks suspicious.

root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 
root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6 root@miopen908-1:/testx/MIOpen/build# ./bin/test_bn_3d_spatial_test --half --input 128, 32, 6, 6, 6
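
Since the failure is intermittent, a scripted rerun loop can cover many more iterations than manual reruns and can capture the output of the first failing run. The following is only a minimal sketch: `run_until_failure` is a hypothetical helper, not part of MIOpen or its test suite, and it assumes a run counts as failed if the process exits with a nonzero code or prints "FAILED", mirroring the CTest error regex shown in the CI log above.

```python
import subprocess


def run_until_failure(cmd, max_runs=100, marker="FAILED"):
    """Run cmd up to max_runs times.

    Returns (run_index, combined_output) for the first failing run,
    or None if every run passed. A run is treated as failed when the
    exit code is nonzero or `marker` appears in its output (assumption).
    """
    for i in range(1, max_runs + 1):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        if proc.returncode != 0 or marker in proc.stdout:
            return i, proc.stdout
    return None


# Example call, using the arguments exactly as typed in the thread:
# result = run_until_failure(
#     ["./bin/test_bn_3d_spatial_test", "--half", "--input",
#      "128,", "32,", "6,", "6,", "6"],
#     max_runs=500,
# )
# if result is not None:
#     print(f"failed on run {result[0]}\n{result[1]}")
```

Keeping the captured output of the failing run matters here because the thread shows reruns usually pass, so the evidence is otherwise lost.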

muralinr avatar Jun 06 '22 18:06 muralinr

Typical culprits are either (1) the accumulation for the mean and variance reduction is being done in half precision, which will intermittently produce inf and nan, or (2) epsilon is set too low, resulting in numerical instability.
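
Both culprits can be illustrated outside MIOpen. The sketch below is not the MIOpen kernel: it emulates fp16 rounding with Python's `struct` "e" format, uses made-up input data (every element equal to 16.0), and uses a hypothetical epsilon of 1e-8. `to_half` and `sum_of_squares` are illustrative names. It uses the per-channel element count of the failing 128, 32, 6, 6, 6 case (128 * 6 * 6 * 6 = 27648).

```python
import math
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE fp16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def sum_of_squares(values, half_accumulator: bool) -> float:
    """Accumulate x*x, optionally rounding the accumulator to fp16 each step."""
    acc = 0.0
    for v in values:
        v = to_half(v)  # inputs are stored as fp16 either way
        acc += v * v
        if half_accumulator:
            acc = to_half(acc)  # fp16 max finite value is 65504
    return acc


count = 128 * 6 * 6 * 6      # elements reduced per channel
values = [16.0] * count      # made-up data, not from the failing test

half_sum = sum_of_squares(values, half_accumulator=True)
wide_sum = sum_of_squares(values, half_accumulator=False)  # fp64 stand-in for fp32

print(half_sum)              # inf: overflows once the sum passes 65504
print(wide_sum)              # 7077888.0
print(half_sum - half_sum)   # nan: inf - inf in any later step
print(to_half(1e-8))         # 0.0: this epsilon vanishes in fp16
```

With smaller inputs the fp16 accumulator may stall at a finite value instead of overflowing, so whether inf appears depends on the data, which is consistent with an intermittent failure on randomly generated inputs.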

daniellowell avatar Jun 13 '22 14:06 daniellowell

@junliume Excuse me, but I think it's too soon to close this. We have a workaround (W/A), which lets us reduce the urgency, but we do not yet know the root cause of the issue and we do not have a fix.

atamazov avatar Jun 21 '22 23:06 atamazov

@atamazov Yes, we should not close this (it was closed automatically when the workaround PR was merged).

Actually I think the urgency of this one should be higher since now we are observing MI100 having similar issues. @zjing14 is taking a look at it.

Our plan is to replace this OCL kernel with CK BN kernels.

junliume avatar Jun 22 '22 18:06 junliume

Similar failure found in PR1611 https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1611

Snippet and full log as below http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/sl%2Fgoogle_test/26/pipeline/613/

[2022-08-02T09:30:01.914Z] 8/103 Test #3: test_bn_3d_spatial_test ...............................***Failed Error regular expression found in output. Regex=[FAILED] 44.68 sec

[2022-08-02T09:30:01.914Z] (n=1) is not supported for BN operation.

[2022-08-02T09:30:01.914Z] (n=1) is not supported for BN operation.

[2022-08-02T09:30:01.914Z] (n=1) is not supported for BN operation.

[2022-08-02T09:30:01.914Z] (n=1) is not supported for BN operation.

[2022-08-02T09:30:01.914Z] (n=1) is not supported for BN operation.

[2022-08-02T09:30:01.914Z] (n=1) is not supported for BN operation.

xinlipn avatar Aug 02 '22 15:08 xinlipn

Thanks, @xinlipn. It has been planned for a while that BN will eventually be replaced by CK, so we are treating existing issues at relatively lower priority, especially since this one is not always reproducible.

junliume avatar Aug 02 '22 17:08 junliume

@junliume WORKAROUND_ISSUE_1424 is still in our code, so I recommend reopening it with https://github.com/ROCmSoftwarePlatform/MIOpen/labels/urgency_low

atamazov avatar Nov 28 '22 22:11 atamazov

@junliume WORKAROUND_ISSUE_1424 is still in our code, so I recommend reopening it with urgency_low

Ping. test_bn_3d_spatial_test is still disabled. BN implementation is still OCL-based.

atamazov avatar Dec 29 '22 13:12 atamazov

@junliume Is this ticket still relevant? Thanks!

ppanchad-amd avatar Apr 16 '24 15:04 ppanchad-amd