MIOpen icon indicating copy to clipboard operation
MIOpen copied to clipboard

miopenBatchNormalizationForwardInference() failing

Open hansely opened this issue 3 years ago • 28 comments

miopenBatchNormalizationForwardInference function is failing when the nullptr is passed for mean & variance.

https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/amd_openvx_extensions/amd_nn/src/batch_normalization_layer.cpp#L121

The above function works fine but

https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/amd_openvx_extensions/amd_nn/src/scale_layer.cpp#L103

this function returns Inf values.

They both call the same function (miopenBatchNormalizationForwardInference) but the second one passes nullptr for mean & variance.

I'm still working on my end to see if the issue is on our side but wanted to check if this is a known issue to miopen. I was able to find some tickets opened regarding Inf & NaN values so this issue might be related to that?

hansely avatar Feb 23 '22 22:02 hansely

@hansely I do not think this is known issue.

Can you please provide us with reproduce instructions? This is absolutely necessary in order to start working on it.

atamazov avatar Feb 23 '22 22:02 atamazov

Hi @atamazov . You'll have to build MIVisionX first. You can either build it on your system or use the docker.

After building MIVisionX, you can try running the unit tests for scale layer & batch norm layer.

Scale layer unit test : Link. BatchNorm layer unit test : Link.

Please let me know if you don't have an authorization to this repository. The output of the unit test includes Inf values for scale layer where as the output of batch norm layer matches the golden results.

hansely avatar Feb 23 '22 23:02 hansely

@hansely Could you please rule out issue from your side as mentioned by you because both functions call same kernel?

muralinr avatar Feb 23 '22 23:02 muralinr

Both slice & batchnorm call the same function but with different parameters. Slice pass nullptr for mean & varaince and 0 for epsilon.

hansely avatar Feb 23 '22 23:02 hansely

@hansely Can you please provide us input arguments in both slice and batchnorm calls? You may dump them in both calls.

muralinr avatar Feb 23 '22 23:02 muralinr

Batchnorm:

miopenBatchNormalizationForwardInference(miopenHandle, miopenBNSpatial, &data->alpha, &data->beta, data->input_desc, data->input_mem, data->output_desc, data->output_mem, data->bnScaleBiasMeanVarDesc, data->bnScale, data->bnBias, data->bnMean, data->bnVariance, data->eps)

Slice:

miopenBatchNormalizationForwardInference(miopenHandle, miopenBNSpatial, &data->alpha, &data->beta, data->input_desc, data->input_mem, data->output_desc, data->output_mem, data->bnScaleBiasMeanVarDesc, data->bnScale, data->bnBias, nullptr, nullptr, 0));

FYI: Providing epsilon value as 0.00001 instead of 0 has removed the Inf & NaN values in the output but still did not match the expected output.

hansely avatar Feb 24 '22 00:02 hansely

@hansely I looked at your input arguments. I would suggest not to use batchnorm op for Slice. Batchnorm has division by squareroot of variance. If variance is zero, it would result in Inf & NAN values which is expected in this case. Even if we add epsilon value as 0.00001, input value is changed. Batchnorm

muralinr avatar Feb 24 '22 00:02 muralinr

@muralinr I understand. However, on miopen documentation it says passing nullptr for mean & variance is allowed. They will just not be used.

hansely avatar Feb 24 '22 00:02 hansely

@muralinr I understand. However, on miopen documentation it says passing nullptr for mean & variance is allowed. They will just not be used.

@hansely We will change documentation. Batchnorm code uses mean and variance.

@atamazov @junliume How do I change documentation here? I need to update the line saying that nullptr is not allowed for mean and variance.

https://rocmsoftwareplatform.github.io/MIOpen/doc/html/batchnorm.html?highlight=miopenbatchnorm#miopenbatchnormalizationforwardinference

muralinr avatar Feb 24 '22 00:02 muralinr

@muralinr @atamazov You can change it here: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/include/miopen/miopen.h#L2410

junliume avatar Feb 24 '22 15:02 junliume

Thanks. Closing the issue.

hansely avatar Feb 24 '22 19:02 hansely

@muralinr

However, on miopen documentation it says passing nullptr for mean & variance is allowed. They will just not be used.

~If this is not allowed, then MIOpen should MIOPEN_THROW with error message.~

We will change documentation. Batchnorm code uses mean and variance.

Let's close the issue when:

  • ~Error handling is implemented~
  • Documentation updated.

atamazov avatar Feb 24 '22 21:02 atamazov

@muralinr Are you saying that the else branch is not needed here:

https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/src/ocl/batchnormocl.cpp#L175 ... https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/src/ocl/batchnormocl.cpp#L224-L227 ...

atamazov avatar Feb 24 '22 21:02 atamazov

No code change is needed. We just need to update the current documentation saying that "If either estimatedMean, or estimatedVariance are null pointers then the values for the mean and variance will be calculated from input data and input values will be updated with this calculated mean and variance." This will avoid confusion.

muralinr avatar Feb 24 '22 23:02 muralinr

@atamazov Do you want me to raise a PR with comment changed here on develop? MIOpen/include/miopen/miopen.h

muralinr avatar Feb 25 '22 01:02 muralinr

@muralinr

No code change is needed. We just need to update the current documentation saying that "If either estimatedMean, or estimatedVariance are null pointers then the values for the mean and variance will be calculated from input data and input values will be updated with this calculated mean and variance." This will avoid confusion.

Does this mean that having Inf & NaN as output values is a normal behavior? Is passing nullptr to mean & variance an allowed behavior? If not, why is there a else part in the code? Would you be able to create sample test cases to test the else part?

hansely avatar Mar 01 '22 04:03 hansely

@muralinr

@atamazov Do you want me to raise a PR with comment changed here on develop?

I think yes, because you said that:

We will change documentation...

atamazov avatar Mar 01 '22 18:03 atamazov

@muralinr

No code change is needed. We just need to update the current documentation saying that "If either estimatedMean, or estimatedVariance are null pointers then the values for the mean and variance will be calculated from input data and input values will be updated with this calculated mean and variance." This will avoid confusion.

Does this mean that having Inf & NaN as output values is a normal behavior? Is passing nullptr to mean & variance an allowed behavior? If not, why is there a else part in the code? Would you be able to create sample test cases to test the else part?

@hansely We will phase out current batchnorm code soon and are in the process of implementing new batchnorm code. We are currently supporting blocking issues. I will update documentation with correct API usage.

muralinr avatar Mar 01 '22 18:03 muralinr

@muralinr

Does this mean that having Inf & NaN as output values is a normal behavior? Is passing nullptr to mean & variance an allowed behavior?

These questions are good ones. AFAIU these have nothing to do with BN implementation, but closely realted to the API of BN. Can you provide answers please? Maybe the best solution is inclusion of answers into documentation in the upcoming PR ;)

atamazov avatar Mar 01 '22 19:03 atamazov

@muralinr I think we should raise a PR to update the documentation first. The mentioned questions are good ones too, when we are implementing the new BNs. Thanks!

junliume avatar Mar 02 '22 02:03 junliume

@muralinr I think we should raise a PR to update the documentation first. The mentioned questions are good ones too, when we are implementing the new BNs. Thanks!

Sure folks. I will raise a PR to update the documentation first.

muralinr avatar Mar 02 '22 04:03 muralinr

This PR is raised to update API documentation. https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1447

muralinr avatar Mar 04 '22 00:03 muralinr

@muralinr

The mentioned questions are good ones too, when we are implementing the new BNs.

Excuse me, Murali, but I still have one Q: Is current implementation correct?

If it is correct, then the new BN shall just replicate existing behavior, and the questions have nothing to do with it.

Am I missing something?

atamazov avatar Mar 04 '22 18:03 atamazov

@muralinr

The mentioned questions are good ones too, when we are implementing the new BNs.

Excuse me, Murali, but I still have one Q: Is current implementation correct?

If it is correct, then the new BN shall just replicate existing behavior, and the questions have nothing to do with it.

Am I missing something?

We will add more error checks in new BN implementation to avoid NANs for cases when espilon is ZERO and calculated variance is 0. I will try to update current BN code itself to report espilon issues and avoid NANs.

muralinr avatar Mar 04 '22 20:03 muralinr

@muralinr When do you expect the new implementation to be released? Do you have at least the documentation that explains the new BN implementation? Our team(MIVIsionX) might also need to update BN & scale layer code accordingly based on the new implementation.

hansely avatar Mar 04 '22 21:03 hansely

@muralinr When do you expect the new implementation to be released? Do you have at least the documentation that explains the new BN implementation? Our team(MIVIsionX) might also need to update BN & scale layer code accordingly based on the new implementation.

@junliume Can you please comment on new BN implementation readiness date?

muralinr avatar Mar 04 '22 23:03 muralinr

Epsilon is specifically designed to prevent numerical instability. It isn't recommended to have it be zero. In addition, recalculating mean and variance during inference is extremely inefficient.

daniellowell avatar Jun 10 '22 19:06 daniellowell

@hansely @junliume Internal ticket has been created to fix documentation. Thanks!

ppanchad-amd avatar Apr 16 '24 15:04 ppanchad-amd