MIOpen miopenBatchNormalizationForwardInference() failing

miopenBatchNormalizationForwardInference function is failing when the nullptr is passed for mean & variance.

https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/amd_openvx_extensions/amd_nn/src/batch_normalization_layer.cpp#L121

The above function works fine but

https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/amd_openvx_extensions/amd_nn/src/scale_layer.cpp#L103

this function returns Inf values.

They both call the same function (miopenBatchNormalizationForwardInference) but the second one passes nullptr for mean & variance.

I'm still working on my end to see if the issue is on our side but wanted to check if this is a known issue to miopen. I was able to find some tickets opened regarding Inf & NaN values so this issue might be related to that?

Feb 23 '22 22:02 hansely

@hansely I do not think this is known issue.

Can you please provide us with reproduce instructions? This is absolutely necessary in order to start working on it.

Feb 23 '22 22:02 atamazov

Hi @atamazov . You'll have to build MIVisionX first. You can either build it on your system or use the docker.

After building MIVisionX, you can try running the unit tests for scale layer & batch norm layer.

Scale layer unit test : Link. BatchNorm layer unit test : Link.

Please let me know if you don't have an authorization to this repository. The output of the unit test includes Inf values for scale layer where as the output of batch norm layer matches the golden results.

Feb 23 '22 23:02 hansely

@hansely Could you please rule out issue from your side as mentioned by you because both functions call same kernel?

Feb 23 '22 23:02 muralinr

Both slice & batchnorm call the same function but with different parameters. Slice pass nullptr for mean & varaince and 0 for epsilon.

Feb 23 '22 23:02 hansely

@hansely Can you please provide us input arguments in both slice and batchnorm calls? You may dump them in both calls.

Feb 23 '22 23:02 muralinr

Batchnorm:

miopenBatchNormalizationForwardInference(miopenHandle, miopenBNSpatial, &data->alpha, &data->beta, data->input_desc, data->input_mem, data->output_desc, data->output_mem, data->bnScaleBiasMeanVarDesc, data->bnScale, data->bnBias, data->bnMean, data->bnVariance, data->eps)

Slice:

miopenBatchNormalizationForwardInference(miopenHandle, miopenBNSpatial, &data->alpha, &data->beta, data->input_desc, data->input_mem, data->output_desc, data->output_mem, data->bnScaleBiasMeanVarDesc, data->bnScale, data->bnBias, nullptr, nullptr, 0));

FYI: Providing epsilon value as 0.00001 instead of 0 has removed the Inf & NaN values in the output but still did not match the expected output.

Feb 24 '22 00:02 hansely

@hansely I looked at your input arguments. I would suggest not to use batchnorm op for Slice. Batchnorm has division by squareroot of variance. If variance is zero, it would result in Inf & NAN values which is expected in this case. Even if we add epsilon value as 0.00001, input value is changed.

Feb 24 '22 00:02 muralinr

@muralinr I understand. However, on miopen documentation it says passing nullptr for mean & variance is allowed. They will just not be used.

Feb 24 '22 00:02 hansely

@muralinr I understand. However, on miopen documentation it says passing nullptr for mean & variance is allowed. They will just not be used.

@hansely We will change documentation. Batchnorm code uses mean and variance.

@atamazov @junliume How do I change documentation here? I need to update the line saying that nullptr is not allowed for mean and variance.

https://rocmsoftwareplatform.github.io/MIOpen/doc/html/batchnorm.html?highlight=miopenbatchnorm#miopenbatchnormalizationforwardinference

Feb 24 '22 00:02 muralinr

@muralinr @atamazov You can change it here: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/include/miopen/miopen.h#L2410

Feb 24 '22 15:02 junliume

Thanks. Closing the issue.

Feb 24 '22 19:02 hansely

@muralinr

However, on miopen documentation it says passing nullptr for mean & variance is allowed. They will just not be used.

~If this is not allowed, then MIOpen should MIOPEN_THROW with error message.~

We will change documentation. Batchnorm code uses mean and variance.

Let's close the issue when:

~Error handling is implemented~
Documentation updated.

Feb 24 '22 21:02 atamazov

@muralinr Are you saying that the else branch is not needed here:

https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/src/ocl/batchnormocl.cpp#L175 ... https://github.com/ROCmSoftwarePlatform/MIOpen/blob/df658f382c5af02133836a8fa5700a23b58c2ca0/src/ocl/batchnormocl.cpp#L224-L227 ...

Feb 24 '22 21:02 atamazov

No code change is needed. We just need to update the current documentation saying that "If either estimatedMean, or estimatedVariance are null pointers then the values for the mean and variance will be calculated from input data and input values will be updated with this calculated mean and variance." This will avoid confusion.

Feb 24 '22 23:02 muralinr

@atamazov Do you want me to raise a PR with comment changed here on develop? MIOpen/include/miopen/miopen.h

Feb 25 '22 01:02 muralinr

@muralinr

No code change is needed. We just need to update the current documentation saying that "If either estimatedMean, or estimatedVariance are null pointers then the values for the mean and variance will be calculated from input data and input values will be updated with this calculated mean and variance." This will avoid confusion.

Does this mean that having Inf & NaN as output values is a normal behavior? Is passing nullptr to mean & variance an allowed behavior? If not, why is there a else part in the code? Would you be able to create sample test cases to test the else part?

Mar 01 '22 04:03 hansely

@muralinr

@atamazov Do you want me to raise a PR with comment changed here on develop?

I think yes, because you said that:

We will change documentation...

Mar 01 '22 18:03 atamazov

@muralinr

No code change is needed. We just need to update the current documentation saying that "If either estimatedMean, or estimatedVariance are null pointers then the values for the mean and variance will be calculated from input data and input values will be updated with this calculated mean and variance." This will avoid confusion.

Does this mean that having Inf & NaN as output values is a normal behavior? Is passing nullptr to mean & variance an allowed behavior? If not, why is there a else part in the code? Would you be able to create sample test cases to test the else part?

@hansely We will phase out current batchnorm code soon and are in the process of implementing new batchnorm code. We are currently supporting blocking issues. I will update documentation with correct API usage.

Mar 01 '22 18:03 muralinr

@muralinr

Does this mean that having Inf & NaN as output values is a normal behavior? Is passing nullptr to mean & variance an allowed behavior?

These questions are good ones. AFAIU these have nothing to do with BN implementation, but closely realted to the API of BN. Can you provide answers please? Maybe the best solution is inclusion of answers into documentation in the upcoming PR ;)

Mar 01 '22 19:03 atamazov

@muralinr I think we should raise a PR to update the documentation first. The mentioned questions are good ones too, when we are implementing the new BNs. Thanks!

Mar 02 '22 02:03 junliume

@muralinr I think we should raise a PR to update the documentation first. The mentioned questions are good ones too, when we are implementing the new BNs. Thanks!

Sure folks. I will raise a PR to update the documentation first.

Mar 02 '22 04:03 muralinr

This PR is raised to update API documentation. https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1447

Mar 04 '22 00:03 muralinr

@muralinr

The mentioned questions are good ones too, when we are implementing the new BNs.

Excuse me, Murali, but I still have one Q: Is current implementation correct?

If it is correct, then the new BN shall just replicate existing behavior, and the questions have nothing to do with it.

Am I missing something?

Mar 04 '22 18:03 atamazov

@muralinr

The mentioned questions are good ones too, when we are implementing the new BNs.

Excuse me, Murali, but I still have one Q: Is current implementation correct?

If it is correct, then the new BN shall just replicate existing behavior, and the questions have nothing to do with it.

Am I missing something?

We will add more error checks in new BN implementation to avoid NANs for cases when espilon is ZERO and calculated variance is 0. I will try to update current BN code itself to report espilon issues and avoid NANs.

Mar 04 '22 20:03 muralinr

@muralinr When do you expect the new implementation to be released? Do you have at least the documentation that explains the new BN implementation? Our team(MIVIsionX) might also need to update BN & scale layer code accordingly based on the new implementation.

Mar 04 '22 21:03 hansely

@muralinr When do you expect the new implementation to be released? Do you have at least the documentation that explains the new BN implementation? Our team(MIVIsionX) might also need to update BN & scale layer code accordingly based on the new implementation.

@junliume Can you please comment on new BN implementation readiness date?

Mar 04 '22 23:03 muralinr

Epsilon is specifically designed to prevent numerical instability. It isn't recommended to have it be zero. In addition, recalculating mean and variance during inference is extremely inefficient.

Jun 10 '22 19:06 daniellowell

@hansely @junliume Internal ticket has been created to fix documentation. Thanks!

Apr 16 '24 15:04 ppanchad-amd