MIOpen icon indicating copy to clipboard operation
MIOpen copied to clipboard

MIOpenDriver pooling `-F 2` fails

Open carlushuang opened this issue 4 years ago • 16 comments

./bin/MIOpenDriver poolfp16 -M 0 -n 32 -c 192 -H 27 -W 27 -y 3 -x 3 -p 0 -q 0 -v 2 -u 2 -m max -F 1 -t 1
./bin/MIOpenDriver poolfp16 -M 0 -n 32 -c 192 -H 27 -W 27 -y 3 -x 3 -p 0 -q 0 -v 2 -u 2 -m max -F 2 -t 1   # <= computation error
./bin/MIOpenDriver poolfp16 -M 0 -n 32 -c 256 -H 13 -W 13 -y 3 -x 3 -p 0 -q 0 -v 2 -u 2 -m max -F 1 -t 1
./bin/MIOpenDriver poolfp16 -M 0 -n 32 -c 256 -H 13 -W 13 -y 3 -x 3 -p 0 -q 0 -v 2 -u 2 -m max -F 2 -t 1   # <= computation error
./bin/MIOpenDriver poolfp16 -M 0 -n 32 -c 64 -H 55 -W 55 -y 3 -x 3 -p 0 -q 0 -v 2 -u 2 -m max -F 1 -t 1
./bin/MIOpenDriver poolfp16 -M 0 -n 32 -c 64 -H 55 -W 55 -y 3 -x 3 -p 0 -q 0 -v 2 -u 2 -m max -F 2 -t 1   # <= computation error

above 6 configs are logged from alexnet. the backward pooling have computation error, with latest MIOpen code. Both gfx908/gfx90a will have this failure. This is found while debugging SWDEV-313696

./bin/MIOpenDriver poolfp16 -M 0 -n 32 -c 256 -H 13 -W 13 -y 3 -x 3 -p 0 -q 0 -v 2 -u 2 -m max -F 2 -t 1
GPU Kernel Time Backward Pooling Elapsed: 0.029442 ms
Sqr error : 0.0000997628062 Max err: 0.1173834800720 at 0, 0, 0, 0 c_v = 0.117383480072 vs g_v = 0.000000000000
Difference : 4593122791816953856.000000000000 too large (eps=4.000000000000) at 0,0, 0,0 c_v = 0.117383480072 vs g_v = 0.000000000000

cc @junliume @atamazov @DrizztDoUrden

carlushuang avatar Dec 03 '21 10:12 carlushuang

It looks like these numbers are troublesome:

MIOpen(HIP): miopenStatus_t miopenGetTensorDescriptorSize(miopenTensorDescriptor_t, int *){ MIOpen(HIP): tensorDesc = 32, 192, 27, 27 MIOpen(HIP): size = 0 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenGet4dTensorDescriptorLengths(miopenTensorDescriptor_t, int *, int *, int *, int *){ MIOpen(HIP): tensorDesc = 32, 192, 27, 27 MIOpen(HIP): n = 32764 MIOpen(HIP): c = 319990760 MIOpen(HIP): h = 3 MIOpen(HIP): w = 0 MIOpen(HIP): }

junliume avatar Dec 03 '21 20:12 junliume

Assigned @ce1adon as an implementer of many pooling features, see https://github.com/ROCmSoftwarePlatform/MIOpen/commits/develop/src/ocl/pooling_ocl.cpp

atamazov avatar Dec 03 '21 20:12 atamazov

Same symptoms on gfx906, ROCm 4.0.

atamazov avatar Dec 03 '21 20:12 atamazov

Tried the following commits on gfx906, ROCm 4.0:

  • ca809fee7 -- just prior 6370fbedf Pooling forward solvers (except GetWorkspaceSize) (#1284)
    • NOV 17, 2021
  • e2389abf0 -- just prior 79ba89316 Dynamic pooling (#515)
    • JUN 16, 2020
  • 3e3143ca2 -- just prior 2dd98e759 Pooling: add max index mode & disable padding mode test (#1239)
    • JUN 12, 2020

The issue exists in all these commits. It is possible that this is not a real failure but a bug in the verification code of the driver (which means that this issue may be unrelated to SWDEV-313696)

Participation of pooling code experts is needed.

atamazov avatar Dec 03 '21 21:12 atamazov

@atamazov Tensordesc is correct while the NCHW numbers are interpreted incorrectly, (if this is the case) unlikely it won't be caught sooner I guess.

MIOpen(HIP): tensorDesc = 32, 192, 27, 27 MIOpen(HIP): n = 32764 MIOpen(HIP): c = 319990760 MIOpen(HIP): h = 3 MIOpen(HIP): w = 0

update: just saw your recent comments, ooops.

junliume avatar Dec 03 '21 21:12 junliume

Update to https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1325#issuecomment-985867423:

  • ~It may worth to check also d9079e71a (which immediately precedes 5b31a8c07 ProblemDescription refactor (#2399), JAN 15, 2020), but I run into cmake issues and gave up.~ -- NO NEED, 5b31a8c07 contains trivial pooling changes.

atamazov avatar Dec 03 '21 21:12 atamazov

You only see errors in "-F 2" is because by running backward only you are missing info calculated from forward. Try "-F 0" and you will get correct results. The frameworks problem may lead by failing to pass workspace to backward in pooling.

ce1adon avatar Dec 06 '21 23:12 ce1adon

It's a red herring.

ce1adon avatar Dec 06 '21 23:12 ce1adon

@ce1adon we will log MIOpenDriver cmd with MIOPEN_ENABLE_LOGGING_CMD=1 while running model from tf/pytorch (this pooling cmd is found while we log cmd from framework, not we play with MIOpenDriver by ourselves). If the pooling will fail only in -F 2 case, I highly recommend you fix this by for example give it a random input number. Otherwise people will get confused while debug framework level stuffs. Which usually will take a lot of time to run.

carlushuang avatar Dec 07 '21 06:12 carlushuang

@ce1adon we will log MIOpenDriver cmd with MIOPEN_ENABLE_LOGGING_CMD=1 while running model from tf/pytorch (this pooling cmd is found while we log cmd from framework, not we play with MIOpenDriver by ourselves). If the pooling will fail only in -F 2 case, I highly recommend you fix this by for example give it a random input number. Otherwise people will get confused while debug framework level stuffs. Which usually will take a lot of time to run.

The workspace value is calculated indices. They cannot be random.

ce1adon avatar Dec 07 '21 18:12 ce1adon

@carlushuang @atamazov and @ce1adon any recommendations for enhancement or to fix this verification issue?

junliume avatar Dec 08 '21 03:12 junliume

I still recommend to mimic(by some kind of randomness) the indices inside the workspace buffer, and let backward pooling run based on this artificial indices buffer. In this case backward pooling can be run in a unit-test manner, which is very useful for our framework level debug. Otherwise we have to tell our users whenever they run this backward pooling only and see a failure. Or maybe document this fail behavior somewhere

carlushuang avatar Dec 08 '21 03:12 carlushuang

MIOpenDriver pool* to issue a descriptive error message when -F 2 is incorrectly used (verification must be disabled).

atamazov avatar Dec 08 '21 17:12 atamazov

You only see errors in "-F 2" is because by running backward only you are missing info calculated from forward.

@DrizztDoUrden could we follow up a bit on the warning message? If -F 2 is used without forward results and data, then we should issue such warning from MIOpenDriver. CC: @atamazov

junliume avatar Jan 21 '22 00:01 junliume

I guess we could.

DrizztDoUrden avatar Jan 21 '22 11:01 DrizztDoUrden

@carlushuang @DrizztDoUrden Is this fixed with latest ROCm 6.0.2 (HIP 6.0.32831)? If resolved, please close ticket. Thanks!

ppanchad-amd avatar Apr 15 '24 19:04 ppanchad-amd