OpenMP offload performance is worse than sequential CPU performance
Summary
In the OpenMP reduction sample, sequential CPU execution calculates pi in less time than both the OpenMP CPU version and the OpenMP GPU offload version.
Version
Samples commit id: 4bed52e76ceb17243a0bc4ce24e9aed52aaa6e49
Environment
Ubuntu 20.04, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz
Steps to reproduce
Build and run the OpenMP reduction sample.
Observed behavior
Compare the time reported for each variant. OpenMP offload is the slowest of the three.
Expected behavior
OpenMP offload should outperform both OpenMP CPU and sequential CPU execution.
Verified Fix
Increasing the number of steps by a factor of 100 brings the timings in line with expectations: OpenMP offload is faster than OpenMP CPU, which in turn is faster than sequential CPU. Developers should not be presented with a poorly performing offload the first time they run the sample. The proposed fix is to increase num_steps and add some detail about it to the README; mentioning it in the README alone is an alternative, but doing both is preferable. It would also be good to expose num_steps as a tunable parameter with a default value so that developers can experiment with it.
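The tunable num_steps suggestion could look roughly like the sketch below. It is not the sample's actual code: it assumes the usual midpoint-rule integration of 4/(1+x^2) over [0,1], and the names `kDefaultNumSteps`, `parse_num_steps`, `compute_pi_*`, and `time_seconds` are hypothetical, as is the default value itself. The likely reason a larger step count helps is that fixed offload costs (device setup, data transfer) are spread over more work, though that has not been measured here.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>

// Hypothetical default; the sample's current value is not stated in this issue.
constexpr long kDefaultNumSteps = 100000000L;

// Read num_steps from argv[1]; fall back to the default if it is missing,
// not a number, or not positive.
long parse_num_steps(int argc, char** argv) {
    if (argc < 2) return kDefaultNumSteps;
    char* end = nullptr;
    const long n = std::strtol(argv[1], &end, 10);
    if (end == argv[1] || *end != '\0' || n <= 0) return kDefaultNumSteps;
    return n;
}

// Sequential baseline: midpoint-rule integration of 4/(1+x^2) on [0,1].
double compute_pi_seq(long num_steps) {
    assert(num_steps > 0);
    const double step = 1.0 / static_cast<double>(num_steps);
    double sum = 0.0;
    for (long i = 0; i < num_steps; ++i) {
        const double x = (static_cast<double>(i) + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}

// Same loop, parallelized across CPU threads with an OpenMP reduction.
double compute_pi_omp_cpu(long num_steps) {
    const double step = 1.0 / static_cast<double>(num_steps);
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < num_steps; ++i) {
        const double x = (static_cast<double>(i) + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}

// Same loop, offloaded to the default target device. The reduction variable
// is mapped back to the host explicitly.
double compute_pi_omp_offload(long num_steps) {
    const double step = 1.0 / static_cast<double>(num_steps);
    double sum = 0.0;
#pragma omp target teams distribute parallel for reduction(+ : sum) map(tofrom : sum)
    for (long i = 0; i < num_steps; ++i) {
        const double x = (static_cast<double>(i) + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}

// Wall-clock time, in seconds, of a single call to f().
template <typename F>
double time_seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A `main` would call `parse_num_steps(argc, argv)` once and pass the result to each variant inside `time_seconds`, so that running the binary with, say, `1000000` and then `100000000` shows how the ranking changes. With the Intel compiler the offload path would typically be built with something like `icpx -fiopenmp -fopenmp-targets=spir64`; without OpenMP enabled the pragmas are ignored and all three functions run sequentially.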