computecpp-sdk icon indicating copy to clipboard operation
computecpp-sdk copied to clipboard

ptxas fatal : Unresolved extern function '_Z5rsqrtf'

Open j-stephan opened this issue 7 years ago • 14 comments

I am trying to get cl::sycl::rsqrt() to run on a GTX 1080 Ti. Running the example code pasted below I get the following error at runtime:

Error: [ComputeCpp:RT0100] Failed to build program (<Build log for program 0x24e9950 device 0 (size 56):
ptxas fatal   : Unresolved extern function '_Z5rsqrtf'

I know the PTX backend is currently experimental - is this one of the functions that is known to not work?

Example code:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <random>
#include <vector>

#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wunused-parameter"
#include <CL/sycl.hpp>
#pragma clang diagnostic pop

auto main() -> int
{
    using namespace cl::sycl;

    try
    {
        auto gen = std::mt19937{std::random_device{}()};
        auto dis = std::uniform_real_distribution<float>{-42.f, 42.f};

        auto platforms = platform::get_platforms();
        std::cout << "Available platforms: " << std::endl;
        for(auto i = 0u; i < platforms.size(); ++i)
        {
            auto&& p = platforms[i];
            auto vendor = p.get_info<info::platform::vendor>();
            auto name = p.get_info<info::platform::name>();
            std::cout << "\t[" << i << "] Vendor: " << vendor << ", "
                      << "name: " << name << std::endl;
        }

        std::cout << std::endl;
        std::cout << "Select platform: ";
        auto index = 0u;
        std::cin >> index;

        if(index >= platforms.size())
        {
            std::cout << "I'm sorry, Dave. I'm afraid I can't do that."
                      << std::endl;
            return EXIT_FAILURE;
        }

        auto&& platform = platforms[index];

        // set up default context
        auto ctx = context{platform};

        // set up default accelerator
        auto accelerators = ctx.get_devices();
        std::cout << "Available accelerators: " << std::endl;
        for(auto i = 0u; i < accelerators.size(); ++i)
        {
            auto&& acc = accelerators[i];
            auto vendor = acc.get_info<info::device::vendor>();
            auto name = acc.get_info<info::device::name>();
            std::cout << "\t[" << i << "] Vendor: " << vendor << ", "
                      << "name: " << name << std::endl;
        }

        std::cout << std::endl;
        std::cout << "Select accelerator: ";
        std::cin >> index;

        if(index >= accelerators.size())
        {
          std::cout << "I'm sorry, Dave. I'm afraid I can't do that."
                    << std::endl;
          return EXIT_FAILURE;
        }

        auto acc = accelerators[index];

        // create queue on device
        auto exception_handler = [] (exception_list exceptions)
        {
          for(std::exception_ptr e : exceptions)
          {
              try
              {
                  std::rethrow_exception(e);
              }
              catch(const cl::sycl::exception& err)
              {
                  std::cerr << "Caught asynchronous SYCL exception: "
                            << err.what() << std::endl;
              }
          }
        };

        auto queue = cl::sycl::queue{acc, exception_handler,
                                   property::queue::enable_profiling{}};

        auto data = std::vector<float>{};
        auto result = std::vector<float>{};

        data.resize(2048);
        result.resize(2048);

        std::generate(begin(data), end(data), [&]() { return dis(gen); });
        std::fill(begin(result), end(result), 0.f);

        auto d_data = buffer<float, 1>{begin(data), end(data)};
        auto d_result = buffer<float, 1>{begin(result), end(result)};

        queue.submit([&] (handler& cgh)
        {
            auto data_acc = d_data.get_access<access::mode::read,
                                              access::target::global_buffer>(cgh);
            auto result_acc = d_result.get_access<access::mode::discard_write,
                                                  access::target::global_buffer>(cgh);

            cgh.parallel_for<class dummy>(nd_range<1>{range<1>{2048}, range<1>{1024}},
            [=](nd_item<1> my_item)
            {
                auto id = my_item.get_global_id();
                result_acc[id] = rsqrt(data_acc[id]);
            });
        });
    }
    catch(const cl::sycl::exception& e)
    {
        std::cerr << e.what() << std::endl;
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}

Compiler command line:

$ compute++ rsqrt.cpp -I/opt/sycl/computecpp/include -L/opt/sycl/computecpp/lib -lComputeCpp -sycl-driver -sycl-target ptx64

Compiler version:

$ compute++ --version
Codeplay ComputeCpp - CE 1.0.3 Device Compiler - clang version 6.0.0 ([email protected]:sycl/clang.git bdb9ca7694da1840ced2db3aaeff545d89069cf7) ([email protected]:sycl/llvm.git d5ee99ce31ef67e801f1a7176b1bd2683f9a09ed) (based on LLVM 6.0.0svn)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/sycl/computecpp/bin

computecpp_info:

********************************************************************************

ComputeCpp Info (CE 1.0.3)

SYCL 1.2.1 revision 3

********************************************************************************

Toolchain information:

GLIBC version: 2.27
GLIBCXX: 20160609
This version of libstdc++ is supported.

********************************************************************************


Device Info:

Discovered 3 devices matching:
  platform    : <any>
  device type : <any>

--------------------------------------------------------------------------------
Device 0:

  Device is supported                     : NO - Device does not support SPIR
  CL_DEVICE_NAME                          : GeForce GTX 1080 Ti
  CL_DEVICE_VENDOR                        : NVIDIA Corporation
  CL_DRIVER_VERSION                       : 410.79
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU 

If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v1.0.3/platform-support-notes

********************************************************************************

j-stephan avatar Jan 09 '19 11:01 j-stephan

Hi @j-stephan, thanks for your very comprehensive bug report! Sadly, most built-in functions are not implemented in the ptx backend, so it is the exception rather than the norm that any given built-in will work (I think those that work are pow and sqrt, but I'd need to go looking - I can do this if you like).

If you would like to use built-in functions on nvidia hardware, you might be able to try pocl. In the past, some people have managed to get some bits and pieces running through their driver (including myself, though I've not tried recently). If you do give it a go, please let us know!

DuncanMcBain avatar Jan 10 '19 12:01 DuncanMcBain

Hi Duncan,

I think those that work are pow and sqrt, but I'd need to go looking - I can do this if you like

Only if it is not too much work for you. I'm currently experimenting with CUDA, ROCm and SYCL in order to find similarities, differences and obstacles. For this I'm creating some benchmarks as well - I wanted to use rsqrt for an n-body algorithm, for example. A list of supported and unsupported built-ins would greatly help me in the future.

Edit: Out of interest: Why would sqrt work, but rsqrt not? According to the CUDA math guide sqrt is implemented by using rsqrt internally.

If you would like to use built-in functions on nvidia hardware, you might be able to try pocl. In the past, some people have managed to get some bits and pieces running through their driver (including myself, though I've not tried recently). If you do give it a go, please let us know!

I just tried it out, the error looks very similar:

Error: [ComputeCpp:RT0100] Failed to build program (<Build log for program 0x2471a40 device 0 (size 73):
Error(s) while linking: 
Cannot find symbol _Z5rsqrtf in kernel library

j-stephan avatar Jan 10 '19 13:01 j-stephan

Follow-up: 1.f / sqrt(x) doesn't work as well. When I compile for ptx64, I get the following error:

Error: [ComputeCpp:RT0100] Failed to build program (<Build log for program 0x23289d0 device 0 (size 69):
ptxas fatal   : Unresolved extern function '_Z17get_global_offsetj'

If I try it with Pocl (spir64) it complains about sqrt in a similar way to the rsqrt problem:

Error: [ComputeCpp:RT0100] Failed to build program (<Build log for program 0x27439a0 device 0 (size 72):
Error(s) while linking: 
Cannot find symbol _Z4sqrtf in kernel library

j-stephan avatar Jan 10 '19 13:01 j-stephan

Another follow-up: The error with 1.f / sqrt() for the PTX backend disappears when changing the optimization level. It appears for -O0 and -O1 and disappears once -O2 or -O3 are switched on. 1.f / sqrt() works for ptx64 in this case, the program exits without crashing.

The optimization level doesn't affect the error generated when using Pocl via spir64. In this case, 1.f / sqrt() still fails with the error mentioned above.

j-stephan avatar Jan 10 '19 15:01 j-stephan

Hi @j-stephan, the problem is the way that we hook the user-side SYCL functions (like cl::sycl::sqrt) to the actual real device functions. We have it totally implemented for SPIR - which is why you see things like _Z4sqrtf - but we don't have it implemented for ptx.

For pocl, I am not totally sure what's happening. I've seen similar issues in the past when trying to use their CUDA backend. I did a little digging and saw that their functions look differently-mangled than what ComputeCpp outputs - but as far as I am aware, we're outputting the correct mangling (at least it works elsewhere). I have been meaning to contact the developers of pocl but haven't had a chance, but you could try their github to see if they have any ideas.

I was going to mention about the get_offset call. Technically in the device code, there are some thread id functions that are called but might not be used. When optimisation is turned on, the compiler can recognise that the data isn't actually used, and therefore can remove the get_offset calls (among others) entirely. FWIW, we always recommend running the device compiler with at least -O2. There are very few situations where you wouldn't want to do this in my opinion.

DuncanMcBain avatar Jan 10 '19 16:01 DuncanMcBain

Thanks @DuncanMcBain, that cleared things up a bit. If I understand this correctly, all that is missing for the PTX backend to work is the correct mapping of cl::sycl::rsqrt() to PTX's rsqrt.f32 in the device compiler? Or are there some more layers in between, e.g. NVIDIA's OpenCL implementation of rsqrt?

OT: Is there a roadmap or so with regard to full PTX support? And I've seen that compute++ seems to be able to target amdgcn. Since this option is not officially documented anywhere AFAIK and I failed to get my codes to run on a Vega 64 or some Fiji-era GPUs, I assume there is no real support yet?

I have been meaning to contact the developers of pocl but haven't had a chance, but you could try their github to see if they have any ideas.

Sounds good, I'll get in touch with them.

FWIW, we always recommend running the device compiler with at least -O2. There are very few situations where you wouldn't want to do this in my opinion.

I agree. For some reason I wrongly assumed that -O2 would be the default.

j-stephan avatar Jan 10 '19 16:01 j-stephan

The PTX backend is missing other bits and pieces, like images. That said, correctly mapping all the builtin functions would go a significant way towards "full" PTX support. From there, it should just work. Unfortunately, I don't know when that is planned on the roadmap. Our main targets are SPIR and SPIRV, since more platforms consume these formats. Similarly, there are plans to have GCN output to the extent that we list it in the output formats, but I don't think it's ready to actually hook up yet. I don't believe we have a release date in mind other than "when it's ready", I'm afraid. We'll be sure to announce it through the normal channels when we do enable it.

I'll keep an eye out on the pocl repository then, and please let me know if there's any more information we can provide.

Defaulting to a higher optimisation level isn't necessarily a bad idea actually, but I guess we're trying to stay as close to a normal compiler as possible.

DuncanMcBain avatar Jan 10 '19 16:01 DuncanMcBain

Hi @j-stephan, I did a little research into the pocl situation last week and it looks like they don't support SPIR on NVIDIA devices at the moment owing to some thorny problems with linking modules together. I'm guessing this currently is not properly handled, resulting in weird link errors like we're seeing.

It would still be nice to know from them if they're thinking of reintroducing support in the long term, but for now my guess is that it just won't work (which is a real shame).

DuncanMcBain avatar Jan 28 '19 12:01 DuncanMcBain

Hi @DuncanMcBain,

thanks, that is good to know. I didn't have time to get in touch with Pocl yet, but I'll keep you updated once my current work is done.

j-stephan avatar Jan 28 '19 12:01 j-stephan

I poked around a little more. Surprisingly, most OpenCL math functions are supported right now, the exceptions being ceil, cos, fabs, floor, rint, round, rsqrt, sin and trunc. Of these cos and sin fail with a different error message (optimization level -O3):

Error: [ComputeCpp:RT0100] Failed to build program (<Build log for program 0x2aba710 device 0 (size 69):
ptxas fatal   : Unresolved extern function '_Z17get_global_offsetj'


>


)

I also tried the cl::sycl::native functions. Except for divide, recip, rsqrt and sqrt all of these cause compute++ to crash. Is this related to the issue here or should I open a new bug report for that?

j-stephan avatar Feb 01 '19 17:02 j-stephan

For the first, that shouldn't really change with the maths function used - that is saying that you're asking for the offset of the work-item ID (which is normally zero, but you can still query it). Either you're calling that function, or something isn't inlining properly. If you have a repro, we can take a look.

Crashes are bad! Do you get the message about contacting Codeplay after catching a signal? If so, please do provide us a reproduction, we'll pass it to our compiler guys who can take a look!

DuncanMcBain avatar Feb 01 '19 20:02 DuncanMcBain

For reproduction, use this code:

https://github.com/j-stephan/fpg/blob/master/examples/n-body/sycl/nbody.cpp

Change line 57 to one of the mentioned functions (both sin/cos or the native ones), compile with ComputeCpp 1.0.5:

compute++ nbody.cpp -I/opt/sycl/computecpp/include -L/opt/sycl/computecpp/lib -lComputeCpp -sycl-driver -sycl-target ptx64 -no-serial-memop -O3 -Wall -Wextra -pedantic -std=c++17 -o nbody

Execute and the errors / crashes should appear.

j-stephan avatar Feb 01 '19 21:02 j-stephan

Thanks @j-stephan, I can confirm the crash happens, though I'm trying to reduce it down as a test case at the moment.

DuncanMcBain avatar Feb 04 '19 14:02 DuncanMcBain

Phew, resurrecting old issues - I have a small repro case and have passed it along to our compiler team. I'll update here with any other information.

DuncanMcBain avatar Oct 16 '19 12:10 DuncanMcBain