Jetson Nano with Tegra X1 and PoCL 3.0 works with neither CPU nor GPU
The older Jetson Nano from 2019 (Maxwell architecture, compute capability 5.3) does not have an OpenCL driver. As a workaround, PoCL 3.0 can be compiled on the Jetson; my procedure can be found at github.com/kreier/jetson. The errors below could therefore be related to PoCL, but maybe there is a fix that lets this benchmark run. Here are my observations after compiling:
mk@nano:~/download/OpenCL-Benchmark$ ./make.sh
/usr/bin/ld: skipping incompatible ./src/OpenCL/lib/libOpenCL.so when searching for -lOpenCL
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID 0 | pthread-cortex-a57 |
| Device ID 1 | NVIDIA Tegra X1 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | pthread-cortex-a57 |
| Device Vendor | ARM |
| Device Driver | 3.0-rc2 (Linux) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 4 at 1479 MHz (4 cores, 0.189 TFLOPs/s) |
| Memory, Cache | 2972 MB RAM, 2048 KB global / 32 KB local |
| Buffer Limits | 1024 MB global, 32 KB constant |
|----------------'------------------------------------------------------------|
2 errors generated.
| Warning: error: /home/mk/.cache/pocl/kcache/tempfile_cpbIbi.cl:31:1: |
| implicit declaration of function 'asm' is invalid in OpenCL error: |
| /home/mk/.cache/pocl/kcache/tempfile_cpbIbi.cl:31:32: expected ')' |
| Device pthread-cortex-a57 failed to build the program |
| Error: OpenCL C code compilation failed with error code -11. Make sure |
| there are no errors in kernel.cpp. |
'-----------------------------------------------------------------------------'
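The "skipping incompatible" message from ld usually means the library it found was built for a different architecture or word size than the one being linked. A minimal sketch for checking this, assuming it is run from the OpenCL-Benchmark directory and that the `file` utility is installed (the path is copied from the linker message, everything else is illustrative):

```shell
# Hypothetical check: compare the machine architecture with the bundled library.
lib=./src/OpenCL/lib/libOpenCL.so   # path taken from the linker message
echo "machine: $(uname -m)"         # a Jetson Nano is expected to report aarch64
if [ -e "$lib" ] && command -v file >/dev/null 2>&1; then
    file -L "$lib"                  # prints the ELF class and target architecture
else
    echo "cannot inspect $lib (file missing or 'file' not installed)"
fi
```

If the two architectures differ, the linker skips that copy and continues searching its other library paths, which would be consistent with the build finishing anyway.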
The CPU version throws an error claiming invalid OpenCL code. I have no idea how to fix that, or why libOpenCL.so is reported as incompatible. So I tried to run on the GPU:
mk@nano:~/download/OpenCL-Benchmark$ ./bin/OpenCL-Benchmark 1
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID 0 | pthread-cortex-a57 |
| Device ID 1 | NVIDIA Tegra X1 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 1 |
| Device Name | NVIDIA Tegra X1 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 3.0-rc2 (Linux) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 1 at 921 MHz (128 cores, 0.236 TFLOPs/s) |
| Memory, Cache | 3962 MB RAM, 0 KB global / 48 KB local |
| Buffer Limits | 990 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Error: Memory size is too large at 1024 MB. Device "NVIDIA Tegra X1" |
| accepts a maximum buffer size of 990 MB. |
'-----------------------------------------------------------------------------'
This looks much better! The Jetson Nano has 4 GB of unified memory, so memory should not be a problem. Setting export POCL_MEMORY_LIMIT=2 had no effect. Is there a way to change the amount of available GPU memory? My Quadro M1000M has only 2 GB and reports a global buffer limit of only 497 MB, yet the benchmark compiles and runs without problems:
mk@zbook:~/Downloads/OpenCL-Benchmark$ ./make.sh
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID 0 | Quadro M1000M |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | Quadro M1000M |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 550.163.01 (Linux) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 4 at 1071 MHz (512 cores, 1.097 TFLOPs/s) |
| Memory, Cache | 1990 MB VRAM, 96 KB global / 48 KB local |
| Buffer Limits | 497 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 0.036 TFLOPs/s (1/32) |
| FP32 compute 0.736 TFLOPs/s (2/3 ) |
| FP16 compute not supported |
| INT64 compute 0.195 TIOPs/s (1/8 ) |
| INT32 compute 0.308 TIOPs/s (1/4 ) |
| INT16 compute 1.074 TIOPs/s ( 1x ) |
| INT8 compute 0.209 TIOPs/s (1/4 ) |
| Memory Bandwidth ( coalesced read ) 72.55 GB/s |
| Memory Bandwidth ( coalesced write) 73.27 GB/s |
| Memory Bandwidth (misaligned read ) 24.91 GB/s |
| Memory Bandwidth (misaligned write) 10.80 GB/s |
| PCIe Bandwidth (send ) 9.07 GB/s |
| PCIe Bandwidth ( receive ) 8.91 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 8.65 GB/s |
|-----------------------------------------------------------------------------|
'-----------------------------------------------------------------------------'
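Comparing the two GPU tables, the reported buffer limit looks like one quarter of the reported device memory, rounded down. This is only an observation from the numbers in this report, not a documented PoCL rule, and the CPU device does not follow it (1024 MB out of 2972 MB). A small sketch of the arithmetic, with variable names chosen purely for illustration:

```shell
# Values in MB, copied from the device tables in this report.
requested=1024          # buffer size named in the Jetson error message
tegra_global=3962       # "Memory, Cache" of the NVIDIA Tegra X1
m1000m_global=1990      # "Memory, Cache" of the Quadro M1000M

# Shell arithmetic uses integer division, so the results are rounded down.
tegra_limit=$((tegra_global / 4))
m1000m_limit=$((m1000m_global / 4))

echo "Tegra X1 quarter:      ${tegra_limit} MB"
echo "Quadro M1000M quarter: ${m1000m_limit} MB"
if [ "$requested" -gt "$tegra_limit" ]; then
    echo "request of ${requested} MB exceeds the Tegra X1 limit by $((requested - tegra_limit)) MB"
fi
```

The computed quarters, 990 MB and 497 MB, match the reported "Buffer Limits" lines, and the 1024 MB request in the Jetson error message is 34 MB over the Tegra X1 limit.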
Another data point comes from a second OpenCL benchmark, clpeak (https://github.com/krrishnarraj/clpeak). It returned results for global memory bandwidth, single-precision compute, half-precision compute (CPU only), double-precision compute and integer compute, and the measured values match the expected ones. So, to a degree, the OpenCL functions are available and can be used for computing and benchmarking with PoCL 3.0, on both CPU and GPU.