Jetson Nano with Tegra X1 and PoCL 3.0 works neither with CPU nor GPU

Open kreier opened this issue 9 months ago • 0 comments

The older Jetson Nano from 2019 (Maxwell architecture, compute capability 5.3) does not have an OpenCL driver. As a workaround, PoCL 3.0 can be compiled on the Jetson; my procedure is at github.com/kreier/jetson. The errors below could therefore be caused by PoCL, but maybe there is a fix that lets this benchmark run. Here are my observations after compiling:

mk@nano:~/download/OpenCL-Benchmark$ ./make.sh
/usr/bin/ld: skipping incompatible ./src/OpenCL/lib/libOpenCL.so when searching for -lOpenCL
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | pthread-cortex-a57                                         |
| Device ID    1 | NVIDIA Tegra X1                                            |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | pthread-cortex-a57                                         |
| Device Vendor  | ARM                                                        |
| Device Driver  | 3.0-rc2 (Linux)                                            |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 4 at 1479 MHz (4 cores, 0.189 TFLOPs/s)                    |
| Memory, Cache  | 2972 MB RAM, 2048 KB global / 32 KB local                  |
| Buffer Limits  | 1024 MB global, 32 KB constant                             |
|----------------'------------------------------------------------------------|
2 errors generated.
| Warning: error: /home/mk/.cache/pocl/kcache/tempfile_cpbIbi.cl:31:1:        |
|          implicit declaration of function 'asm' is invalid in OpenCL error: |
|          /home/mk/.cache/pocl/kcache/tempfile_cpbIbi.cl:31:32: expected ')' |
|          Device pthread-cortex-a57 failed to build the program              |
| Error: OpenCL C code compilation failed with error code -11. Make sure      |
|        there are no errors in kernel.cpp.                                   |
'-----------------------------------------------------------------------------'
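
One way to narrow down the "skipping incompatible" linker message at the top of this log is to check which CPU architecture the bundled libOpenCL.so was built for. The sketch below reads the e_machine field of the ELF header. The helper name elf_machine is hypothetical, and the idea that the message comes from an architecture mismatch (for example an x86-64 library on the AArch64 Jetson) is an assumption I have not verified.

```python
import struct
import sys

# A few e_machine values from the ELF specification.
ELF_MACHINES = {0x03: "x86", 0x28: "ARM", 0x3E: "x86-64", 0xB7: "AArch64"}


def elf_machine(header: bytes) -> str:
    """Return the target architecture named in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little endian, 2 = big endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, after e_ident (16 bytes) and e_type.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, hex(machine))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python3 elf_machine.py ./src/OpenCL/lib/libOpenCL.so
    with open(sys.argv[1], "rb") as f:
        print(elf_machine(f.read(20)))
```

If the bundled library turns out to target a different architecture than the Jetson, the linker would simply skip it and keep searching for another libOpenCL.so, which would fit the fact that the build still produced a binary.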

The CPU run fails with an error claiming the OpenCL code is invalid. I have no idea what to do about it, or why libOpenCL.so is reported as incompatible. So I tried to run on the GPU:

mk@nano:~/download/OpenCL-Benchmark$ ./bin/OpenCL-Benchmark 1
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | pthread-cortex-a57                                         |
| Device ID    1 | NVIDIA Tegra X1                                            |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | NVIDIA Tegra X1                                            |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 3.0-rc2 (Linux)                                            |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 1 at 921 MHz (128 cores, 0.236 TFLOPs/s)                   |
| Memory, Cache  | 3962 MB RAM, 0 KB global / 48 KB local                     |
| Buffer Limits  | 990 MB global, 64 KB constant                              |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Error: Memory size is too large at 1024 MB. Device "NVIDIA Tegra X1"        |
|        accepts a maximum buffer size of 990 MB.                             |
'-----------------------------------------------------------------------------'

This looks much better! The Jetson Nano has 4 GB of unified memory, so memory should not be a problem. Setting export POCL_MEMORY_LIMIT=2 had no effect. Is there a way to change the amount of GPU memory that is reported as available? My Quadro M1000M has only 2 GB and reports a global buffer limit of only 497 MB, yet the benchmark compiles and runs there without problems:

mk@zbook:~/Downloads/OpenCL-Benchmark$ ./make.sh
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | Quadro M1000M                                              |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Quadro M1000M                                              |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 550.163.01 (Linux)                                         |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 4 at 1071 MHz (512 cores, 1.097 TFLOPs/s)                  |
| Memory, Cache  | 1990 MB VRAM, 96 KB global / 48 KB local                   |
| Buffer Limits  | 497 MB global, 64 KB constant                              |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.036 TFLOPs/s (1/32) |
| FP32  compute                                         0.736 TFLOPs/s (2/3 ) |
| FP16  compute                                          not supported        |
| INT64 compute                                         0.195  TIOPs/s (1/8 ) |
| INT32 compute                                         0.308  TIOPs/s (1/4 ) |
| INT16 compute                                         1.074  TIOPs/s ( 1x ) |
| INT8  compute                                         0.209  TIOPs/s (1/4 ) |
| Memory Bandwidth ( coalesced read      )                         72.55 GB/s |
| Memory Bandwidth ( coalesced      write)                         73.27 GB/s |
| Memory Bandwidth (misaligned read      )                         24.91 GB/s |
| Memory Bandwidth (misaligned      write)                         10.80 GB/s |
| PCIe   Bandwidth (send                 )                          9.07 GB/s |
| PCIe   Bandwidth (   receive           )                          8.91 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    8.65 GB/s |
|-----------------------------------------------------------------------------|
'-----------------------------------------------------------------------------'
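
Comparing the numbers in the logs above: on both NVIDIA devices the reported global buffer limit is one quarter of the reported memory size (990 of 3962 MB, 497 of 1990 MB), while the PoCL CPU device reports 1024 of 2972 MB. The sketch below reproduces that arithmetic with values copied from the logs. The one-quarter rule is only my reading of these numbers, not confirmed behavior of PoCL or the NVIDIA driver, and the function names are hypothetical.

```python
# Values copied from the benchmark logs above: (device, memory MB, buffer limit MB).
DEVICES = [
    ("NVIDIA Tegra X1", 3962, 990),
    ("Quadro M1000M", 1990, 497),
    ("pthread-cortex-a57", 2972, 1024),
]

REQUESTED_MB = 1024  # buffer size named in the Tegra X1 error message


def quarter_rule(memory_mb: int) -> int:
    """Assumed rule: buffer limit = one quarter of reported memory, rounded down."""
    return memory_mb // 4


def memory_needed_for(buffer_mb: int) -> int:
    """Memory a device would have to report for buffer_mb to fit under the assumed rule."""
    return buffer_mb * 4


for name, memory_mb, limit_mb in DEVICES:
    matches = quarter_rule(memory_mb) == limit_mb
    print(f"{name}: {memory_mb} MB / 4 = {quarter_rule(memory_mb)} MB, "
          f"reported {limit_mb} MB, rule matches: {matches}")

print(f"Memory needed for a {REQUESTED_MB} MB buffer: {memory_needed_for(REQUESTED_MB)} MB")
```

Under that assumption the Tegra X1 would have to report at least 4096 MB for a 1024 MB buffer to be accepted, which is slightly more than the 3962 MB it reports.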

Another data point comes from a different OpenCL benchmark: clpeak (https://github.com/krrishnarraj/clpeak) returned results for global memory bandwidth, single-precision compute, half-precision compute (CPU only), double-precision compute and integer compute. The measured values match the expected values. So, to a degree, the OpenCL functions are available and can be used for computing and benchmarking with PoCL 3.0, on both CPU and GPU.

Apr 21 '25 06:04 kreier