Runtime error: Check failed: IsAligned()
Thanks for the great work on tensorflow-opencl; it's much appreciated.
Summary
I'm getting a runtime error for almost all tensorflow programs:
2017-03-18 12:52:52.241954: F ./tensorflow/core/framework/tensor.cc:488] Check failed: IsAligned()
Aborted (core dumped)
Environment Description
I have an Intel Broadwell i5 CPU (x64) with an Intel HD Graphics 5500 GPU. I'm using Intel's OpenCL drivers from https://software.intel.com/en-us/articles/opencl-drivers.
The OS is Ubuntu 16.04 LTS.
Python version is 3.5. It's running in a conda environment using Anaconda's versions of python, numpy, scipy, pyyaml, h5py, pandas, and jupyter.
However, the tensorflow pip package was compiled against Ubuntu's versions of everything, as per the compile-from-source instructions: I disabled Anaconda by removing it from ~/.bashrc, compiled the pip package, re-enabled Anaconda, activated the conda environment, and installed the pip package into it.
Steps to Reproduce
Here's the only tensorflow program I tried that did not fail:
import random
import time

import tensorflow as tf

random_number_generator = random.SystemRandom()

NUM_ROWS = 1024
NUM_COLUMNS = 1024

# Fill both operands with random float64 values.
a_array = []
for i in range(NUM_ROWS * NUM_COLUMNS):
    a_array.append(random_number_generator.random())

b_array = []
for i in range(NUM_ROWS * NUM_COLUMNS):
    b_array.append(random_number_generator.random())

# Create the graph on the SYCL device.
with tf.device('/device:SYCL:0'):
    a = tf.constant(a_array, shape=[NUM_ROWS, NUM_COLUMNS], name='a', dtype=tf.float64)
    b = tf.constant(b_array, shape=[NUM_COLUMNS, NUM_ROWS], name='b', dtype=tf.float64)
    c = tf.matmul(a, b)

sess = tf.Session()
start = time.time()
sess.run(c)
print('matmul took %.3f seconds' % (time.time() - start))
Increasing NUM_ROWS and NUM_COLUMNS to just 1200 triggered the error above.
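Since the abort takes down the whole interpreter, a driver script along these lines can scan sizes in fresh subprocesses to find the failing threshold (just a sketch, using tf.ones instead of the random data above):

import subprocess
import sys

# Sketch: run a minimal SYCL matmul at several sizes, each in a fresh
# process, since the IsAligned() failure aborts the interpreter.
TEMPLATE = """
import tensorflow as tf
with tf.device('/device:SYCL:0'):
    c = tf.matmul(tf.ones([{n}, {n}], dtype=tf.float64),
                  tf.ones([{n}, {n}], dtype=tf.float64))
tf.Session().run(c)
"""

for n in (1024, 1100, 1200, 1300):
    result = subprocess.run([sys.executable, '-c', TEMPLATE.format(n=n)])
    print(n, 'ok' if result.returncode == 0 else 'crashed')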
I also installed Keras into the same conda environment with pip install keras and ran this script: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py. It failed with the same Check failed: IsAligned() error, which appears right after Build model... is printed to the console.
Commit Hash (git rev-parse HEAD)
dda6b4ee253ca3016841ff60b16df4be40b5b052
Bazel Version
Build label: 0.4.5
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Thu Mar 16 12:19:38 2017 (1489666778)
Build timestamp: 1489666778
Build timestamp as int: 1489666778
clinfo
Number of platforms 1
Platform Name Intel(R) OpenCL
Platform Vendor Intel(R) Corporation
Platform Version OpenCL 2.0
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_spir
Platform Extensions function suffix INTEL
Platform Name Intel(R) OpenCL
Number of devices 2
Device Name Intel(R) HD Graphics
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 2.0
Driver Version r4.0.59481
Device OpenCL C Version OpenCL C 2.0
Device Type GPU
Device Profile FULL_PROFILE
Max compute units 24
Max clock frequency 900MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types by <unknown> (0x7FF200000000)
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple 32
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 1 / 1
half 8 / 8 (cl_khr_fp16)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 13231777383 (12.32GiB)
Error Correction support No
Max memory allocation 4294959103 (4GiB)
Unified memory for Host and Device Yes
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing No
Fine-grained system sharing No
Atomics No
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Preferred alignment for atomics
SVM 64 bytes
Global 64 bytes
Local 64 bytes
Max size for global variable 65536 (64KiB)
Preferred total size of global vars 4294959103 (4GiB)
Global Memory cache type Read/Write
Global Memory cache size 589824
Global Memory cache line 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 268434943 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 4 bytes
Pitch alignment for 2D image buffers 4 bytes
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 128
Max number of read/write image args 128
Max number of pipe args 16
Max active pipe reservations 1
Max pipe packet size 1024
Local memory type Local
Local memory size 65536 (64KiB)
Max constant buffer size 4294959103 (4GiB)
Max number of constant args 8
Max size of kernel argument 1024
Queue properties (on host)
Out-of-order execution Yes
Profiling Yes
Queue properties (on device)
Out-of-order execution Yes
Profiling Yes
Preferred size 131072 (128KiB)
Max size 67108864 (64MiB)
Max queues on device 1
Max events on device 1024
Prefer user sync for interop Yes
Profiling timer resolution 80ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
SPIR versions 1.2
printf() buffer size 4194304 (4MiB)
Built-in kernels block_motion_estimate_intel;block_advanced_motion_estimate_check_intel;block_advanced_motion_estimate_bidirectional_check_intel
Motion Estimation accelerator version (Intel) 2
Device Available Yes
Compiler Available Yes
Linker Available Yes
Device Extensions cl_intel_accelerator cl_intel_advanced_motion_estimation cl_intel_device_side_avc_motion_estimation cl_intel_driver_diagnostics cl_intel_media_block_io cl_intel_motion_estimation cl_intel_planar_yuv cl_intel_packed_yuv cl_intel_required_subgroup_size cl_intel_subgroups cl_intel_va_api_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_khr_spir
Device Name Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 2.0 (Build 400)
Driver Version 1.2.0.400
Device OpenCL C Version OpenCL C 2.0
Device Type CPU
Device Profile FULL_PROFILE
Max compute units 4
Max clock frequency 2200MHz
Device Partition (core)
Max number of sub-devices 4
Supported partition types by counts, equally, by names (Intel)
Max work item dimensions 3
Max work item sizes 8192x8192x8192
Max work group size 8192
Preferred work group size multiple 128
Preferred / native vector sizes
char 1 / 32
short 1 / 16
int 1 / 8
long 1 / 4
half 0 / 0 (n/a)
float 1 / 8
double 1 / 4 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 16550207488 (15.41GiB)
Error Correction support No
Max memory allocation 4137551872 (3.853GiB)
Unified memory for Host and Device Yes
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing No
Fine-grained system sharing No
Atomics No
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Preferred alignment for atomics
SVM 64 bytes
Global 64 bytes
Local 0 bytes
Max size for global variable 65536 (64KiB)
Preferred total size of global vars 65536 (64KiB)
Global Memory cache type Read/Write
Global Memory cache size 262144
Global Memory cache line 64 bytes
Image support Yes
Max number of samplers per kernel 480
Max size for 1D images from buffer 258596992 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 64 bytes
Pitch alignment for 2D image buffers 64 bytes
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 480
Max number of write image args 480
Max number of read/write image args 480
Max number of pipe args 16
Max active pipe reservations 65535
Max pipe packet size 1024
Local memory type Global
Local memory size 32768 (32KiB)
Max constant buffer size 131072 (128KiB)
Max number of constant args 480
Max size of kernel argument 3840 (3.75KiB)
Queue properties (on host)
Out-of-order execution Yes
Profiling Yes
Local thread execution (Intel) Yes
Queue properties (on device)
Out-of-order execution Yes
Profiling Yes
Preferred size 4294967295 (4GiB)
Max size 4294967295 (4GiB)
Max queues on device 4294967295
Max events on device 4294967295
Prefer user sync for interop No
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Available Yes
Compiler Available Yes
Linker Available Yes
Device Extensions cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [INTEL]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform
computecpp_info
********************************************************************************
ComputeCpp Info (CE 0.1.2)
********************************************************************************
Toolchain information:
GLIBCXX: 20150426
This version of libstdc++ is supported.
********************************************************************************
Device Info:
Discovered 1 devices matching:
platform : <any>
device type : <any>
--------------------------------------------------------------------------------
Device 0:
Device is supported : UNTESTED - Device not tested on this OS
CL_DEVICE_NAME : Intel(R) HD Graphics
CL_DEVICE_VENDOR : Intel(R) Corporation
CL_DRIVER_VERSION : r4.0.59481
CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU
********************************************************************************
********************************************************************************
********************************************************************************
Hi @jarrellmark
Thanks for reporting this!
That has been addressed in 5cc8cdd58f3324c81eeab3e9a0af47754716e7fc. Could you please try it out?
The SYCL default allocator did not take alignment into consideration. That has now been addressed in Eigen, where we pass the required alignment through to the custom allocator. C++ is great!
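To illustrate what the check boils down to: a buffer passes IsAligned() only if its base address is a multiple of the required alignment. A rough numpy sketch (is_aligned here is a hypothetical stand-in, not TensorFlow's actual implementation):

import numpy as np

# Hypothetical stand-in for TensorFlow's IsAligned(): the buffer passes
# only if its base address is a multiple of the required alignment.
def is_aligned(array, alignment=64):
    address = array.__array_interface__['data'][0]
    return address % alignment == 0

x = np.empty(1024, dtype=np.float64)
print(is_aligned(x))  # may be False if the allocator ignores alignment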
Thanks,
Hey @lukeiwanski,
The IsAligned() message went away, but I'm getting this message now:
2017-03-20 21:30:14.765017: W ./tensorflow/core/common_runtime/sycl/sycl_util.h:44] No OpenCL GPU found that is supported by ComputeCpp, trying OpenCL CPU
Is there a way to force the GPU?
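I.e., is there something along these lines that would fail loudly instead of falling back? (A sketch; allow_soft_placement is the standard TensorFlow placement option, nothing SYCL-specific.)

import tensorflow as tf

# Sketch: pin the op to the SYCL device and disable soft placement so
# TensorFlow raises an error instead of silently falling back to the CPU.
config = tf.ConfigProto(allow_soft_placement=False, log_device_placement=True)
with tf.device('/device:SYCL:0'):
    c = tf.matmul(tf.ones([4, 4]), tf.ones([4, 4]))
with tf.Session(config=config) as sess:
    print(sess.run(c))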
Currently we have an issue with memory alignment on Intel GPUs and have set the Intel GPU as "blacklisted" in Eigen. This means Eigen will not try to target Intel GPUs at the moment. We are working on a resolution for this and will update you when we have a fix available.
Thanks, Luke.
I appreciate it and am excited about the progress that tensorflow-opencl is making.
Hi @lukeiwanski,
I'm having the same issue and was wondering if you have added Eigen support for Intel GPUs yet. If not, is there some way I can un-blacklist the Intel GPU?
Thanks for your hard work on this project!
Can you give it a spin on this branch: https://github.com/lukeiwanski/tensorflow/tree/dev/eigen_mehdi ?
That fixed the error, thanks a lot!
An unrelated question: TensorFlow keeps telling me I'm running on a SYCL device, but then it calls that device a CPU. When I run sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)), I get the following output:
/job:localhost/replica:0/task:0/device:SYCL:0 -> id: 0, type: CPU, name: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz, vendor: Intel(R) Corporation, profile: FULL_PROFILE
Running tensorflow.python.client.device_lib.list_local_devices() gives me the following:
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality { }
incarnation: 177593382533810523
, name: "/device:SYCL:0"
device_type: "SYCL"
memory_limit: 268435456
locality { }
incarnation: 1258559034356206920
physical_device_desc: "id: 0, type: CPU, name: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz, vendor: Intel(R) Corporation, profile: FULL_PROFILE"
]
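A minimal filter over that list (same device_lib call as above) makes the mismatch easier to see:

from tensorflow.python.client import device_lib

# Print just the SYCL entries so physical_device_desc can be compared
# against clinfo's device list.
for d in device_lib.list_local_devices():
    if d.device_type == 'SYCL':
        print(d.name, '->', d.physical_device_desc)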
However, this device is NOT my GPU, as can be seen when I run clinfo:
Platform Name Intel(R) OpenCL
Number of devices 2
Device Name Intel(R) HD Graphics
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 2.0
Driver Version r5.0.63503
Device OpenCL C Version OpenCL C 2.0
Device Type GPU
Device Profile FULL_PROFILE
. . .
Device Name Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 2.0 (Build 475)
Driver Version 1.2.0.475
Device OpenCL C Version OpenCL C 2.0
Device Type CPU
Device Profile FULL_PROFILE
Thanks for all your help already!
However, I later get this error when I try to run a simple Keras model (just two dense layers):
InternalError: Unknown error detected on device /job:localhost/replica:0/task:0/device:SYCL:0
That's interesting... could you provide code to reproduce that issue?
I'm having trouble reproducing this issue because the code now seems to just hang, printing a lot of messages like this:
./tensorflow/core/common_runtime/executor.cc:1556] Process node: 48 step 2 mul_3 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:SYCL:0"](beta_1/read, Variable/read) is dead: 0
But here's my code:

from keras.models import Sequential
from keras.layers import Dense

# timesteps, D_in, D_out, N, X_train, y_train, X_test, and y_test are
# defined earlier in my script.
model = Sequential()
model.add(Dense(32, input_shape=(timesteps, D_in)))
model.add(Dense(D_out))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=N, epochs=5, validation_data=(X_test, y_test))
Ah, OK, I've reproduced the earlier error by using LSTM layers. It may be unreasonable for me to expect LSTM layers to work, but I'm also having trouble with just dense layers (see above). Here's my code:
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.layers.wrappers import TimeDistributed

# As above, timesteps, D_in, D_out, N, and the train/test arrays are
# defined earlier in my script.
model = Sequential()
model.add(LSTM(32, return_sequences=True, input_dim=D_in, input_length=timesteps))
model.add(LSTM(32, return_sequences=True))
model.add(TimeDistributed(Dense(D_out, activation='softmax')))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=N, epochs=1, validation_data=(X_test, y_test))
And here's a trace of the error message:
InternalError Traceback (most recent call last)
/home/nicholas/.virtualenvs/tensorflow-luke/local/lib/python2.7/site-packages/keras/models.pyc in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, **kwargs)
    861                              class_weight=class_weight,
    862                              sample_weight=sample_weight,
--> 863                              initial_epoch=initial_epoch)
    864
    865     def evaluate(self, x, y, batch_size=32, verbose=1,

/home/nicholas/.virtualenvs/tensorflow-luke/local/lib/python2.7/site-packages/keras/engine/training.pyc in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, **kwargs)
   1428                              val_f=val_f, val_ins=val_ins, shuffle=shuffle,
   1429                              callback_metrics=callback_metrics,
--> 1430                              initial_epoch=initial_epoch)
   1431
   1432     def evaluate(self, x, y, batch_size=32, verbose=1, sample_weight=None):

/home/nicholas/.virtualenvs/tensorflow-luke/local/lib/python2.7/site-packages/keras/engine/training.pyc in _fit_loop(self, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch)
   1077                 batch_logs['size'] = len(batch_ids)
   1078                 callbacks.on_batch_begin(batch_index, batch_logs)
-> 1079                 outs = f(ins_batch)
   1080                 if not isinstance(outs, list):
   1081                     outs = [outs]

/home/nicholas/.virtualenvs/tensorflow-luke/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.pyc in __call__(self, inputs)
   2266         updated = session.run(self.outputs + [self.updates_op],
   2267                               feed_dict=feed_dict,
--> 2268                               **self.session_kwargs)
   2269         return updated[:len(self.outputs)]
   2270

/home/nicholas/.virtualenvs/tensorflow-luke/local/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
    887     try:
    888       result = self._run(None, fetches, feed_dict, options_ptr,
--> 889                          run_metadata_ptr)
    890       if run_metadata:
    891         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/home/nicholas/.virtualenvs/tensorflow-luke/local/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1116     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1117       results = self._do_run(handle, final_targets, final_fetches,
--> 1118                              feed_dict_tensor, options, run_metadata)
   1119     else:
   1120       results = []

/home/nicholas/.virtualenvs/tensorflow-luke/local/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1313     if handle is None:
   1314       return self._do_call(_run_fn, self._session, feeds, fetches, targets,
--> 1315                            options, run_metadata)
   1316     else:
   1317       return self._do_call(_prun_fn, self._session, handle, feeds, fetches)

/home/nicholas/.virtualenvs/tensorflow-luke/local/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
   1332       except KeyError:
   1333         pass
--> 1334       raise type(e)(node_def, op, message)
   1335
   1336   def _extend_graph(self):
InternalError: Unknown error detected on device /job:localhost/replica:0/task:0/device:SYCL:0
Hi @lukeiwanski
I am having the same issue (Check failed: IsAligned()) with tf-coriander (https://github.com/hughperkins/tf-coriander) using a Mali T-728 GPU.
Do you have a patch I could try? Or any advice on how to go about fixing it? Thanks!
Ping on this issue