
Has anyone successfully built from source and run `test/blocksparse_matmul_test.py`?

Open ruiwang2uber opened this issue 6 years ago • 4 comments

Could you please share your exact setup and how you did it? I have been struggling to resolve this error since I compiled blocksparse from source.

If you managed to get pip install to work, please share that as well.

tensorflow.python.framework.errors_impl.NotFoundError: /home/ruiwang/anaconda3/envs/spinningup/lib/python3.6/site-packages/blocksparse/blocksparse_ops.so: undefined symbol: _ZNK10tensorflow8OpKernel4nameB5cxx11Ev

ruiwang2uber avatar Mar 14 '19 03:03 ruiwang2uber

Could the author consider releasing a Dockerfile?

ruiwang2uber avatar Mar 14 '19 05:03 ruiwang2uber

OK, after many trials, this is what worked for me on Ubuntu 18.04, CUDA 10 and Anaconda. First install g++ 5, because the tensorflow-gpu package installed with pip is compiled with g++ 5.4. On Ubuntu 18.04, you can install g++ 5 as follows:
$ sudo apt-get install g++-5

Clone the blocksparse repo and, in the Makefile, change "g++" to "g++-5" (without quotes). Now create a new virtual environment using conda and Python 3.6 (use 3.6, not 3.7). Activate this environment and install tensorflow-gpu using pip. Now you can compile blocksparse in this environment.
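The undefined symbol in the first post ends in `B5cxx11`, which is how the `[abi:cxx11]` tag appears in a mangled C++ name. That usually means the extension was compiled against the newer libstdc++ string ABI while the installed TensorFlow was not, which is consistent with the compiler-version fix described here. Below is a minimal sketch for checking both sides of that mismatch. The helper names `cxx11_abi_from_flags` and `symbol_uses_cxx11_abi` are hypothetical and are not part of blocksparse or TensorFlow; only `tf.sysconfig.get_compile_flags()` is a TensorFlow call.

```python
import re


def cxx11_abi_from_flags(flags):
    """Return 0 or 1 from a list of compiler flags, or None if the flag is absent."""
    for flag in flags:
        match = re.fullmatch(r"-D_GLIBCXX_USE_CXX11_ABI=([01])", flag)
        if match:
            return int(match.group(1))
    return None


def symbol_uses_cxx11_abi(mangled):
    """Return True if a mangled symbol carries the abi tag 'cxx11' (encoded as B5cxx11)."""
    return "B5cxx11" in mangled


if __name__ == "__main__":
    # Symbol copied from the error message in the first post.
    missing = "_ZNK10tensorflow8OpKernel4nameB5cxx11Ev"
    print("extension expects cxx11 ABI:", symbol_uses_cxx11_abi(missing))
    try:
        import tensorflow as tf
    except ImportError:
        print("tensorflow is not installed in this environment")
    else:
        # Flags the installed TensorFlow says custom ops should be built with.
        print("TensorFlow ABI flag:", cxx11_abi_from_flags(tf.sysconfig.get_compile_flags()))
```

If the extension expects the cxx11 ABI but TensorFlow reports `0`, the two were built with incompatible settings, and rebuilding the extension with a matching compiler or flag is the likely remedy.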

krishnadubba avatar May 22 '19 14:05 krishnadubba

In case this helps anyone, I created the following Dockerfile and instructions that worked for me:

Dockerfile (place this in root of the blocksparse repo):

FROM tensorflow/tensorflow:1.15.2-gpu-py3
RUN pip install --upgrade pip
RUN pip3 install tensorflow-gpu==1.13.1

# Need this to run the tests
RUN pip3 install networkx==2.5

ENV NCCL_VERSION=2.4.8-1+cuda10.0
RUN apt-get update && apt-get install -y --no-install-recommends \
  mpich \
  libmpich-dev \
  libnccl2=${NCCL_VERSION} \
  libnccl-dev=${NCCL_VERSION} \
  curl

# Make sure the linker knows where to look for things
ENV LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH}"

Instructions (you might need to run these commands with sudo).

NOTE:

  • commands prefixed by $ should be run in a shell on the host machine
  • commands prefixed by # should be run in an interactive shell in the docker container

  1. Build the image
$ docker image build -f Dockerfile --rm -t blocksparse:local .
  2. Start a docker container with an interactive terminal. Choose the relevant CPU/GPU option below.

CPU

  • the tests below will fail if you try to run them without GPU support
  • the ln command should be run inside the docker container
$ docker run -it --privileged -w /working_dir -v ${PWD}:/working_dir --rm blocksparse:local
# ln -s /usr/local/cuda/compat/libcuda.so /usr/lib/libcuda.so

GPU

$ docker run -it --gpus all --privileged -w /working_dir -v ${PWD}:/working_dir --rm blocksparse:local
  3. Compile (inside the docker container)
# make compile
  4. Install the compiled version (inside the docker container)
# pip3 install dist/*.whl
  5. Test the compiled version (inside the docker container)
# test/blocksparse_matmul_test.py
# test/blocksparse_conv_test.py
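The CPU and GPU `docker run` invocations above differ only in the `--gpus all` flag. As a rough sketch, a small wrapper can build either command from one place. The function name `docker_run_command` and its parameters are hypothetical and exist only for illustration; the flags themselves are copied from the commands above.

```python
import os
import shlex


def docker_run_command(use_gpu, image="blocksparse:local",
                       workdir="/working_dir", host_dir=None):
    """Build the docker run argument list for the CPU or GPU variant."""
    host_dir = host_dir or os.getcwd()  # plays the role of ${PWD} in the shell command
    cmd = ["docker", "run", "-it"]
    if use_gpu:
        # Only the GPU variant asks docker to expose the host GPUs.
        cmd += ["--gpus", "all"]
    cmd += ["--privileged", "-w", workdir,
            "-v", "{}:{}".format(host_dir, workdir), "--rm", image]
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    for use_gpu in (False, True):
        print(" ".join(shlex.quote(arg) for arg in docker_run_command(use_gpu)))
```

To actually start the container, the returned list can be passed to `subprocess.run`; the `ln -s` step for the CPU variant still has to be run inside the container.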

jlozano avatar Dec 30 '20 06:12 jlozano

At the "Choose the relevant CPU/GPU option" step, I got this error with the GPU option: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]. Any thoughts on this?

If I choose the CPU version, the error is: InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'FloatCast' used by node FloatCast (defined at <string>:5598) with these attrs: [TX=DT_FLOAT, dx_dtype=DT_FLOAT, TY=DT_HALF]. How can I fix this?

Victordongy avatar Mar 29 '21 02:03 Victordongy