NBFNet

[Feature Request] `Dockerfile` / `environment.yml` for better reproducibility

Open SauravMaheshkar opened this issue 4 years ago • 7 comments

Congratulations to the authors on NeurIPS'21! Looking forward to your talk at LoGaG.


While installing the project on VMs and local systems, I've run into multiple issues getting the correct package versions installed, from CUDA errors while installing torch-scatter and torchdrug to pybind11 issues. A Dockerfile would help prevent such errors and make reproduction and experimentation easier.
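One possible cause of the torch-scatter errors (an assumption, not something confirmed in this thread) is installing a wheel built for a different PyTorch/CUDA pair than the one in the environment. The sketch below builds the wheel index URL in the same form as the one used in the Dockerfile later in this thread; `pyg_wheel_index` is a hypothetical helper name, not part of any library.

```python
from typing import Optional


def pyg_wheel_index(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build the wheel index URL for a given PyTorch / CUDA pair.

    torch_version may carry a local suffix such as "1.10.0+cu113";
    cuda_version is e.g. "11.3", or None for a CPU-only build.
    """
    # Drop any local version suffix so only "1.10.0" remains.
    base = torch_version.split("+")[0]
    # "11.3" -> "cu113"; no CUDA -> "cpu".
    tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{base}+{tag}.html"


if __name__ == "__main__":
    # With PyTorch installed, the two inputs could come from
    # torch.__version__ and torch.version.cuda.
    print(pyg_wheel_index("1.10.0", "11.3"))
```

The resulting URL would be passed to `pip install torch-scatter -f <url>`, so the wheel matches whatever PyTorch build is already present.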

I think it would be cleaner to have a Docker image for torchdrug itself, which the NBFNet image could then use as its base image. I'm more than happy to take this up.

This way one could also use the NVIDIA Container Toolkit to run experiments across multiple GPUs/nodes easily.

SauravMaheshkar, Jan 03 '22 20:01

Hi! That's a great suggestion! We can add an environment.yml for NBFNet soon. As for the Docker image, we are not so familiar with the steps, so it will probably take some time to figure out. Like you said, a Docker image makes it easy to launch experiments across multiple nodes, so we will definitely add one for torchdrug.

KiddoZhu, Jan 04 '22 19:01

I just built this Dockerfile over on my fork of the repository.

# syntax=docker/dockerfile:1.2
# To build the image use :-
# $ DOCKER_BUILDKIT=1 docker build .
FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime

# metainformation
LABEL version="0.0.1"
LABEL maintainer="Saurav Maheshkar"

# Helpers
ARG DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

WORKDIR /code
COPY . .

RUN pip3 install --no-cache-dir --upgrade pip setuptools wheel
RUN pip3 install --no-cache-dir torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
RUN pip3 install --no-cache-dir torchdrug
RUN pip3 install --no-cache-dir -r requirements.txt

RUN find /opt/conda/lib/ -follow -type f -name '*.a' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.pyc' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.txt' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.mc' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.js.map' -delete \
    && find /opt/conda/lib/ -name '*.c' -delete \
    && find /opt/conda/lib/ -name '*.pxd' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.md' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.png' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.jpg' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.jpeg' -delete \
    && find /opt/conda/lib/ -name '*.pyd' -delete \
    && find /opt/conda/lib/ -type d -name '__pycache__' -prune -exec rm -rf {} +

ENV PATH /opt/conda/bin:$PATH

Thoughts on this @KiddoZhu ?

SauravMaheshkar, Jan 10 '22 20:01

Thanks for the recipe. I just learned some basics of Docker. It looks like I can't import torchdrug correctly with this Dockerfile: it says libXrender.so, which is required by rdkit, is missing. Besides, the JIT compilation used in torchdrug relies on nvcc, so I guess we need a development (devel) version of the PyTorch image. I will figure it out.
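A quick way to check for the two missing pieces mentioned above from inside a container is sketched here, using only the standard library. `check_runtime_deps` is a hypothetical helper name, and `ctypes.util.find_library` is a heuristic lookup, so a negative result is a hint rather than proof.

```python
import ctypes.util
import shutil


def check_runtime_deps() -> dict:
    """Report whether libXrender and nvcc can be located on this system."""
    return {
        # Shared library reported missing when importing torchdrug (via rdkit).
        "libXrender": ctypes.util.find_library("Xrender") is not None,
        # CUDA compiler that torchdrug's JIT compilation relies on.
        "nvcc": shutil.which("nvcc") is not None,
    }


if __name__ == "__main__":
    for name, found in check_runtime_deps().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Running this in a candidate image before attempting `import torchdrug` would show which of the two dependencies still needs to be added.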

KiddoZhu, Jan 15 '22 22:01

Here is my Dockerfile for torchdrug. We have to rely on the development version of the PyTorch image to use JIT in torchdrug (which NBFNet also requires).

FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel

RUN apt-get update && \
    apt-get install -y libxrender1 && \
    rm -rf /var/lib/apt/lists/*

RUN pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.1+cu111.html  && \
    pip install torchdrug

I am not familiar with how to reduce the size of the image. Your find ... -delete commands look a little unsafe to me. Have you tested them? @SauravMaheshkar

KiddoZhu, Jan 16 '22 17:01

Yes, I have tested them, but in my experience they don't contribute much towards decreasing the image size. It might be better to just use multi-stage builds, maybe something like:

# Builder Image
FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel AS builder

....

# Runner Image
FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel AS runner

...

The find ... -delete commands reduce the image size by 5 to 15 MB. It might be better to inspect the image with a tool like dive.

There's a hyper-optimized Dockerfile I work with, which can be found here.
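The size impact of the find ... -delete pass can be estimated without deleting anything. The sketch below is a dry run, not the method used in this thread: it sums the sizes of files whose names end with the suffixes from the earlier Dockerfile. `prunable_bytes` is a hypothetical helper name. Unlike `find -follow`, it does not descend into symlinked directories, so the number is an approximation.

```python
import os
from pathlib import Path

# Suffixes taken from the find ... -delete commands earlier in the thread.
SUFFIXES = (".a", ".pyc", ".txt", ".mc", ".js.map", ".c", ".pxd",
            ".md", ".png", ".jpg", ".jpeg", ".pyd")


def prunable_bytes(root: str, suffixes=SUFFIXES) -> int:
    """Sum the sizes of regular files under root whose names end with a suffix."""
    total = 0
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffixes):
                path = Path(dirpath) / name
                # Skip broken symlinks and other non-regular entries.
                if path.is_file():
                    total += path.stat().st_size
    return total


if __name__ == "__main__":
    size_mb = prunable_bytes("/opt/conda/lib") / 1e6
    print(f"Files matching the prune patterns: {size_mb:.1f} MB")
```

The same listing also shows which files would be removed; the `*.txt` pattern in particular can match files that installed packages read at runtime, which is one way to check the safety concern raised above.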

SauravMaheshkar, Mar 22 '22 15:03

Any updates, @KiddoZhu?

SauravMaheshkar, May 16 '22 16:05

Might I also suggest adding ENV PATH /opt/conda/bin:$PATH at the end of the Dockerfile? The PyTorch Docker image uses conda to manage the Python interpreter; without this line, the underlying libraries aren't accessible by default.
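Whether the conda directory is actually on PATH inside a running container can be checked with a few lines of standard-library Python. This is a minimal sketch; `dir_on_path` is a hypothetical helper name.

```python
import os
import shutil


def dir_on_path(directory: str, path_env: str) -> bool:
    """Return True if directory is one of the entries of a PATH-style string."""
    entries = [os.path.normpath(p) for p in path_env.split(os.pathsep) if p]
    return os.path.normpath(directory) in entries


if __name__ == "__main__":
    conda_bin = "/opt/conda/bin"
    print("conda bin on PATH:", dir_on_path(conda_bin, os.environ.get("PATH", "")))
    # Shows which interpreter a bare `python` command would run.
    print("python resolves to:", shutil.which("python"))
```

If the first line prints False, or `python` resolves to a system interpreter instead of the conda one, the ENV PATH line is likely needed.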

SauravMaheshkar, Jun 07 '22 21:06