[CUDA] illegal memory access when using CUDA and large max_bin and large dataset
Description
When using the CUDA histogram implementation from the master branch, the simple Python code below reports a memory error if a large max_bin size is used.
Reproducible example
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import lightgbm as lgbm
X,y = make_regression(n_samples=4000000, n_features=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = lgbm.LGBMRegressor(device="cuda", max_bin=300)
model.fit(X_train, y_train)
It then reports the following error:
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Info] Total Bins 15000
[LightGBM] [Info] Number of data points in the train set: 3000000, number of used features: 50
[LightGBM] [Info] Start training from score 0.023500
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/treelearner/cuda/cuda_data_partition.cu 987
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/io/cuda/cuda_tree.cpp 37
terminate called after throwing an instance of 'std::runtime_error'
what(): [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/io/cuda/cuda_tree.cpp 37
Aborted
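Since the crash only appears with a large max_bin (the default of 255 trains fine in my tests), a stopgap until the kernel bug is fixed is to clamp max_bin before constructing the estimator. This is just a workaround sketch based on the behavior above, not an official fix; `cap_max_bin` is a hypothetical helper, not part of the LightGBM API:

```python
def cap_max_bin(params, limit=255):
    """Return a copy of `params` with max_bin clamped to `limit`.

    Assumption: keeping max_bin at or below LightGBM's default of 255
    avoids the code path that crashes with the CUDA device.
    """
    safe = dict(params)
    if safe.get("max_bin", 255) > limit:
        safe["max_bin"] = limit
    return safe

params = cap_max_bin({"device": "cuda", "max_bin": 300})
print(params)  # {'device': 'cuda', 'max_bin': 255}
```

The clamped params can then be passed to `lgbm.LGBMRegressor(**params)` in the reproducer above.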
Environment info
GPU: NVIDIA GeForce RTX 3060
Python: 3.12.4
LightGBM version or commit hash: master branch
# FROM nvidia/cuda:8.0-cudnn5-devel
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
#################################################################################################################
# Global
#################################################################################################################
# Set DEBIAN_FRONTEND=noninteractive (and use apt-get install -y) so apt-get skips interactive post-install configuration steps
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ARG DEBIAN_FRONTEND=noninteractive
#################################################################################################################
# Global Path Setting
#################################################################################################################
ENV CUDA_HOME /usr/local/cuda
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/local/lib
ENV OPENCL_LIBRARIES /usr/local/cuda/lib64
ENV OPENCL_INCLUDE_DIR /usr/local/cuda/include
#################################################################################################################
# TINI
#################################################################################################################
# Install tini
ENV TINI_VERSION v0.14.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini
#################################################################################################################
# SYSTEM
#################################################################################################################
# update: downloads the package lists from the repositories and "updates" them to get information on the newest versions of packages and their
# dependencies. It will do this for all repositories and PPAs.
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
curl \
bzip2 \
ca-certificates \
libglib2.0-0 \
libxext6 \
libsm6 \
libxrender1 \
git \
vim \
mercurial \
subversion \
cmake \
libboost-dev \
libboost-system-dev \
libboost-filesystem-dev \
gcc \
g++
# Add OpenCL ICD files for LightGBM
RUN mkdir -p /etc/OpenCL/vendors && \
echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
#################################################################################################################
# CONDA
#################################################################################################################
ARG CONDA_DIR=/opt/miniforge
# add to path
ENV PATH $CONDA_DIR/bin:$PATH
# Install miniforge
RUN echo "export PATH=$CONDA_DIR/bin:"'$PATH' > /etc/profile.d/conda.sh && \
curl -sL https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -o ~/miniforge.sh && \
/bin/bash ~/miniforge.sh -b -p $CONDA_DIR && \
rm ~/miniforge.sh
RUN conda config --set always_yes yes --set changeps1 no && \
conda create -y -q -n py3 numpy scipy scikit-learn jupyter notebook ipython pandas matplotlib
#################################################################################################################
# LightGBM
#################################################################################################################
RUN cd /usr/local/src && mkdir lightgbm && cd lightgbm && \
git clone --recursive https://github.com/microsoft/LightGBM && \
cd LightGBM && \
mkdir build && cd build && cmake -DUSE_CUDA=1 .. && make -j4 && cd ..
ENV PATH /usr/local/src/lightgbm/LightGBM:${PATH}
RUN /bin/bash -c "source activate py3 && cd /usr/local/src/lightgbm/LightGBM && sh ./build-python.sh install --precompile && source deactivate"
#################################################################################################################
# System CleanUp
#################################################################################################################
# apt-get autoremove: used to remove packages that were automatically installed to satisfy dependencies for some package and that are no more needed.
# apt-get clean: removes the aptitude cache in /var/cache/apt/archives. You'd be amazed how much is in there! the only drawback is that the packages
# have to be downloaded again if you reinstall them.
RUN apt-get autoremove -y && apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
conda clean -a -y
#################################################################################################################
# JUPYTER
#################################################################################################################
# password: keras
# password key: --NotebookApp.password='sha1:98b767162d34:8da1bc3c75a0f29145769edc977375a373407824'
# Add a notebook profile.
RUN mkdir -p -m 700 ~/.jupyter/ && \
echo "c.NotebookApp.ip = '*'" >> ~/.jupyter/jupyter_notebook_config.py
VOLUME /home
WORKDIR /home
# IPython
EXPOSE 8888
ENTRYPOINT [ "/tini", "--" ]
CMD /bin/bash -c "source activate py3 && jupyter notebook --allow-root --no-browser --NotebookApp.password='sha1:98b767162d34:8da1bc3c75a0f29145769edc977375a373407824' && source deactivate"
I am also encountering similar issues when using a large dataset with CUDA. I have verified this behavior in at least 3 different machines. Every time I get similar logs before the Python script or notebook crashes.
In my case, I have a dataset with 11 million Rows and close to 1 GB. I am unsure if large bins are the reason because it crashes even on default settings. Here's my small setup
fixed_params = {
"objective": "binary",
"metric": "auc",
"boosting_type": "gbdt",
"data_sample_strategy": "bagging",
"num_iterations": 5000,
"device_type": "cuda",
"random_state": 6241,
"force_row_wise": True,
"bagging_seed": 113,
"early_stopping_rounds": 100,
"verbose": 2,
}
gbm = lightgbm.train(
    fixed_params,  # params must be passed positionally; **fixed_params before a positional argument is a SyntaxError
    train_pool,
    valid_sets=[valid_pool],
    valid_names=['valid'],
)
Here's the LGBM log before it crashes
Here is my environment info:
- Driver Version: 535.104.05, CUDA Version: 12.2
- lightgbm==4.4.0, but I have verified that this behavior is the same in v4.2.0
- T4 GPU on Colab with 15 GB of GPU RAM
Could you share your dataset? I would like to try to reproduce and solve this issue.
I have this problem too. My training data is about 500 MB, and I configured the environment following the steps on the official website. The GPU version told me I was out of memory, so I switched to the CUDA version, and now the CUDA version reports this error. How should I handle this?
Versions:
lightgbm: 4.5.0.99
ubuntu: 24
python: 3.10
cuda: 12.2
gpu: 2080ti 12gb
Error:
[flaml.automl.logger: 11-15 12:16:27] {1739} INFO - Evaluation method: cv
[flaml.automl.logger: 11-15 12:16:27] {1838} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 11-15 12:16:27] {1955} INFO - List of ML learners in AutoML Run: ['gpulgbm']
[flaml.automl.logger: 11-15 12:16:27] {2258} INFO - iteration 0, current learner gpulgbm
[flaml.automl.logger: 11-15 12:16:29] {2393} INFO - Estimated sufficient time budget=19567s. Estimated necessary time budget=20s.
[flaml.automl.logger: 11-15 12:16:29] {2442} INFO - at 3.5s, estimator gpulgbm's best error=0.2830, best estimator gpulgbm's best error=0.2830
[flaml.automl.logger: 11-15 12:16:29] {2258} INFO - iteration 1, current learner gpulgbm
[flaml.automl.logger: 11-15 12:16:31] {2442} INFO - at 5.1s, estimator gpulgbm's best error=0.2830, best estimator gpulgbm's best error=0.2830
[flaml.automl.logger: 11-15 12:16:31] {2258} INFO - iteration 2, current learner gpulgbm
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/yuhr/桌面/LightGBM/src/treelearner/cuda/cuda_data_partition.cu 987
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/yuhr/桌面/LightGBM/src/io/cuda/cuda_tree.cpp 37
terminate called after throwing an instance of 'std::runtime_error'
what(): [CUDA] an illegal memory access was encountered /home/yuhr/桌面/LightGBM/src/io/cuda/cuda_tree.cpp 37
I'm currently running the example below on a server with a GPU and CUDA. I will try to locate and solve this problem in the near future.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import lightgbm as lgbm
X, y = make_regression(n_samples=4000000, n_features=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = lgbm.LGBMRegressor(device="cuda", max_bin=300)
model.fit(X_train, y_train)
Thanks for reporting this issue. With large max_bin values, kernels that construct histograms in global memory may be used, and that code path has not been tested heavily. I'm debugging this.
I've identified the root of this issue. I will create a PR to fix this soon.
I didn't find a related PR - maybe it's not created yet? Could you provide any guidance on how we might address this issue at the moment?
I've identified the root of this issue. I will create a PR to fix this soon.
@shiyu1994 is this issue addressed in any of the open PRs?
If not, could you please put up a fix? Or if you don't have time, could you describe the problem in enough detail here that someone else could try to put up a fix?