Check failed: (best_split_info.left_count) > (0)
Hi, I found a bug when training with large X_train.
LightGBM GPU version: 3.3.2, CUDA 11.1, CentOS, RAM: 2 TB, GPU: A100-40G
When X_train is larger than (18,000,000, 1000), the GPU build fails with an error like this: [LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at LightGBM/src/treelearner/serial_tree_learner.cpp, line 686
When I used LightGBM 3.2.1, I had the same problem as #4480: once more than 8.3 GB of GPU memory is used, I get a Memory Object Allocation Failure.
With LightGBM 3.3.2, it cannot load more than 17 GB of GPU memory (the GPU has 40 GB). It looks like something goes wrong in the tree split step, and it only happens once more than 17 GB of GPU memory is in use.
A colleague of mine has the same problem with LightGBM 3.3.1.
Thanks very much for using LightGBM!
I suspect that this is the same as other issues that have been reported (e.g. #4739, #3679), but it's hard to say without more details.
Are you able to provide any (and hopefully all) of the following?
- specific commands you used to install LightGBM
- the most minimal possible complete version of the code you're running which can reproduce this issue
That would be very helpful. Without such information and with only an error message, significant investigation will probably be required to figure out why you encountered this error.
Thanks for your reply!
I built from source with these commands: mkdir build ; cd build ; cmake -DUSE_GPU=1 .. && make -j
Sorry, I can't paste my code because it is on my company machine and I can't take screenshots or copy it off, but I can describe it:
X = np.random.rand(18000000, 1000).astype('float32')
y = np.random.rand(18000000).astype('float32')
model = LGBMRegressor(**params)
model.fit(X, y)
That's ok, completely understand that the code might be sensitive.
Are you able to share the values for params? The configuration you're using might give us some clues to help narrow this down.
params = {
    'n_estimators': 500,
    'learning_rate': 0.05,
    'subsample_freq': 6,
    'subsample': 0.91,
    'colsample_bytree': 0.83,
    'colsample_bynode': 0.78,
    'num_leaves': 64,
    'max_depth': 8,
    'reg_alpha': 9,
    'reg_lambda': 3.5,
    'min_child_samples': 200,
    'min_child_weight': 88,
    'max_bin': 71,
    'enable_sparse': False,
    'device_type': "gpu",
    'gpu_use_dp': False,
}
Hello, could you find a GPU with more than 17 GB of memory (like a V100-32G or A100-40G), generate a random dataset, and try to reproduce my code?
I personally don't have easy access to hardware like that. I might try at some point to get a VM from a cloud provider and work on some of the open GPU-specific issues in this project, but can't commit to that. Maybe some other maintainer or contributor will be able to help you.
I have an A100-40G and the above code fails. Interestingly, when I run it on 9M rows instead of 18M it doesn't fail. A quick calculation shows that 9M rows is just below the 40 GB mark, so this might or might not be related. In my own code I get this error unless I lower num_leaves, and I have to tweak it again if I change max_bin, if that helps. This is also company property, so I won't be able to share the data or code. Other than that, I will help any way I can.
I shared some details on the issue in another thread: https://github.com/microsoft/LightGBM/issues/2793#issuecomment-1305260660
I was able to reproduce this.
Details are below, but running @chixujohnny's sample code:
[LightGBM] [Info] Number of data points in the train set: 18000000, number of used features: 1000
[LightGBM] [Info] Using requested OpenCL platform 0 device 7
[LightGBM] [Info] Using GPU Device: NVIDIA A100-SXM4-40GB, Vendor: NVIDIA Corporation
dies with
[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at /home/me/src/LightGBM/src/treelearner/serial_tree_learner.cpp, line 682
If I switch from OpenCL to CUDA ('device_type': 'cuda'), then CUDA dies reporting out of memory.
Note: although CUDA reports out of memory, execution continues with repeated fatal "out of memory" log messages and finally ends in a segfault, so the CUDA tree learner also needs more robust error checking.
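For reference, that CUDA run used the same script shown below with only the device parameter changed (a sketch of the one-line difference):
# switch from the OpenCL tree learner ('gpu') to the CUDA one
params['device_type'] = 'cuda'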
Details (for the curious):
# this recreates the error
import numpy as np
from lightgbm import LGBMRegressor
import sys
params = {
'n_estimators': 500,
'learning_rate': 0.05,
'subsample_freq': 6,
'subsample': 0.91,
'colsample_bytree': 0.83,
'colsample_bynode': 0.78,
'num_leaves': 64,
'max_depth': 8,
'reg_alpha': 9,
'reg_lambda': 3.5,
'min_child_samples': 200,
'min_child_weight': 88,
'max_bin': 71,
'enable_sparse': False,
'device_type': "gpu",
'gpu_use_dp': False,
'gpu_platform_id': 0,
'gpu_device_id': 7,
}
n = 18*1000*1000
m = 1000
if len(sys.argv) == 2:
    n = int(sys.argv[1])
samples = n*m
print(f"N: {n:,}, Samples: {samples:,}")
# pre-generated data (see below)
X = np.memmap('X.mmap', dtype='float32', mode='r', shape=(n,m))
y = np.memmap('y.mmap', dtype='float32', mode='r', shape=(n,))
model = LGBMRegressor(**params)
model.fit(X, y)
# building LightGBM
#build boost
wget https://boostorg.jfrog.io/artifactory/main/release/1.81.0/source/boost_1_81_0.tar.gz
tar xvf boost_1_81_0.tar.gz
cd boost_1_81_0
./bootstrap.sh --prefix=/home/me/install
./b2 install
#build LightGBM
# version
#commit e4231205a3bac13662a81db9433ddaea8924fbce (HEAD -> master, origin/master, origin/HEAD)
#Author: James Lamb <[email protected]>
#Date: Tue Feb 28 23:35:20 2023 -0600
#
# [python-package] use keyword arguments in predict() calls (#5755)
cmake .. -DUSE_CUDA=YES -DUSE_GPU=YES -DBOOST_ROOT:PATHNAME=/home/me/install/
#install python
conda create -n LGBM python=3.9 numpy scipy scikit-learn
conda activate LGBM
export PYTHONPATH=$(pwd)/python-package
# data generation
# why? I hit a weird issue where generating random numbers goes from ~3 ns per sample for a few million samples
# to ~200 ns per sample for 18e9 samples. The issue is mitigated by generating into memmapped memory (~10 ns per sample).
# A mystery for another day...
import numpy as np
from lightgbm import LGBMRegressor
import sys
from time import time_ns
n = 18*1000*1000
m = 1000
if len(sys.argv) == 2:
    n = int(sys.argv[1])
samples = n*m
print(f"N: {n:,}, Samples: {samples:,}")
rng = np.random.default_rng()
print("generating data...")
s = time_ns()
X = np.memmap('X.mmap', dtype='float32', mode='w+', shape=(n,m))
rng.random((n, m), 'float32', out=X)  # generate directly into the memmapped array
e = time_ns()
print(f" {(e-s)/samples:.2f} ns per sample, {(e-s)/1e9:.3f} s elapsed")
print("flushing")
X.flush()
y = np.memmap('y.mmap', dtype='float32', mode='w+', shape=(n,))
rng.random((n, ), 'float32', out=y)
y.flush()
Hi @jameslamb and others,
I've attached a log that demonstrates the [LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) error.
(with GPU_DEBUG=5, 'verbose': 3, 'seed': 1234, 'gpu_use_dp': True; the full script is below).
It is possible for me to deterministically recreate this error.
I was wondering if you had any pointers about how to go about debugging this further.
- Is left_count == 0 usually indicative of a NaN or Inf somewhere?
- GPUTreeLearner::ConstructHistograms and SerialTreeLearner::FindBestSplitsFromHistograms seem to be the places to look, but I'm not entirely sure what to look for...
Thanks in advance for any advice!
code for above
import numpy as np
from lightgbm import LGBMRegressor
import sys
from time import time_ns
params = {
'n_estimators': 500,
'learning_rate': 0.05,
'subsample_freq': 6,
'subsample': 0.91,
'colsample_bytree': 0.83,
'colsample_bynode': 0.78,
'num_leaves': 64,
'max_depth': 8,
'reg_alpha': 9,
'reg_lambda': 3.5,
'min_child_samples': 200,
'min_child_weight': 88,
'max_bin': 71,
'enable_sparse': False,
'device_type': "gpu",
'gpu_use_dp': True,
'gpu_platform_id': 0,
'gpu_device_id': 7,
'seed': 1234,
'verbose': 3
}
n = 18*1000*1000
m = 1000
if len(sys.argv) == 2:
    n = int(sys.argv[1])
samples = n*m
print(f"N: {n:,}, Samples: {samples:,}")
X = np.memmap('/raid/scratch/nehalp/X64.mmap', dtype='float64', mode='r', shape=(n,m))
y = np.memmap('/raid/scratch/nehalp/y64.mmap', dtype='float64', mode='r', shape=(n,))
model = LGBMRegressor(**params)
print("Calling fit...")
model.fit(X, y)
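A chunked finiteness check like this could rule out NaN/Inf in the memmapped inputs without loading the full arrays into RAM (a sketch; all_finite is just an ad-hoc helper, not part of the run logged above):
import numpy as np

def all_finite(arr, chunk_rows=100_000):
    # walk the memmapped array in row chunks so the full 18M x 1000 array is never materialized at once
    for start in range(0, arr.shape[0], chunk_rows):
        if not np.isfinite(arr[start:start + chunk_rows]).all():
            return False
    return True

print("X finite:", all_finite(X))
print("y finite:", all_finite(y))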
I was able to successfully run the large dataset with a change to src/treelearner/ocl/histogram256.cl. I wanted to see if there was some sort of type difference between the C++ and the OpenCL code. The function is defined as follows (with #ifdef constants removed from this post for clarity):
__kernel void histogram256(
__global const uchar4* feature_data_base,
__constant const uchar4* restrict feature_masks __attribute__((max_constant_size(65536))),
const data_size_t feature_size,
__global const data_size_t* data_indices,
const data_size_t num_data,
const score_t const_hessian,
__global const score_t* ordered_gradients, // <----- change to : __global const * ordered_gradients
__global char* restrict output_buf,
__global volatile int * sync_counters,
__global acc_type* restrict hist_buf_base
)
However, if you redefine ordered_gradients as __global const * ordered_gradients, the context fills in the type, and the large training set runs. At first, I thought score_t was defined differently in the OpenCL code and in C++, but I verified that both are floats. To validate the results, I ran two smaller dummy datasets, once with the type of ordered_gradients declared explicitly and once without, compared the resulting model files, and found that they were identical.
It's not yet clear to me why the change allows the program to finish and I am investigating this.
Thank you so much for the help! Whenever you feel you've identified the root cause, if you'd like to open a pull request we'd appreciate it, and can help with the contribution and testing process.
I have the same issue. I changed the histogram256.cl file as advised in this thread, but the issue still persists. Surprisingly, if I add is_unbalance=true to my config, it trains the model without any problem. I run a huge dataset of 14 GB on GPU. Can someone explain why adding is_unbalance=true made this work?
That's interesting. I'll see if I can mimic that later this week when I have more bandwidth. I did trace down where that parameter is used: binary_objective.hpp, line 93:
if (is_unbalance_ && cnt_positive > 0 && cnt_negative > 0) {
  if (cnt_positive > cnt_negative) {
    label_weights_[1] = 1.0f;
    label_weights_[0] = static_cast<double>(cnt_positive) / cnt_negative;
I still get the same error without modifying the .cl file, using this config. Let me know if yours differs, @Bhuvanamitra:
task = train
objective = regression
boosting_type = gbdt
metric = l2
num_leaves = 64
max_depth = 8
learning_rate = 0.05
n_estimators = 500
subsample_freq = 6
subsample = 0.91
colsample_bytree = 0.83
colsample_bynode = 0.78
min_child_samples = 200
min_child_weight = 88
max_bin = 71
enable_sparse = false
device_type = gpu
gpu_use_dp = false
gpu_platform_id = 0
gpu_device_id = 7
is_unbalance = true
This error seems to occur during the tree building process, indicating a situation where a split was found but the left child of the split doesn't contain any data.
As a workaround, I found that setting min_split_gain to 1 avoids this issue. However, I understand that this solution might not be suitable for all use cases, as it could potentially make the model more conservative about creating new splits, which might not yield the best results in all scenarios.
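Concretely, the workaround just means passing min_split_gain in the parameters, roughly like this (a sketch with tiny synthetic data, assuming a GPU build of LightGBM is installed; the other parameters are whatever you already use):
import numpy as np
from lightgbm import LGBMRegressor

# tiny synthetic data just to show where the parameter goes;
# the real problem only appears at much larger scale on GPU
X = np.random.rand(10_000, 20).astype('float32')
y = np.random.rand(10_000).astype('float32')

params = {
    'device_type': 'gpu',
    'min_split_gain': 1,  # the workaround described above
}
model = LGBMRegressor(**params)
model.fit(X, y)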
I wanted to bring this to your attention and see if there might be a more general solution to this issue. Any insights or suggestions would be greatly appreciated.
Hi @tolleybot
I'm able to reproduce this issue with the example from AutoGluon (https://auto.gluon.ai/stable/tutorials/tabular/tabular-quick-start.html) after manually upgrading lightgbm to version 4.0.0. I've run it on Google Colab CPU.
Unfortunately, the min_split_gain trick doesn't work here, as I got this error:
Fitting model: LightGBM ...
Warning: Exception caused LightGBM to fail during training... Skipping this model.
Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .
Detailed Traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/autogluon/core/trainer/abstract_trainer.py", line 1733, in _train_and_save
model = self._train_single(X, y, model, X_val, y_val, total_resources=total_resources, **model_fit_kwargs)
File "/usr/local/lib/python3.10/dist-packages/autogluon/core/trainer/abstract_trainer.py", line 1684, in _train_single
model = model.fit(X=X, y=y, X_val=X_val, y_val=y_val, total_resources=total_resources, **model_fit_kwargs)
File "/usr/local/lib/python3.10/dist-packages/autogluon/core/models/abstract/abstract_model.py", line 829, in fit
out = self._fit(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/autogluon/tabular/models/lgb/lgb_model.py", line 194, in _fit
self.model = train_lgb_model(early_stopping_callback_kwargs=early_stopping_callback_kwargs, **train_params)
File "/usr/local/lib/python3.10/dist-packages/autogluon/tabular/models/lgb/lgb_utils.py", line 124, in train_lgb_model
return lgb.train(**train_params)
File "/usr/local/lib/python3.10/dist-packages/lightgbm/engine.py", line 266, in train
booster.update(fobj=fobj)
File "/usr/local/lib/python3.10/dist-packages/lightgbm/basic.py", line 3557, in update
_safe_call(_LIB.LGBM_BoosterUpdateOneIter(
File "/usr/local/lib/python3.10/dist-packages/lightgbm/basic.py", line 237, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .
No base models to train on, skipping auxiliary stack level 2...
Note that this time it complains about right_count.
Same issue. Is there any way to avoid it, like changing some parameters?
I had this problem and resolved it by setting min_child_samples to 1 and min_child_weight to 1/X_train.shape[0]. I received the error when they were both set to 0.
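Concretely, that looks roughly like this (a sketch with synthetic data standing in for my real X_train; the point is only that neither parameter is 0):
import numpy as np
from lightgbm import LGBMRegressor

X_train = np.random.rand(10_000, 20).astype('float32')
y_train = np.random.rand(10_000).astype('float32')

model = LGBMRegressor(
    min_child_samples=1,                    # was 0 when I hit the error
    min_child_weight=1 / X_train.shape[0],  # was 0 when I hit the error
)
model.fit(X_train, y_train)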
I set min_child_samples to 1 and min_child_weight to 1/X_train.shape[0], but that doesn't solve this error for me.
Is this problem solved in version 4.0.0 or any later version?
I still have this issue using version 4.3.0. Anyone found a solution? Anything I can do to help solve the problem?
I updated from 3.3.2 (the version where I had this problem) to 4.3.0 and did not recompile anything (so my previous GPU-related build artifacts came from 3.3.2, built with cmake). I just ran the previous code directly and the problem disappeared. I'm not sure what happened exactly.
@jameslamb Hello, this issue has been troubling me for many years. Currently, my training data far exceeds 64 GB, yet I can only sample a dataset of about 64 GB, which inevitably affects the model's generalization to some extent. Could you please schedule a fix for this? Many people have reported this problem, but it remains unresolved despite version updates.
I will give it a try to see if it resolves the issue and get back to you in the near future. This problem has truly been a source of long-standing distress for me.
Do you have an NVIDIA GPU? If so, please try the CUDA version of LightGBM instead.
Instructions for that build:
- https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version
- https://github.com/microsoft/LightGBM/tree/master/python-package#build-cuda-version
To use it, pass {"device": "cuda"} in params.
That version is more actively maintained and faster, and might not suffer from this issue.
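A minimal sketch of what that looks like from the Python package, assuming a CUDA-enabled LightGBM build is installed:
import numpy as np
from lightgbm import LGBMRegressor

X = np.random.rand(100_000, 50).astype('float32')
y = np.random.rand(100_000).astype('float32')

# "device": "cuda" selects the CUDA tree learner;
# "gpu" is the OpenCL-based learner that this issue is about
model = LGBMRegressor(device="cuda", n_estimators=50)
model.fit(X, y)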
Regarding "Many people have reported this problem, but it remains unresolved despite version updates":
The OpenCL-based GPU version of LightGBM is effectively unmaintained right now.
- @shiyu1994 and others at Microsoft seem to have been focusing exclusively on the CUDA implementation.
- @huanzhang12 , the original author of the OpenCL-based version (#368), has not engaged with this project in several years.
- @tolleybot was looking into this particular issue last year, but didn't get to the point of submitting any PRs
For those "many people" watching this, here's how you could help:
- provide a clear, minimal, reproducible example that always triggers this error
- including the type of GPU you have, version of LightGBM and all dependencies, fully self-contained code that uses public or synthetic data
- for help with that: https://stackoverflow.com/help/minimal-reproducible-example
- if you understand OpenCL, please come work on updating the -DUSE_GPU=ON build of LightGBM here, and investigate this issue
Thank you very much, I'll give it a try