
Check failed: (best_split_info.left_count) > (0)

Open chixujohnny opened this issue 4 years ago • 28 comments

Hi, I found a bug when training with large X_train.

lgb-gpu version: 3.3.2 CUDA=11.1 CentOS ram=2TB GPU=A100-40G

When X_train is larger than (18,000,000, 1000), lgb-gpu fails like this: [LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at LightGBM/src/treelearner/serial_tree_learner.cpp, line 686

When I use LGB==3.2.1, I have the same problem as #4480: when GPU memory usage exceeds 8.3 GB, I get a Memory Object Allocation Failure.

In LGB version==3.3.2, LGB can't load more than 17 GB of GPU memory (the GPU has 40 GB). It seems like some problem occurs in the tree-split step, and it only happens when more than 17 GB of GPU memory is loaded.

Another colleague has the same problem; his lgb version is 3.3.1.

chixujohnny avatar Jan 13 '22 03:01 chixujohnny

Thanks very much for using LightGBM!

I suspect that this is the same as other issues that have been reported (e.g. #4739, #3679), but it's hard to say without more details.

Are you able to provide any (and hopefully all) of the following?

  • specific commands you used to install LightGBM
  • the most minimal possible complete version of the code you're running which can reproduce this issue

That would be very helpful. Without such information and with only an error message, significant investigation will probably be required to figure out why you encountered this error.

jameslamb avatar Jan 13 '22 03:01 jameslamb

Thanks for your reply!

I built from source with this command: mkdir build ; cd build ; cmake -DUSE_GPU=1 .. && make -j

Sorry, I can't paste my code because it is on my company machine. I can't take screenshots or copy the code, but I can describe it:

X = np.random.rand(18000000, 1000).astype('float32')
y = np.random.rand(18000000).astype('float32')
model = LGBMRegressor(**params)
model.fit(X, y)

chixujohnny avatar Jan 13 '22 03:01 chixujohnny

That's ok, completely understand that the code might be sensitive.

Are you able to share the values for params? Configuration you're using might give us some clues to help narrow this down.

jameslamb avatar Jan 13 '22 03:01 jameslamb

params = {
    'n_estimators': 500,
    'learning_rate': 0.05,
    'subsample_freq': 6,
    'subsample': 0.91,
    'colsample_bytree': 0.83,
    'colsample_bynode': 0.78,
    'num_leaves': 64,
    'max_depth': 8,
    'reg_alpha': 9,
    'reg_lambda': 3.5,
    'min_child_samples': 200,
    'min_child_weight': 88,
    'max_bin': 71,
    'enable_sparse': False,
    'device_type': "gpu",
    'gpu_use_dp': False
}

chixujohnny avatar Jan 13 '22 04:01 chixujohnny

That's ok, completely understand that the code might be sensitive.

Are you able to share the values for params? Configuration you're using might give us some clues to help narrow this down.

Hello, can you find a GPU with more than 17 GB of memory (like a V100-32G or A100-40G) and generate a random dataset to reproduce my code?

chixujohnny avatar Jan 13 '22 06:01 chixujohnny

I personally don't have easy access to hardware like that. I might try at some point to get a VM from a cloud provider and work on some of the open GPU-specific issues in this project, but can't commit to that. Maybe some other maintainer or contributor will be able to help you.

jameslamb avatar Jan 15 '22 02:01 jameslamb

That's ok, completely understand that the code might be sensitive. Are you able to share the values for params? Configuration you're using might give us some clues to help narrow this down.

Hello, can you find a GPU with more than 17 GB of memory (like a V100-32G or A100-40G) and generate a random dataset to reproduce my code?

I have an A100-40G and the above code fails. Interestingly, when I run it on 9M rows instead of 18M, it doesn't fail. A quick calculation shows that 9M rows is just below the 40 GB mark, so this might be related, or might not. In my own code I get this error unless I lower num_leaves, and I have to tweak it again if I change max_bin, if this helps. This is also company property, so I won't be able to share the data or code. Other than that, I will help any way I can.
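
For reference, a rough version of that calculation (a sketch that counts only the raw dense float32 matrix, not LightGBM's internal bin/histogram buffers):

# rough size of the raw float32 feature matrix (4 bytes per value)
def raw_size_gb(rows, cols, bytes_per_value=4):
    return rows * cols * bytes_per_value / 1e9

print(raw_size_gb(9_000_000, 1000))    # ~36 GB, just under a 40 GB A100
print(raw_size_gb(18_000_000, 1000))   # ~72 GB, well above it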

lironle6 avatar Jan 18 '22 20:01 lironle6

Shared some details on the issue in another topic: https://github.com/microsoft/LightGBM/issues/2793#issuecomment-1305260660

pavlexander avatar Nov 07 '22 08:11 pavlexander

I was able to reproduce this.

Details are below, but running @chixujohnny's sample code:

[LightGBM] [Info] Number of data points in the train set: 18000000, number of used features: 1000
[LightGBM] [Info] Using requested OpenCL platform 0 device 7
[LightGBM] [Info] Using GPU Device: NVIDIA A100-SXM4-40GB, Vendor: NVIDIA Corporation

dies with [LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at /home/me/src/LightGBM/src/treelearner/serial_tree_learner.cpp, line 682

If I switch from OpenCL to CUDA ('device_type': 'cuda'), then CUDA dies reporting out of memory. Note: while CUDA does report out of memory, execution continues with repeated FATAL "out of memory" log messages and finally ends in a segfault -- so the CUDA tree learner also needs some more robust error checking.

Details (for the curious):

# this recreates the error
import numpy as np
from lightgbm import LGBMRegressor
import sys

params = {
    'n_estimators': 500,
    'learning_rate': 0.05,
    'subsample_freq': 6,
    'subsample': 0.91,
    'colsample_bytree': 0.83,
    'colsample_bynode': 0.78,
    'num_leaves': 64,
    'max_depth': 8,
    'reg_alpha': 9,
    'reg_lambda': 3.5,
    'min_child_samples': 200,
    'min_child_weight': 88,
    'max_bin': 71,
    'enable_sparse': False,
    'device_type': "gpu",
    'gpu_use_dp': False,
    'gpu_platform_id': 0,
    'gpu_device_id': 7,
}

n = 18*1000*1000
m = 1000
if len(sys.argv) == 2:
    n = int(sys.argv[1])
samples = n*m
print(f"N: {n:,}, Samples: {samples:,}")
# pre-generated data (see below)
X = np.memmap('X.mmap', dtype='float32', mode='r', shape=(n,m))
y = np.memmap('y.mmap', dtype='float32', mode='r', shape=(n,))

model = LGBMRegressor(**params)
model.fit(X, y)

# building LightGBM
#build boost
wget https://boostorg.jfrog.io/artifactory/main/release/1.81.0/source/boost_1_81_0.tar.gz
tar xvf boost_1_81_0.tar.gz
cd boost_1_81_0
./bootstrap.sh --prefix=/home/me/install
./b2 install

#build LightGBM
# version
#commit e4231205a3bac13662a81db9433ddaea8924fbce (HEAD -> master, origin/master, origin/HEAD)
#Author: James Lamb <[email protected]>
#Date:   Tue Feb 28 23:35:20 2023 -0600
#
#   [python-package] use keyword arguments in predict() calls (#5755)

cmake .. -DUSE_CUDA=YES -DUSE_GPU=YES  -DBOOST_ROOT:PATHNAME=/home/me/install/

#install python
conda create -n LGBM python=3.9 numpy scipy scikit-learn
conda activate LGBM
export PYTHONPATH=$(pwd)/python-package
# data generation
# why? I hit some weird issue where generating random numbers goes from ~3ns per sample for a few million samples
# to ~200ns per sample for 18e9 samples. The issue is mitigated by generating into memmapped memory (~10ns per sample)
# mystery for another day....

import numpy as np
from lightgbm import LGBMRegressor
import sys
from time import time_ns

n = 18*1000*1000
m = 1000
if len(sys.argv) == 2:
    n = int(sys.argv[1])
samples = n*m
print(f"N: {n:,}, Samples: {samples:,}")

rng = np.random.default_rng()

print("generating data...")
s = time_ns()
X = rng.random((n,m), 'float32')

X = np.memmap('X.mmap', dtype='float32', mode='w+', shape=(n,m))
rng.random((n, m), 'float32', out=X)
e = time_ns()
print(f"  {(e-s)/samples:.2f} ns per sample, {(e-s)/1e9:.3f} s elapsed")
print("flushing")
X.flush()

y = np.memmap('y.mmap', dtype='float32', mode='w+', shape=(n,))
rng.random((n, ), 'float32', out=y)
y.flush()
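
For completeness, the two scripts above might be driven like this (a sketch only; the file names gen_data.py and repro.py are my assumption and do not appear in the original comment):

# generate the memmapped data, then run the reproduction;
# the optional positional argument overrides the row count n in both scripts
python gen_data.py
python repro.py              # full 18,000,000 rows -> hits the Check failed error
python gen_data.py 9000000   # or use a smaller row count for comparison
python repro.py 9000000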

habemus-papadum avatar Mar 03 '23 20:03 habemus-papadum

Hi @jameslamb and others, I've attached a log that demonstrates the [LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) error (with GPU_DEBUG=5, 'verbose': 3, 'seed': 1234, 'gpu_use_dp': True -- the full script is below).

It is possible for me to deterministically recreate this error.

I was wondering if you had any pointers about how to go about debugging this further.

  • Is left_count==0 usually indicative of a NaN or Inf somewhere? (a quick check is sketched below)
  • GPUTreeLearner::ConstructHistograms and SerialTreeLearner::FindBestSplitsFromHistograms seem to be the places to look, but I'm not entirely sure what to look for...

Thanks in advance for any advice!

log3_trimmed.txt
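
To rule out the NaN/Inf possibility raised above, one quick sanity check on the memmapped inputs could look like the following sketch (chunked so the full array never has to sit in RAM; the helper name all_finite is just illustrative):

import numpy as np

# stream through a (possibly memmapped) array in row chunks and verify
# that every value is finite
def all_finite(a, chunk_rows=1_000_000):
    for start in range(0, a.shape[0], chunk_rows):
        if not np.isfinite(a[start:start + chunk_rows]).all():
            return False
    return True

# usage with the X and y memmaps defined in the script below:
# print("X finite:", all_finite(X), "y finite:", all_finite(y))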

habemus-papadum avatar Mar 08 '23 13:03 habemus-papadum

code for above

import numpy as np
from lightgbm import LGBMRegressor
import sys
from time import time_ns

params = {
    'n_estimators': 500,
    'learning_rate': 0.05,
    'subsample_freq': 6,
    'subsample': 0.91,
    'colsample_bytree': 0.83,
    'colsample_bynode': 0.78,
    'num_leaves': 64,
    'max_depth': 8,
    'reg_alpha': 9,
    'reg_lambda': 3.5,
    'min_child_samples': 200,
    'min_child_weight': 88,
    'max_bin': 71,
    'enable_sparse': False,
    'device_type': "gpu",
    'gpu_use_dp': True,
    'gpu_platform_id': 0,
    'gpu_device_id': 7,
    'seed': 1234,
    'verbose': 3
}

n = 18*1000*1000
m = 1000
if len(sys.argv) == 2:
    n = int(sys.argv[1])

samples = n*m
print(f"N: {n:,}, Samples: {samples:,}")
X = np.memmap('/raid/scratch/nehalp/X64.mmap', dtype='float64', mode='r', shape=(n,m))
y = np.memmap('/raid/scratch/nehalp/y64.mmap', dtype='float64', mode='r', shape=(n,))

model = LGBMRegressor(**params)
print("Calling fit...")
model.fit(X, y)

habemus-papadum avatar Mar 08 '23 13:03 habemus-papadum

I was able to successfully run the large dataset with a change to src/treelearner/ocl/histogram256.cl. I wanted to see if there was some sort of type difference between C++ and OpenCL. The function is defined as follows (with #ifdef constants removed from this post for clarity):

__kernel void histogram256(
__global const uchar4* feature_data_base,
__constant const uchar4* restrict feature_masks __attribute__((max_constant_size(65536))),
const data_size_t feature_size,
__global const data_size_t* data_indices,
const data_size_t num_data,
const score_t const_hessian,
__global const score_t* ordered_gradients, // <----- change to : __global const * ordered_gradients
__global char* restrict output_buf,
__global volatile int * sync_counters,
__global acc_type* restrict hist_buf_base
)

However, if you redefine ordered_gradients as __global const * ordered_gradients, the context will fill in the type, and the large training set runs. At first, I thought score_t was defined differently in the OpenCL code and in C++, but I verified that they are both floats. To validate the results, I ran two smaller dummy datasets, once with ordered_gradients typed explicitly and once without. I compared the resulting model files and found that they were the same.

It's not yet clear to me why the change allows the program to finish and I am investigating this.

tolleybot avatar Apr 03 '23 17:04 tolleybot

Thank you so much for the help! Whenever you feel you've identified the root cause, if you'd like to open a pull request we'd appreciate it, and can help with the contribution and testing process.

jameslamb avatar Apr 03 '23 17:04 jameslamb

I have the same issue. I changed the histogram256.cl file as advised in this thread, but the issue still persists. Surprisingly, if I add is_unbalance=true to my config, it trains the model without any problem. I am running a huge dataset of 14 GB on the GPU. Can someone explain why adding is_unbalance=true made this work?

Bhuvanamitra avatar Apr 11 '23 11:04 Bhuvanamitra

That's interesting. I'll see if I can mimic that later this week when I have more bandwidth. I did trace down where that parameter is used: binary_objective.hpp, line 93:

if (is_unbalance_ && cnt_positive > 0 && cnt_negative > 0) {
  if (cnt_positive > cnt_negative) {
    label_weights_[1] = 1.0f;
    label_weights_[0] = static_cast<double>(cnt_positive) / cnt_negative;
  } else {
    label_weights_[1] = static_cast<double>(cnt_negative) / cnt_positive;
    label_weights_[0] = 1.0f;
  }
}

tolleybot avatar Apr 11 '23 14:04 tolleybot

I still get the same error without modifying the .cl file, using this config. Let me know if yours differs, @Bhuvanamitra:

task = train
objective = regression
boosting_type = gbdt
metric = l2
num_leaves = 64
max_depth = 8
learning_rate = 0.05
n_estimators = 500
subsample_freq = 6
subsample = 0.91
colsample_bytree = 0.83
colsample_bynode = 0.78
min_child_samples = 200
min_child_weight = 88
max_bin = 71
enable_sparse = false
device_type = gpu
gpu_use_dp = false
gpu_platform_id = 0
gpu_device_id = 7
is_unbalance = true

tolleybot avatar Apr 27 '23 15:04 tolleybot

This error seems to occur during the tree building process, indicating a situation where a split was found but the left child of the split doesn't contain any data.

As a workaround, I found that setting min_split_gain to 1 avoids this issue. However, I understand that this solution might not be suitable for all use cases, as it could potentially make the model more conservative about creating new splits, which might not yield the best results in all scenarios.

I wanted to bring this to your attention and see if there might be a more general solution to this issue. Any insights or suggestions would be greatly appreciated.
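
As a concrete sketch of that workaround (reusing the params, X, and y from the earlier reproduction script; only min_split_gain changes):

from lightgbm import LGBMRegressor

# min_split_gain defaults to 0; raising it makes LightGBM skip low-gain splits,
# which is what avoids the failing split in this case
params = dict(params, min_split_gain=1)   # assumes `params` from the earlier script

model = LGBMRegressor(**params)
model.fit(X, y)                           # X, y as in the reproduction script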

tolleybot avatar Jun 05 '23 20:06 tolleybot

Hi @tolleybot, I'm able to reproduce this issue with the example from autogluon (https://auto.gluon.ai/stable/tutorials/tabular/tabular-quick-start.html) after manually upgrading lightgbm to version 4.0.0. I ran it on Google Colab CPU. Unfortunately the min_split_gain trick doesn't work here; I got this error:

Fitting model: LightGBM ...
	Warning: Exception caused LightGBM to fail during training... Skipping this model.
		Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .

Detailed Traceback:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/autogluon/core/trainer/abstract_trainer.py", line 1733, in _train_and_save
    model = self._train_single(X, y, model, X_val, y_val, total_resources=total_resources, **model_fit_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/autogluon/core/trainer/abstract_trainer.py", line 1684, in _train_single
    model = model.fit(X=X, y=y, X_val=X_val, y_val=y_val, total_resources=total_resources, **model_fit_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/autogluon/core/models/abstract/abstract_model.py", line 829, in fit
    out = self._fit(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/autogluon/tabular/models/lgb/lgb_model.py", line 194, in _fit
    self.model = train_lgb_model(early_stopping_callback_kwargs=early_stopping_callback_kwargs, **train_params)
  File "/usr/local/lib/python3.10/dist-packages/autogluon/tabular/models/lgb/lgb_utils.py", line 124, in train_lgb_model
    return lgb.train(**train_params)
  File "/usr/local/lib/python3.10/dist-packages/lightgbm/engine.py", line 266, in train
    booster.update(fobj=fobj)
  File "/usr/local/lib/python3.10/dist-packages/lightgbm/basic.py", line 3557, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "/usr/local/lib/python3.10/dist-packages/lightgbm/basic.py", line 237, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .

No base models to train on, skipping auxiliary stack level 2...

Note that this time it complains about right_count.

mglowacki100 avatar Jul 28 '23 01:07 mglowacki100

Same issue. Is there any method to avoid this issue like changing some params?

WatsonCao avatar Nov 23 '23 10:11 WatsonCao

I had this problem and resolved it by setting min_child_samples to 1 and min_child_weight to 1/X_train.shape[0]. I received the error when they were both set to 0.
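
For clarity, that change might look like this in the scikit-learn interface (a sketch; X_train and y_train are assumed to already exist):

from lightgbm import LGBMRegressor

# small non-zero child constraints instead of 0, as described above
model = LGBMRegressor(
    min_child_samples=1,
    min_child_weight=1.0 / X_train.shape[0],
)
model.fit(X_train, y_train)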

dsilverberg95 avatar Nov 25 '23 18:11 dsilverberg95

I had this problem and resolved it by setting min_child_samples to 1 and min_child_weight to 1/X_train.shape[0]. I received the error when they were both set to 0.

I tried setting them to 1 and 1/X_train.shape[0] as suggested, but that doesn't solve this error for me.

chixujohnny avatar Jan 03 '24 05:01 chixujohnny

Has this problem been solved in version 4.0.0 or any later version?

wqxl309 avatar Jan 23 '24 03:01 wqxl309

I still have this issue using version 4.3.0. Anyone found a solution? Anything I can do to help solve the problem?

flexlev avatar Feb 21 '24 14:02 flexlev

I still have this issue using version 4.3.0. Anyone found a solution? Anything I can do to help solve the problem?

I updated from 3.3.2 (the version where I had this problem) to 4.3.0 and did not recompile anything (so my previous GPU-related build is from 3.3.2, built with cmake). I just ran the previous code directly and the problem disappeared; I'm not sure what happened exactly.

wqxl309 avatar Feb 29 '24 09:02 wqxl309

@jameslamb Hello, this issue has been troubling me for many years. Currently, my training data far exceeds 64 GB, yet I can only sample a dataset of 64 GB in size, which will inevitably affect the model's generalization to some extent. Could you please prioritize a fix for this? Many people have reported this problem, but it remains unresolved despite version updates.

chixujohnny avatar May 23 '24 03:05 chixujohnny

I still have this issue using version 4.3.0. Anyone found a solution? Anything I can do to help solve the problem?

I updated from 3.3.2 (the version where I had this problem) to 4.3.0 and did not recompile anything (so my previous GPU-related build is from 3.3.2, built with cmake). I just ran the previous code directly and the problem disappeared; I'm not sure what happened exactly.

I will give it a try to see if it resolves the issue and get back to you with a response in the near future. This problem has truly been a source of long-standing distress for me

chixujohnny avatar May 23 '24 03:05 chixujohnny

Do you have an NVIDIA GPU? If so, please try the CUDA version of LightGBM instead.

Instructions for that build:

  • https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version
  • https://github.com/microsoft/LightGBM/tree/master/python-package#build-cuda-version

To use it, pass {"device": "cuda"} in params.
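
With the scikit-learn interface, that switch might look like the following sketch (all other parameters omitted):

from lightgbm import LGBMRegressor

# "device" is an alias of device_type; "cuda" selects the CUDA tree learner
# instead of the OpenCL-based "gpu" one
model = LGBMRegressor(device="cuda")
model.fit(X, y)   # X, y as in the earlier reproduction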

That version is more actively maintained and faster, and might not suffer from this issue.

Many people have reported this problem, but it remains unresolved despite version updates.

The OpenCL-based GPU version of LightGBM is effectively unmaintained right now.

  • @shiyu1994 and others at Microsoft seem to have been focusing exclusively on the CUDA implementation.
  • @huanzhang12 , the original author of the OpenCL-based version (#368), has not engaged with this project in several years.
  • @tolleybot was looking into this particular issue last year, but didn't get to the point of submitting any PRs

For those "many people" watching this, here's how you could help:

  1. provide a clear, minimal, reproducible example that always triggers this error
    • including the type of GPU you have, version of LightGBM and all dependencies, fully self-contained code that uses public or synthetic data
    • for help with that: https://stackoverflow.com/help/minimal-reproducible-example
  2. if you understand OpenCL, please come work on updating the -DUSE_GPU=ON build of LightGBM here, and investigate this issue

jameslamb avatar May 23 '24 03:05 jameslamb

Do you have an NVIDIA GPU? If so, please try the CUDA version of LightGBM instead. [...]

Thank you very much, I'll give it a try

chixujohnny avatar May 24 '24 04:05 chixujohnny