dask-xgboost

unable to finish training

Open sjl070707 opened this issue 8 years ago • 22 comments

Setup: dask 0.14 (pip installed), xgboost 0.62 (conda installed), dask-xgboost 0.10.X, with distributed.comm.addressing modified so that import dask_xgboost loads without error (https://github.com/dask/dask-xgboost/issues/1).

I was following the example here, https://gist.github.com/mrocklin/3696fe2398dc7152c66bf593a674e4d9

[Screenshot: 2017-03-05, 3:42:07 AM]

It produces the job, and it looks like it runs for a few minutes.

[Screenshot: 2017-03-05, 3:37:38 AM]

However, some errors appear, and the job neither finishes nor crashes my Python code.

[Screenshot: 2017-03-05, 3:38:01 AM]

I wish I could provide more logs.

sjl070707 avatar Mar 05 '17 08:03 sjl070707

I recreated the container with the right versions of both the Python and system dependencies:

g++ gcc

pip install xgboost==0.6a2 dask-xgboost==0.1.0

python 3.5.4 via https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-x86_64.sh miniconda3

[Screenshot: 2017-03-05, 5:11:32 AM] [Screenshot: 2017-03-05, 5:11:40 AM]

The process is stuck at the training stage (booster).

sjl070707 avatar Mar 05 '17 10:03 sjl070707

It looks like an exception occurs and dask-xgboost is trying to handle the error, or it is just waiting a very long time to sync.

Traceback (most recent call last):
  File "/work/miniconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 6, in
    bst = dxgb.train(client, params, data_train, labels_train)
  File "/work/miniconda/lib/python3.5/site-packages/dask_xgboost/core.py", line 154, in train
    return sync(client.loop, _train, client, params, data, labels, **kwargs)
  File "/work/miniconda/lib/python3.5/site-packages/distributed/utils.py", line 202, in sync
    e.wait(1000000)
  File "/work/miniconda/lib/python3.5/threading.py", line 549, in wait
    signaled = self._cond.wait(timeout)
  File "/work/miniconda/lib/python3.5/threading.py", line 297, in wait
    gotit = waiter.acquire(True, timeout)

sjl070707 avatar Mar 05 '17 18:03 sjl070707

> The process is stuck at the training stage (booster).

Do we know why it's stuck? Does XGBoost provide logs here? At this stage Dask has done all of its work and has handed off control to XGBoost. Unfortunately we don't have any visibility into what's going on internally.

mrocklin avatar Mar 08 '17 14:03 mrocklin
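
One way to get at least some visibility into where a Python call is blocked is the standard library's faulthandler module, which can dump every thread's stack after a timeout. The sketch below is only an illustration and not something dask-xgboost provides: the helper name dump_stacks_if_stuck and the 60-second timeout in the usage comment are assumptions. It covers only the process it runs in (the client, unless the same thing is arranged on the workers), and it shows Python frames only, so it would not reveal what XGBoost's native code is doing internally.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_stuck(timeout_s, file=None):
    """Dump all thread stacks to `file` if the body runs longer than `timeout_s`.

    `file` must be backed by a real file descriptor; it defaults to sys.stderr.
    The process keeps running after the dump (exit=False).
    """
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=target)
    try:
        yield
    finally:
        # Cancel the pending dump once the body finishes (or raises).
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the blocking call from the traceback earlier in the thread:
#
#     with dump_stacks_if_stuck(60):
#         bst = dxgb.train(client, params, data_train, labels_train)
```

If the call returns before the timeout, nothing is printed; otherwise the dump shows which frame each thread is sitting in, which helps distinguish "waiting on an event" from "still computing".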

I have the same issue here using bioconda's xgboost 0.6.a2 and dask-xgboost from pip. Stuck at the same place. Where can I find the log for XGBoost?

DigitalPig avatar May 09 '17 03:05 DigitalPig

@mrocklin What version of xgboost do you use? I was wondering whether the hang is caused by the newly released xgboost. Did you compile xgboost from source or install it from conda?

DigitalPig avatar May 09 '17 13:05 DigitalPig

> @mrocklin What version of xgboost do you use? I was wondering whether the hang is caused by the newly released xgboost. Did you compile xgboost from source or install it from conda?

Perhaps. I don't use xgboost regularly and so don't have a standard version. I probably used whatever was recent when I first wrote this code.

Running the test suite today after conda-installing xgboost from the conda-forge channel I find that dataframe tests pass but that the dask.array test segfaults. I don't experience the same behavior as above. I'll try to go over it again with newer libraries sometime, but can't promise that this will happen any time soon. Any help from others would be welcome here.

mrocklin avatar May 09 '17 19:05 mrocklin

Ah, I did not notice that we have test code there. Great. I can play around with it to see what I can do.

PS: Can you also reopen this issue?

DigitalPig avatar May 09 '17 21:05 DigitalPig

Thank you for your effort here @DigitalPig. Reopened.

mrocklin avatar May 09 '17 21:05 mrocklin

I found something very interesting while trying to figure out why it gets stuck. I ran the test code on my Ubuntu 16.10 machine with conda, Python 3.6, xgboost from conda-forge, and dask-xgboost from GitHub. Everything seems to be fine. All tests pass, and training on the Titanic data with dask_xgboost also succeeded, apart from a complaint about the "hist" option in params.

But on my cluster, which is built on AWS EC2 RHEL7, test_numpy does not pass. Training on the Titanic data also gets stuck. Same conda environment.

I am going to provision an Ubuntu cluster and see if I can reproduce the issue.

DigitalPig avatar May 12 '17 06:05 DigitalPig

Indeed, seeing tests behave differently in the same conda environment is quite odd.

mrocklin avatar May 12 '17 12:05 mrocklin

After provisioning an Ubuntu 14.04 cluster with the same setup, all tests pass now. The toy example using the Titanic data also runs through, both under LocalCluster and on the real cluster.

I think this may be due to an issue somewhere in RHEL7, although I am not sure where.

Also, it would be really great if we could grab output from xgboost during the process.

DigitalPig avatar May 13 '17 05:05 DigitalPig

> Also, it would be really great if we could grab output from xgboost during the process.

Presumably this is passing through stdout or the Python logging module? Historically Dask has relied on cluster managers to handle logs. For LocalCluster you can start with LocalCluster(silence_logs=False) to get output on stdout/stderr.

This comes up often enough that we might want some mechanism to stream logs back, though. I'll ponder this, though it's unlikely to be solved immediately.

mrocklin avatar May 30 '17 14:05 mrocklin

Thank you for tracking down the issue with system libraries by the way. Any thoughts on which dependency within Ubuntu 14.04 vs RHEL7 might be relevant here?

mrocklin avatar May 30 '17 14:05 mrocklin
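
Since the conda environment was reported as identical on both systems, one place to look is what lives outside it, such as the OS and the system C library. The sketch below is a hypothetical helper (the name environment_fingerprint and the package list are assumptions, not part of dask-xgboost) that collects a few such facts using only the standard library, so the output from an Ubuntu machine and a RHEL7 machine can be compared side by side. It requires Python 3.8 or newer for importlib.metadata.

```python
import platform
import sys
from importlib import metadata


def environment_fingerprint(packages):
    """Collect OS, libc, Python, and package versions for side-by-side comparison."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    libc_name, libc_version = platform.libc_ver()
    return {
        "platform": platform.platform(),
        "libc": f"{libc_name} {libc_version}".strip(),
        "python": sys.version.split()[0],
        "packages": versions,
    }


if __name__ == "__main__":
    # The package list is an assumption; adjust it to whatever the environment uses.
    names = ["xgboost", "dask", "distributed", "dask-xgboost"]
    for key, value in environment_fingerprint(names).items():
        print(key, value)
```

Running the same script on each machine and diffing the output is a cheap first step before digging into individual shared libraries.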

Not at this point... But I will spin up smaller clusters with the two distros and dig a little bit more. Any findings from your side?

DigitalPig avatar Jun 15 '17 03:06 DigitalPig

No findings from my side. To be honest I haven't looked into this problem much (there have been a few other things going on for me). My apologies for not contributing here.

mrocklin avatar Jun 15 '17 11:06 mrocklin

Hi there! I am having a very similar problem. Using Docker containers on CentOS Linux release 7.3.1611 (Core), everything with dask/distributed seems to work fine (basic tests, dask grid search, joblib integration), but when I use dxgb.train for a very small training task, it never finishes. I see some changes on the Dask UI, but then it stops.

Interestingly enough, dxgb.train runs fine locally on my windows docker env, but not on the centos docker env (distributed).

(using a docker image based on ogrisel/distributed)

FROM ogrisel/distributed
RUN pip install --force-reinstall dask-xgboost
RUN conda install -y py-xgboost
RUN conda install -y seaborn
RUN conda install -y dask-searchcv -c conda-forge

rquintino avatar Aug 08 '17 10:08 rquintino

I am encountering the same problem too.

import dask
import dask.dataframe as dd
from dask.distributed import Client
import dask_xgboost as dxgb

client = Client('192.168.50.211:8786')
client.restart()

df = dd.read_csv("adult_comp_cont", storage_options={'anon' : True})
df = df[:100]
df.columns = [str(i) for i in range(6)] + ['target']
Y = df['target']
X = df.drop('target', axis=1)

x, y = dask.persist(X, Y)
params = {'objective' :'binary:logistic', 'n_estimators' : 10, 'max_depth' : 3, 'learning_rate' : 0.033}
dxgb.train(client, params, x, y)

It's getting stuck indefinitely.

sagnik-rzt avatar Jun 08 '18 12:06 sagnik-rzt

> It's getting stuck indefinitely.

What's happening on the dashboard at this point? Are you sure the data has finished loading with the call to .persist?

TomAugspurger avatar Jun 08 '18 12:06 TomAugspurger

Thanks for pointing out the mistake! I did this:

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
import dask_xgboost as dxgb

lc = LocalCluster(processes=False, scheduler_port=8989)
client = Client(lc.scheduler_address)

df = dd.read_csv("adult_comp_cont", storage_options={'anon' : True})
df = df[:100]
df.columns = [str(i) for i in range(6)] + ['target']
Y = df['target']
X = df.drop('target', axis=1)

params = {'objective' :'binary:logistic', 'n_estimators' : 10, 'max_depth' : 3, 'learning_rate' : 0.033}
dxgb.train(client, params, X, Y)

and got this successfully:

[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3

The delay was caused by some conflict between the scheduler and the workers, so I used LocalCluster with processes=False.

sagnik-rzt avatar Jun 09 '18 08:06 sagnik-rzt

I am encountering the same problem. I set up dask-xgboost on Kubernetes, and a small training set went through smoothly. Then I tried the 8 GB HIGGS dataset and saw the same error: the job runs for a while before entering training, then it is stuck in training forever and CPU usage on all workers drops to between 0% and 2%. Logs are very limited.

Has anybody ever tried Dask + XGBoost on a large dataset that cannot fit in one machine's memory?

hiyangbo avatar Feb 05 '19 01:02 hiyangbo

I am also facing the same issue. I am trying to run dask_xgboost with the GPU option enabled on a large dataset.

dask_xgboost works fine when the dataset is small. With 10K, 100K, and 1M data points, it worked perfectly. When I increased the size to 10M, it fails and the Dask dashboard stops responding. It fails before entering the "train_part".

For your reference, I am using the below code for this experiment.

import dask.dataframe as dd
import dask_xgboost as dxgb
df = dd.read_csv("large_dataset.csv")
y = df['target']
X = df.drop(columns=['target'])
params = {'objective': 'reg:linear', 'nround': 1000,
          'max_depth': 16, 'eta': 0.01, 'subsample': 0.5,
          'min_child_weight': 1, 'tree_method': 'gpu_hist'}
bst = dxgb.train(client, params, X, y)

FYI: I am working on an AWS EMR cluster. It can scale up to 21 nodes, each with a capacity of ~2.2GB.

Could you shed some light on this? Any suggestions for making this work would be appreciated.

@mrocklin @TomAugspurger

Abhishekmamidi123 avatar Apr 30 '20 12:04 Abhishekmamidi123

XGBoost expects the dataset to fit comfortably in memory. Perhaps the dataset is larger than RAM in the way that XGBoost stores it? I would look at the Dask dashboard and see if worker memory was getting high.

mrocklin avatar Apr 30 '20 14:04 mrocklin
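
The memory concern can be checked with rough arithmetic before training. The sketch below is a back-of-the-envelope estimate, not how XGBoost actually accounts for memory: it assumes a dense matrix with 4 bytes per value plus one label per row, and the column count of 100 is a hypothetical placeholder because the thread does not state it. The row count (10M) and the cluster figures (21 nodes of roughly 2.2 GB each) come from the earlier comment. The headroom factor of 0.5 is also an assumption, meant to leave room for intermediate copies, since Dask's partitions and XGBoost's own copy of the data can coexist in memory during training.

```python
def dense_bytes(n_rows, n_features, bytes_per_value=4):
    """Bytes for an n_rows x n_features dense matrix plus one label per row."""
    return n_rows * (n_features + 1) * bytes_per_value


def fits_in_cluster(required_bytes, n_workers, bytes_per_worker, headroom=0.5):
    """True if the data fits within the fraction `headroom` of total worker memory."""
    return required_bytes <= n_workers * bytes_per_worker * headroom


GB = 10**9
# 100 feature columns is an assumed value; substitute the real column count.
required = dense_bytes(n_rows=10_000_000, n_features=100)
print(f"estimated size: {required / GB:.2f} GB")
print("fits:", fits_in_cluster(required, n_workers=21, bytes_per_worker=2.2 * GB))
```

An estimate like this is only a lower bound; the worker memory plots on the Dask dashboard remain the more reliable signal.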