
Time limit ignored on Linux

Open mnaylor5 opened this issue 5 years ago • 5 comments

Hi Jimmy,

I'm trying to illustrate GOSDT with the diabetes dataset located here, and it seems that the time limit is being ignored. I've tried with continuous features, as well as discretizing on my own, but I can't seem to get anything to return in the amount of time I would expect. I'm running the following code in a Jupyter notebook on a Debian Linux instance with 8 cores and 30GB RAM, so I wouldn't suspect a hardware issue (RAM particularly is hovering around ~1GB used). This example took ~23 minutes on my machine.

## --- env setup --- ##
import pandas as pd
from sklearn.model_selection import train_test_split
import sys
sys.path.append('../GeneralizedOptimalSparseDecisionTrees/python/') # location of cloned GOSDT repo
from model.gosdt import GOSDT 

## --- load data (directly from Kaggle location) --- ##
diabetes = pd.read_csv('diabetes.csv')

## --- same training/test split I'm using --- ##
train, _ = train_test_split(diabetes, random_state=0, test_size=0.2)
X = train.drop(columns="Outcome")
y = train['Outcome']

## --- specify and fit model --- ##
hyperparams = {
    "regularization": 0.1,
    "time_limit":10,
    "precision_limit":0.1,
    "worker_limit":8,
    "verbose": True
}

model = GOSDT(hyperparams)
model.fit(X, pd.DataFrame(y))
print(model.time / 60) # 23.861
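For reference, the manual discretization I tried looked roughly like this (a sketch; the cut points are hand-picked by me, not thresholds GOSDT requires):

```python
import pandas as pd

def binarize(df, thresholds):
    """One binary indicator column per (feature, threshold) pair."""
    out = pd.DataFrame(index=df.index)
    for col, cuts in thresholds.items():
        for t in cuts:
            out[f"{col}>={t}"] = (df[col] >= t).astype(int)
    return out

# tiny stand-in for the real diabetes frame
demo = pd.DataFrame({"Glucose": [90, 130, 150], "BMI": [22.0, 31.5, 28.0]})
X_bin = binarize(demo, {"Glucose": [126, 144], "BMI": [30]})
print(X_bin.columns.tolist())  # ['Glucose>=126', 'Glucose>=144', 'BMI>=30']
```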

A potentially separate issue is that I have only been able to get trees with a single split and two terminal nodes. This is perhaps due in part to the time limit issue, which makes any regularization lower than ~0.1 impractical to run, but I wanted to see if you could offer any advice. Here is the tree I'm getting, pretty much regardless of which combination of hyperparameters I use:

if 144 <= Glucose then:
    predicted class: 1
    misclassification penalty: 0.065
    complexity penalty: 0.1

else if Glucose < 144 then:
    predicted class: 0
    misclassification penalty: 0.191
    complexity penalty: 0.1
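For context on why that might happen: if I understand the GOSDT objective from the paper (misclassification rate plus the regularization times the number of leaves), the stump above scores as follows, and any deeper tree has a high bar to clear at regularization 0.1. A quick back-of-envelope check (my own arithmetic, not library output):

```python
# GOSDT objective (per the paper): misclassification rate + lam * (number of leaves)
lam = 0.1
stump_risk = 0.065 + 0.191         # misclassification penalties of the two leaves
stump_obj = stump_risk + lam * 2   # two leaves, each costing lam

# Any deeper tree must cut misclassification by more than lam = 0.1
# per extra leaf just to break even, which is a high bar on this data.
print(round(stump_obj, 3))  # 0.456
```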

Thank you! -Mitch

mnaylor5 avatar Apr 02 '21 17:04 mnaylor5

Hi @mnaylor5,

Did you manage to find a fix or workaround for this issue? I am also trying out the library and noticing that the time limit setting is ignored. I am running Ubuntu 10.04.

abhishek-ghose avatar Jun 13 '21 18:06 abhishek-ghose

Hi @abhishek-ghose, sorry for the slow response. I have not figured out a workaround - I've been using other optimal tree libraries instead.

mnaylor5 avatar Jun 26 '21 18:06 mnaylor5

Thank you @mnaylor5!

abhishek-ghose avatar Jun 26 '21 20:06 abhishek-ghose

Hi @mnaylor5! Sorry for the late response to this issue.

It appears this is a bug caused by the polling frequency of the optimizer, which is controlled by a configuration value in src/optimizer.hpp:88. The tick-duration member controls how many iterations the optimizer runs before checking the elapsed time. Since iterations were very fast in our experiments, checking once every 10000 iterations was a suitable balance between not spending too much time reading the clock and still stopping reasonably close to the desired time limit.

For the dataset you provided, it appears the iterations can be much slower, likely due to the large branching factor, so checking only every 10000 iterations doesn't work very well. As an immediate solution, I was able to get a more reasonable stopping precision with a tick-duration of 10 iterations (simply change the 10000 to 10 in src/optimizer.hpp:88 and recompile the program).
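To put numbers on it: with a coarse tick-duration, the overshoot past the limit is roughly the tick-duration times the per-iteration cost, since the clock is never consulted between checks. A toy simulation of the loop (plain Python with simulated time in milliseconds, not the actual optimizer code):

```python
def run(limit_ms, tick_duration, ms_per_iter):
    """Simulate an optimizer loop that only checks the clock every
    `tick_duration` iterations; returns total simulated elapsed ms."""
    elapsed = 0
    iters = 0
    while True:
        elapsed += ms_per_iter   # one unit of work
        iters += 1
        if iters % tick_duration == 0 and elapsed >= limit_ms:
            return elapsed

# fast iterations (1 ms each): stops exactly at the 10 s limit
print(run(10_000, 10_000, 1))    # 10000
# slow iterations (100 ms each): first clock check is ~17 minutes in
print(run(10_000, 10_000, 100))  # 1000000
# tick-duration of 10 restores a tight stop for the slow case
print(run(10_000, 10, 100))      # 10000
```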

This should fix things for your specific case. I'll try to think about what might be a more general solution.

Jimmy-Lin avatar Jun 28 '21 02:06 Jimmy-Lin

Hey @Jimmy-Lin - thanks for the response! I made the change you suggested, and it seems to successfully enforce the time limit.

This seems to lead to a couple of other issues. First, there appears to be a memory leak, or at least excessive RAM usage: this dataset is pretty small (614 observations in the training set, 8 continuous features, and a binary classification target), but a training run with a 1 hr time limit uses ~190GB of RAM. Second, I'm still getting exactly the same basic tree as the output of that 1 hr run on a larger machine (32 cores and 208GB RAM) - is this expected? I would think it should have improved on the initial tree within an hour of searching, but that doesn't seem to be the case.
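In the meantime, I've been considering guarding runs with a process-level hard timeout. This is a generic workaround, not GOSDT-specific (a sketch; model.fit(X, y) would go where slow_job is, and it relies on the default fork start method on Linux):

```python
import multiprocessing as mp
import time

def fit_with_timeout(target, timeout_s, *args):
    """Run target(*args) in a child process and kill it past the deadline.
    Returns True if it finished in time, False if it was terminated."""
    p = mp.Process(target=target, args=args)
    p.start()
    p.join(timeout_s)
    if p.is_alive():
        p.terminate()   # hard stop; any in-progress work is lost
        p.join()
        return False
    return True

def slow_job():         # stand-in for model.fit(X, pd.DataFrame(y))
    time.sleep(60)

finished = fit_with_timeout(slow_job, 1)   # False: killed after ~1 s
```

The downside is that a terminated run returns nothing at all, so this enforces the wall clock but can't recover the best tree found so far.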

Any advice would be greatly appreciated! Thanks again!

mnaylor5 avatar Jul 02 '21 17:07 mnaylor5