Memory leak using RandomForestClassifier and PCA
Describe the bug
I am encountering a persistent memory leak when using RandomForestClassifier and PCA from the sklearnex library. With each iteration of my loop, memory usage grows by approximately 20 MB, which significantly degrades performance during large-scale data processing.
To Reproduce
Steps to reproduce the behavior:
- Setup the environment with sklearnex installed.
- Initialize and configure RandomForestClassifier and PCA.
- Run a loop where RandomForestClassifier and PCA are used on the data.
- Observe the memory usage growth with each iteration.
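The steps above can be turned into a measurable check. The following is only a minimal sketch, not the reporter's actual script: the helper names (memory_growth, leaked_per_iteration, rss_mb) are hypothetical, and rss_mb assumes the third-party psutil package is installed. The step callable is expected to do one fit/transform round with RandomForestClassifier and PCA.

```python
import gc
from typing import Callable, List


def rss_mb() -> float:
    """Resident set size of the current process in MB.

    Assumption: the third-party `psutil` package is available.
    """
    import psutil  # imported lazily so the helpers below work without it

    return psutil.Process().memory_info().rss / 2**20


def memory_growth(
    step: Callable[[], None],
    read_mb: Callable[[], float],
    iterations: int = 20,
) -> List[float]:
    """Call `step` repeatedly and record the memory reading after each call."""
    readings = []
    for _ in range(iterations):
        step()
        # Collect Python-level garbage first so that only memory that is
        # genuinely retained shows up as growth.
        gc.collect()
        readings.append(read_mb())
    return readings


def leaked_per_iteration(readings: List[float]) -> float:
    """Average growth in MB per iteration between the first and last reading."""
    if len(readings) < 2:
        return 0.0
    return (readings[-1] - readings[0]) / (len(readings) - 1)
```

A possible use is `leaked_per_iteration(memory_growth(one_round, rss_mb))`, where `one_round` is a hypothetical function that fits the estimators once. A value that stays near zero suggests stable memory, while a steady positive value (such as the roughly 20 MB reported here) points to a leak.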
Expected behavior
I expect the memory usage to remain stable or return to the baseline after each iteration, ensuring efficient performance during large-scale data processing.
Environment:
- OS: Windows 10
- Compiler: PyCharm
- Version: 2024.1.2 Professional Edition
Hi @cannolis, thank you for the report!
Please share more details about your environment, including the versions of scikit-learn-intelex and daal4py.
Hi @samir-nasibli
Here are the details about my environment:
- Python version: 3.9.19
- scikit-learn-intelex version: 2024.4.0
- daal4py version: 2024.4.0
- scikit-learn version: 1.3.0
Thank you for looking into this issue. I appreciate your help and support. If you need any further information, please let me know.
Hi @cannolis, thank you for raising the issue. Can you please provide a reproducer for your specific case? My initial investigation based on your description doesn't show anything noticeable.
Hi @md-shafiul-alam, I have the same problem in my environment:
- Python version: 3.8.19
- scikit-learn-intelex version: 2024.5.0
- daal4py version: 2024.5.0
- scikit-learn version: 1.3.2
Here is a code example that reproduces the problem. After running this script for a few minutes, my memory usage goes from 700 MB to 1 GB and keeps increasing. Reproducing the problem may take longer, but I am sure it exists, as my training process has been interrupted several times by out-of-memory errors. Once I remove sklearnex, it works fine.
import random

import pandas as pd
from sklearnex import patch_sklearn

# Patch scikit-learn before importing the estimators so that the
# accelerated sklearnex implementations are picked up.
patch_sklearn()

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


class DataLoader:
    def __init__(self, filename):
        # The last column holds the label; all other columns are features.
        self.total = pd.read_csv(filename, header=None)
        self.data = self.total.iloc[:, :-1]
        self.label = self.total.iloc[:, -1]

    def load_data(self, feature: list):
        # Select the requested feature columns, or all of them if none are given.
        if len(feature) != 0:
            selected_data = self.data.iloc[:, feature]
        else:
            selected_data = self.data
        return selected_data, self.label


class Detector:
    def __init__(self):
        self.detector = RandomForestClassifier(random_state=0, n_estimators=50)

    def train_and_test(self, data, label):
        x_train, x_test, y_train, y_test = train_test_split(
            data, label, test_size=0.2, random_state=42
        )
        self.detector.fit(x_train, y_train)
        y_predict = self.detector.predict(x_test)
        accuracy = metrics.accuracy_score(y_test, y_predict)
        precision = metrics.precision_score(
            y_test, y_predict, pos_label=1, average="binary", zero_division=0
        )
        recall = metrics.recall_score(
            y_test, y_predict, pos_label=1, average="binary", zero_division=0
        )
        return {"Accuracy": accuracy, "Precision": precision, "Recall": recall}


DATASET = "KDDTrain+.csv"

# Each iteration reloads the data and trains a fresh model;
# memory usage keeps growing while this loop runs.
while True:
    feature_set = [random.randint(0, 40) for _ in range(9)]
    data, label = DataLoader(DATASET).load_data(feature_set)
    classify_result = Detector().train_and_test(data, label)
    print(classify_result)
The dataset used in the code is here: KDDTrain+.csv
Problem appears to have been fixed with this PR: https://github.com/uxlfoundation/scikit-learn-intelex/pull/2540
The fix should be available in version 2025.7 once it gets released.
This is now solved in the latest release (2025.7).
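To check whether an installed environment already includes the fix, the installed version can be compared against 2025.7. This is a rough sketch under a few assumptions: the helper names are hypothetical, the distribution is assumed to be registered as "scikit-learn-intelex", and versions are assumed to start with a "year.minor" pair, as in the versions quoted in this thread.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# First release reported in this thread to contain the fix.
FIXED_IN = (2025, 7)


def parse_year_minor(version: str) -> Tuple[int, int]:
    """Extract the leading 'year.minor' pair, e.g. '2024.5.0' -> (2024, 5)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def has_leak_fix(version: str) -> bool:
    """True if the given version is 2025.7 or later."""
    return parse_year_minor(version) >= FIXED_IN


def installed_version(package: str = "scikit-learn-intelex") -> Optional[str]:
    """Version of the installed distribution, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("scikit-learn-intelex is not installed")
    else:
        print(version, "includes the fix:", has_leak_fix(version))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "2025.10" as older than "2025.7".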