Memory leak using RandomForestClassifier and PCA

Open cannolis opened this issue 1 year ago • 4 comments

Describe the bug

I am encountering a persistent memory leak when using RandomForestClassifier and PCA from the sklearnex library. With each iteration of my loop, the memory usage increases by approximately 20MB, which significantly impacts performance during large-scale data processing.

To Reproduce

Steps to reproduce the behavior:

  1. Setup the environment with sklearnex installed.
  2. Initialize and configure RandomForestClassifier and PCA.
  3. Run a loop where RandomForestClassifier and PCA are used on the data.
  4. Observe the memory usage growth with each iteration.
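The report does not say how memory was measured. As a minimal sketch of step 4, assuming a hypothetical helper named `memory_growth` (not part of sklearnex or scikit-learn), the loop body can be wrapped in a function and the memory delta recorded after each call. The default reader uses the standard library's `tracemalloc`, which only sees allocations made through Python's allocator; the demo step below is a deliberately leaky stand-in, not the estimator code from this issue.

```python
import gc
import tracemalloc


def traced_bytes():
    """Bytes currently allocated through Python's allocator, per tracemalloc."""
    return tracemalloc.get_traced_memory()[0]


def memory_growth(step, iterations=5, read_memory=traced_bytes):
    """Call step() repeatedly and return the memory delta after each call.

    Hypothetical helper for illustration: deltas are measured against a
    baseline taken after one warm-up call, so one-time caches are not
    mistaken for a leak.
    """
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    try:
        step()        # warm-up call
        gc.collect()  # drop collectable garbage before taking the baseline
        baseline = read_memory()
        deltas = []
        for _ in range(iterations):
            step()
            gc.collect()
            deltas.append(read_memory() - baseline)
        return deltas
    finally:
        if started_here:
            tracemalloc.stop()


# Stand-in for the real loop body: keeps 1 MB alive on every call.
_retained = []


def leaky_step():
    _retained.append(bytearray(1_000_000))


deltas = memory_growth(leaky_step, iterations=3)
print(deltas)  # steadily increasing values indicate retained memory
```

A leak inside native code would likely not show up in `tracemalloc`. In that case, a reader of the process's resident set size can be passed as `read_memory`, for example `lambda: psutil.Process().memory_info().rss` if the third-party psutil package is installed (an assumption, since the thread does not mention it).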

Expected behavior

I expect the memory usage to remain stable or return to the baseline after each iteration, ensuring efficient performance during large-scale data processing.

Environment:

  • OS: Windows 10
  • Compiler: PyCharm
  • Version: 2024.1.2 Professional Edition

cannolis avatar Jun 20 '24 09:06 cannolis

Hi @cannolis, thank you for the report! Please share more details about your environment, including the versions of scikit-learn-intelex and daal4py.

samir-nasibli avatar Jun 20 '24 12:06 samir-nasibli

Hi @samir-nasibli

Here are the details about my environment:

  • Python version: 3.9.19
  • scikit-learn-intelex version: 2024.4.0
  • daal4py version: 2024.4.0
  • scikit-learn version: 1.3.0

Thank you for looking into this issue. I appreciate your help and support. If you need any further information, please let me know.

cannolis avatar Jun 20 '24 12:06 cannolis

Hi @cannolis, thank you for raising the issue. Can you please provide a reproducer for your specific case? My initial investigation based on your description doesn't show anything noticeable.

md-shafiul-alam avatar Jun 28 '24 14:06 md-shafiul-alam

Hi @md-shafiul-alam, I had the same problem here in my environment:

  • Python version: 3.8.19
  • scikit-learn-intelex version: 2024.5.0
  • daal4py version: 2024.5.0
  • scikit-learn version: 1.3.2

Here is a code example that reproduces the problem. After running this script for a few minutes, my memory usage goes up from 700 MB to 1 GB, and it keeps increasing. Reproducing the problem may take longer, but I'm sure it exists, as my training process has been interrupted several times by out-of-memory errors. Once I remove sklearnex, it works fine.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import random
from sklearnex import patch_sklearn

patch_sklearn()

class DataLoader:
    def __init__(self, filename):
        self.total = pd.read_csv(filename, header=None)
        self.data = self.total.iloc[:, :-1]
        self.label = self.total.iloc[:, -1]

    def load_data(self, feature: list):
        if len(feature) != 0:
            selected_data = self.data.iloc[:, feature]
        else:
            selected_data = self.data

        return selected_data, self.label


class Detector:
    def __init__(self):
        self.detector = RandomForestClassifier(random_state=0, n_estimators=50)

    def train_and_test(self, data, label):

        x_train, x_test, y_train, y_test = train_test_split(
            data, label, test_size=0.2, random_state=42
        )

        self.detector.fit(x_train, y_train)
        y_predict = self.detector.predict(x_test)

        accuracy = metrics.accuracy_score(y_test, y_predict)

        precision = metrics.precision_score(
            y_test, y_predict, pos_label=1, average="binary", zero_division=0
        )

        recall = metrics.recall_score(
            y_test, y_predict, pos_label=1, average="binary", zero_division=0
        )

        result = {}
        result["Accuracy"] = accuracy
        result["Precision"] = precision
        result["Recall"] = recall

        return result


DATASET = "KDDTrain+.csv"

while True:

    feature_set = [random.randint(0, 40) for _ in range(9)]

    data, label = DataLoader(DATASET).load_data(feature_set)

    classify_result = Detector().train_and_test(data, label)

    print(classify_result)

The dataset used in the code is KDDTrain+.csv.

WindBlowAssCold avatar Sep 24 '24 09:09 WindBlowAssCold

Problem appears to have been fixed with this PR: https://github.com/uxlfoundation/scikit-learn-intelex/pull/2540

The fix should be available in version 2025.7 once it gets released.

david-cortes-intel avatar Jun 26 '25 15:06 david-cortes-intel

This is now solved in the latest release (2025.7).

david-cortes-intel avatar Jul 10 '25 06:07 david-cortes-intel
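For readers who want to confirm that their environment includes the release named above, the following is a minimal sketch using only the standard library. The helper names (`parse_version`, `has_fix`, `installed_has_fix`) are hypothetical, and the threshold `(2025, 7)` is taken from the comments in this thread rather than from any official compatibility table.

```python
import re
from importlib import metadata

# First release reported in this thread to contain the fix.
FIXED_IN = (2025, 7)


def parse_version(text):
    """Return the first two numeric components of a version string as ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:2])


def has_fix(version_text):
    """True if the given version string is at least FIXED_IN."""
    return parse_version(version_text) >= FIXED_IN


def installed_has_fix(dist_name="scikit-learn-intelex"):
    """Check the installed distribution; return None if it is not installed."""
    try:
        return has_fix(metadata.version(dist_name))
    except metadata.PackageNotFoundError:
        return None


# Prints True, False, or None depending on the local environment.
print(installed_has_fix())
```

With the versions quoted earlier in the thread, `has_fix("2024.4.0")` and `has_fix("2024.5.0")` both return `False`, which matches the reports that those releases were affected.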