[MODULE] - Balanced active learner
Please describe the module you would like to add to bricks
I just trained an active learner on a heavily imbalanced binary problem, and the following script helped me to try out a balanced model.
Do you already have an implementation? This is not the final brick:
```python
import numpy as np
from sklearn.utils import shuffle
from sklearn.linear_model import LogisticRegression
# you can find further models here: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

class MyActiveLearner(LearningClassifier):
    def __init__(self):
        self.model = LogisticRegression()

    @params_fit(
        embedding_name="box_content-classification-distilbert-base-uncased",  # pick this from the options above
        train_test_split=0.5,  # we currently have this fixed, but you'll soon be able to specify this individually!
    )
    def fit(self, embeddings, labels):
        # convert string labels to integers
        label_dict = {"No": 0, "Yes": 1}
        labels_int = np.array([label_dict[label] for label in labels])

        # separate majority and minority class data
        majority_emb = embeddings[labels_int == 0]
        minority_emb = embeddings[labels_int == 1]

        # downsample: randomly select a majority subset of minority-class size
        minority_size = len(minority_emb)
        majority_emb = shuffle(majority_emb)[:minority_size]

        # combine the balanced data
        embeddings_resampled = np.concatenate([majority_emb, minority_emb])
        labels_resampled_int = np.concatenate(
            [np.zeros(len(majority_emb), dtype=int), np.ones(len(minority_emb), dtype=int)]
        )

        # convert integer labels back to string labels via the inverse mapping
        inverse_label_dict = {value: key for key, value in label_dict.items()}
        labels_resampled = np.array([inverse_label_dict[label] for label in labels_resampled_int])

        # fit the model with the resampled data
        self.model.fit(embeddings_resampled, labels_resampled)

    @params_inference(
        min_confidence=0.5,
        label_names=None,  # you can specify a list to filter the predictions (e.g. ["label-a", "label-b"])
    )
    def predict_proba(self, embeddings):
        return self.model.predict_proba(embeddings)
```
Additional context -
In general, what could be helpful for the active learners is to have a couple of examples of how the fit method can be modified. It's not always obvious, but you can do quite a lot.
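One such modification that avoids resampling entirely, as a hedged sketch: scikit-learn's LogisticRegression (and several other sklearn classifiers) accepts class_weight="balanced", which reweights samples inversely to class frequency, so the fit body could stay a one-liner. The toy data below is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy imbalanced data: 90 "No" samples vs. 10 "Yes" samples in 4 dimensions
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(0, 1, (90, 4)), rng.normal(2, 1, (10, 4))])
y = np.array(["No"] * 90 + ["Yes"] * 10)

# class_weight="balanced" reweights each class inversely to its frequency,
# so no manual down- or upsampling is needed before fitting
model = LogisticRegression(class_weight="balanced")
model.fit(X, y)

print(model.classes_)  # ['No' 'Yes']
```

This keeps all of the (scarce) minority data and all of the majority data, at the cost of relying on the estimator supporting class weights.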
I was wondering if it would be better to sample the minority class up to the length of the majority class instead of reducing the size of the majority class, especially for smaller datasets.
you mean upsample instead of downsample? Could be a strategy or another brick :)
exactly, maybe we could pass the sampling strategy as an argument
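A rough sketch of what that argument could look like. The helper name and the strategy values are made up for illustration; upsampling here just repeats minority samples with replacement via sklearn.utils.resample, which is a real sklearn utility.

```python
import numpy as np
from sklearn.utils import resample

def balance(embeddings, labels_int, strategy="downsample", seed=42):
    """Hypothetical helper: balance a binary dataset with the given strategy."""
    majority_emb = embeddings[labels_int == 0]
    minority_emb = embeddings[labels_int == 1]

    if strategy == "downsample":
        # shrink the majority class to the minority size (no replacement)
        majority_emb = resample(
            majority_emb, replace=False, n_samples=len(minority_emb), random_state=seed
        )
    elif strategy == "upsample":
        # repeat minority samples (with replacement) up to the majority size
        minority_emb = resample(
            minority_emb, replace=True, n_samples=len(majority_emb), random_state=seed
        )

    embeddings_resampled = np.concatenate([majority_emb, minority_emb])
    labels_resampled = np.concatenate(
        [np.zeros(len(majority_emb), dtype=int), np.ones(len(minority_emb), dtype=int)]
    )
    return embeddings_resampled, labels_resampled

# toy data: 8 majority samples, 2 minority samples
X = np.arange(20).reshape(10, 2).astype(float)
y = np.array([0] * 8 + [1] * 2)

X_down, y_down = balance(X, y, strategy="downsample")
X_up, y_up = balance(X, y, strategy="upsample")
print(len(y_down), len(y_up))  # 4 16
```

Downsampling throws majority data away, while upsampling keeps it all but duplicates minority points, which can make the model overconfident on those exact samples.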
Great idea for a brick, Johannes! Perhaps a brick using SMOTE would also be a good idea, although I am not sure how and whether synthetic data works on text data.
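For what it's worth, SMOTE operates on the numeric feature vectors rather than the raw text, so it could in principle run on the embeddings; whether the interpolated points are meaningful embeddings is the open question. Below is a minimal numpy sketch of the core SMOTE idea only (interpolating between a minority point and one of its k nearest minority neighbors); the real, battle-tested implementation lives in imbalanced-learn.

```python
import numpy as np

def smote_like_oversample(minority_emb, n_new, k=3, seed=0):
    """Sketch of the SMOTE idea: synthesize minority points by interpolating
    between a sample and one of its k nearest minority neighbors.
    Not the full imbalanced-learn implementation."""
    rng = np.random.default_rng(seed)

    # pairwise distances within the minority class
    dists = np.linalg.norm(minority_emb[:, None] - minority_emb[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude each point from its own neighbors
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority_emb))  # pick a random minority sample
        j = rng.choice(neighbors[i])         # pick one of its neighbors
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(minority_emb[i] + gap * (minority_emb[j] - minority_emb[i]))
    return np.array(synthetic)

# toy minority class: 5 points in 4 dimensions
minority = np.random.default_rng(1).normal(size=(5, 4))
new_points = smote_like_oversample(minority, n_new=10)
print(new_points.shape)  # (10, 4)
```

Each synthetic point lies on the segment between two real minority embeddings, so it at least stays inside the minority region of the embedding space, even if it doesn't correspond to any actual text.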