[MODULE] - Balanced active learner
Please describe the module you would like to add to bricks
I just trained an active learner on a heavily imbalanced binary problem, and the following script helped me to try out a balanced model.
Do you already have an implementation? This is not the final brick:
```python
import numpy as np
from sklearn.utils import shuffle
from sklearn.linear_model import LogisticRegression
# you can find further models here: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

class MyActiveLearner(LearningClassifier):
    def __init__(self):
        self.model = LogisticRegression()

    @params_fit(
        embedding_name="box_content-classification-distilbert-base-uncased",  # pick this from the options above
        train_test_split=0.5,  # we currently have this fixed, but you'll soon be able to specify this individually!
    )
    def fit(self, embeddings, labels):
        # convert string labels to integers
        label_dict = {"No": 0, "Yes": 1}
        labels_int = np.array([label_dict[label] for label in labels])

        # separate majority and minority class data
        majority_emb = embeddings[labels_int == 0]
        minority_emb = embeddings[labels_int == 1]

        # downsample: randomly select a majority subset of minority-class size
        minority_size = len(minority_emb)
        majority_emb = shuffle(majority_emb)[:minority_size]

        # combine the balanced data
        embeddings_resampled = np.concatenate([majority_emb, minority_emb])
        labels_resampled_int = np.concatenate(
            [np.zeros(len(majority_emb), dtype=int), np.ones(len(minority_emb), dtype=int)]
        )

        # convert integer labels back to string labels via the inverse mapping
        inverse_label_dict = {value: key for key, value in label_dict.items()}
        labels_resampled = np.array([inverse_label_dict[label] for label in labels_resampled_int])

        # fit the model with the resampled data
        self.model.fit(embeddings_resampled, labels_resampled)

    @params_inference(
        min_confidence=0.5,
        label_names=None,  # you can specify a list to filter the predictions (e.g. ["label-a", "label-b"])
    )
    def predict_proba(self, embeddings):
        return self.model.predict_proba(embeddings)
```
Additional context -
In general, what could be helpful for the active learners is to have a couple of examples of how the fit method can be modified. It's not always obvious, but you can do quite a lot.
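One such modification that avoids resampling entirely, as a hedged sketch: scikit-learn's LogisticRegression (and several other sklearn classifiers) accepts class_weight="balanced", which reweights samples inversely to class frequency, so the fit body could stay a one-liner. The toy data below is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy imbalanced data: 90 "No" samples vs. 10 "Yes" samples in 4 dimensions
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(0, 1, (90, 4)), rng.normal(2, 1, (10, 4))])
y = np.array(["No"] * 90 + ["Yes"] * 10)

# class_weight="balanced" reweights each class inversely to its frequency,
# so no manual down- or upsampling is needed before fitting
model = LogisticRegression(class_weight="balanced")
model.fit(X, y)

print(model.classes_)  # ['No' 'Yes']
```

This keeps all of the (scarce) minority data and all of the majority data, at the cost of relying on the estimator supporting class weights.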
I was wondering if it would be better to sample the minority class up to the length of the majority class instead of reducing the size of the majority class, especially for smaller datasets.
you mean upsample instead of downsample? Could be a strategy or another brick :)
exactly, maybe we could pass the sampling strategy as an argument
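A rough sketch of what that argument could look like. The helper name and the strategy values are made up for illustration; upsampling here just repeats minority samples with replacement via sklearn.utils.resample, which is a real sklearn utility.

```python
import numpy as np
from sklearn.utils import resample

def balance(embeddings, labels_int, strategy="downsample", seed=42):
    """Hypothetical helper: balance a binary dataset with the given strategy."""
    majority_emb = embeddings[labels_int == 0]
    minority_emb = embeddings[labels_int == 1]

    if strategy == "downsample":
        # shrink the majority class to the minority size (no replacement)
        majority_emb = resample(
            majority_emb, replace=False, n_samples=len(minority_emb), random_state=seed
        )
    elif strategy == "upsample":
        # repeat minority samples (with replacement) up to the majority size
        minority_emb = resample(
            minority_emb, replace=True, n_samples=len(majority_emb), random_state=seed
        )

    embeddings_resampled = np.concatenate([majority_emb, minority_emb])
    labels_resampled = np.concatenate(
        [np.zeros(len(majority_emb), dtype=int), np.ones(len(minority_emb), dtype=int)]
    )
    return embeddings_resampled, labels_resampled

# toy data: 8 majority samples, 2 minority samples
X = np.arange(20).reshape(10, 2).astype(float)
y = np.array([0] * 8 + [1] * 2)

X_down, y_down = balance(X, y, strategy="downsample")
X_up, y_up = balance(X, y, strategy="upsample")
print(len(y_down), len(y_up))  # 4 16
```

Downsampling throws majority data away, while upsampling keeps it all but duplicates minority points, which can make the model overconfident on those exact samples.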
Great idea for a brick, Johannes! Perhaps a brick using SMOTE would also be a good idea, although I am not sure how and whether synthetic data works on text data.
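For what it's worth, SMOTE operates on the numeric feature vectors rather than the raw text, so it could in principle run on the embeddings; whether the interpolated points are meaningful embeddings is the open question. Below is a minimal numpy sketch of the core SMOTE idea only (interpolating between a minority point and one of its k nearest minority neighbors); the real, battle-tested implementation lives in imbalanced-learn.

```python
import numpy as np

def smote_like_oversample(minority_emb, n_new, k=3, seed=0):
    """Sketch of the SMOTE idea: synthesize minority points by interpolating
    between a sample and one of its k nearest minority neighbors.
    Not the full imbalanced-learn implementation."""
    rng = np.random.default_rng(seed)

    # pairwise distances within the minority class
    dists = np.linalg.norm(minority_emb[:, None] - minority_emb[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude each point from its own neighbors
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority_emb))  # pick a random minority sample
        j = rng.choice(neighbors[i])         # pick one of its neighbors
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(minority_emb[i] + gap * (minority_emb[j] - minority_emb[i]))
    return np.array(synthetic)

# toy minority class: 5 points in 4 dimensions
minority = np.random.default_rng(1).normal(size=(5, 4))
new_points = smote_like_oversample(minority, n_new=10)
print(new_points.shape)  # (10, 4)
```

Each synthetic point lies on the segment between two real minority embeddings, so it at least stays inside the minority region of the embedding space, even if it doesn't correspond to any actual text.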