Saving predictions

Open econinomista opened this issue 2 years ago • 3 comments

Hi Maarten, I tried applying the transform function after training a model. So, I train the model and afterwards, I load my whole dataframe and preprocess the texts to make the predictions:

df = pd.read_csv(file_path)
df = df.loc[df["Text_cleaned"].apply(lambda x: isinstance(x, str))]
df.dropna(subset=["Text_cleaned"], inplace=True)
df["Text_cleaned"] = df["Text_cleaned"].astype(str)
texts = df['Text_cleaned'].tolist()
topics, probs = topic_model.transform(texts)

How can I now store the results in my df and export it as a csv file again?

All the best and thank you very much, Nikola

econinomista avatar Jan 30 '24 09:01 econinomista

You would have to do that manually. For instance, something like this:

df["topics"] = topics
df["probs"] = probs
df.to_csv("my_file.csv")

MaartenGr avatar Jan 30 '24 10:01 MaartenGr

Thank you very much for the response. I tried something similar, but it does not work for me. It yields:

Cell In[2], line 22
    df["probs"] = probs

File ~\Documents\Python\envs\erniebert\lib\site-packages\pandas\core\frame.py:3980 in __setitem__
    self._set_item(key, value)

File ~\Documents\Python\envs\erniebert\lib\site-packages\pandas\core\frame.py:4187 in _set_item
    self._set_item_mgr(key, value)

File ~\Documents\Python\envs\erniebert\lib\site-packages\pandas\core\frame.py:4144 in _set_item_mgr
    self._mgr.insert(len(self._info_axis), key, value)

File ~\Documents\Python\envs\erniebert\lib\site-packages\pandas\core\internals\managers.py:1410 in insert
    raise ValueError(

ValueError: Expected a 1D array, got an array with shape (38, 7)

econinomista avatar Jan 31 '24 09:01 econinomista

Ah right, you will need to take the maximum value in each row of the probs variable to get a single probability for each document. Do note that the probabilities of the -1 category are generally calculated as 1 - sum(probs). If you want to save the entire probability matrix, then you could also just save the topic model! Since probabilities are also saved as topic_model.probabilities_, you can access them when loading the model. See more about this here: https://maartengr.github.io/BERTopic/getting_started/serialization/serialization.html
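
For reference, a minimal sketch of the row-wise maximum described above. It assumes probs is a 2D NumPy array with one row per document and one column per topic, as the reported shape (38, 7) suggests. The small df, topics, and probs below are made-up stand-ins for the real data, and the prob_topic_i column names are only illustrative (they assume column i corresponds to topic i).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real dataframe and for the output of
# topic_model.transform(texts): 4 documents, 3 topics.
df = pd.DataFrame({"Text_cleaned": ["a", "b", "c", "d"]})
topics = [0, 2, -1, 1]
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.1],
    [0.3, 0.6, 0.1],
])

df["topic"] = topics
# A single value per document: the largest probability in each row.
df["prob"] = probs.max(axis=1)

# Optional: keep the full matrix as one column per topic.
# Reusing df.index keeps rows aligned even if df was filtered beforehand.
prob_cols = [f"prob_topic_{i}" for i in range(probs.shape[1])]
probs_df = pd.DataFrame(probs, columns=prob_cols, index=df.index)
df = pd.concat([df, probs_df], axis=1)

df.to_csv("my_file.csv", index=False)
```

Passing index=df.index matters here because the dataframe in the question was filtered with df.loc, so its index may no longer run from 0 to n-1.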

MaartenGr avatar Jan 31 '24 16:01 MaartenGr