save and use pre-computed embeddings, replace tf_idf vectorizer method
-
The LeetTopic class can take a parameter "embeddings" that can be used if you want to use a pre-computed embedding. If no embedding is passed as a parameter, the default encoding process is followed. Purpose: significantly improve the performance of LeetTopic, knowing that the document encoding is the most time-consuming part
-
Save the embeddings to a pickle object file after calculating them (this functionality should maybe be passed as a parameter of the LeetTopic class like "save_embeddings=True")
-
The "get_feature_names()" method of tfidf_vectorizer (scikit-learn) is deprecated (scikit-learn>=1.2) and should be replaced by "get_feature_names_out()".
-
Simply removed an import line that appeared twice for SentenceTransformer.
Usage of pre-computed embeddings:
# Load embeddings
with open("embeddings.pickle", "rb") as fichier:
precomputed_embeddings = pickle.load(fichier)
# LeetTopic
leet_df, topic_data = leet_topic.LeetTopic(df,
document_field="descriptions",
html_filename="demo.html",
spacy_model="fr_core_news_md",
embeddings = precomputed_embeddings )
Hi Paul, Thank you for the PR! Indeed I think being able to save and load the embeddings may be a useful part of this application. A couple things I'm thinking about, and I would also like to get @wjbmattingly's thoughts on this:
- I wonder if np.save and np.load would be easier here rather than pickle. I think that np.save defaults to using pickle anyways and the embeddings are numpy arrays. Maybe there is a performance difference?
- If we do this in either implementation, we should also probably put a parameter so that the user can name the embeddings file.
- The newer scikit learn function name will be updated soon! Thanks for the reminder.
Hi, Oh yes, you are right about using numpy save/load since the output of the embedding is of type "numpy.ndarray".
# Save
np.save(save_embeddings_file_name , doc_embeddings)
# Load
doc_embeddings= np.load(embeddings_file_name)
Yes, the parameters to consider if we want to make the embedding parameterizable could be:
- if the embedding is passed as a parameter as a numpy.ndarray.
def LeetTopic(df: pd.DataFrame,
...
embeddings = None,
save_embeddings_file_name = "embeddings.save",
...
):
or 2. If the filename of the embedding is passed as a parameter (and then it needs to load it)
def LeetTopic(df: pd.DataFrame,
...
embeddings_file_name = None,
save_embeddings_file_name = "embeddings.save",
...
):
In any case:
- The embedding is calculated if the variable embeddings:numpy.ndarray or embeddings_file_name:str is not passed as a parameter.
- The embedding is saved to a file if the variable save_embeddings_file_name:str is passed as a parameter.
Up to you :)