LeetTopic save and use pre-computed embeddings, replace tf

The LeetTopic function accepts a new "embeddings" parameter for passing pre-computed embeddings. If no embeddings are passed, the default encoding process runs. Purpose: avoid re-encoding the documents on every run, since document encoding is the most time-consuming step of LeetTopic.
The embeddings are saved to a pickle file after they are computed (this should perhaps be controlled by a LeetTopic parameter such as "save_embeddings=True").
The "get_feature_names()" method of tfidf_vectorizer (scikit-learn) was deprecated in scikit-learn 1.0 and removed in 1.2; it is replaced by "get_feature_names_out()".
Removed a duplicate import line for SentenceTransformer.
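
If older scikit-learn versions still need to be supported, one option is a small fallback instead of a hard switch. This is only a sketch: the helper name feature_names is hypothetical and is not part of LeetTopic or scikit-learn. It relies only on the two method names mentioned above.

```python
def feature_names(vectorizer):
    """Return the vocabulary terms of a fitted vectorizer as a list.

    Hypothetical helper: prefers get_feature_names_out() and falls back to
    the older get_feature_names() when the newer method is missing.
    """
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())
```

Calling feature_names(tfidf_vectorizer) after fitting would then work on both sides of the rename.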

Usage of pre-computed embeddings:

import pickle

# Load the embeddings saved by a previous run
with open("embeddings.pickle", "rb") as fichier:
    precomputed_embeddings = pickle.load(fichier)

# Run LeetTopic with the pre-computed embeddings (skips document encoding)
leet_df, topic_data = leet_topic.LeetTopic(df,
                                          document_field="descriptions",
                                          html_filename="demo.html",
                                          spacy_model="fr_core_news_md",
                                          embeddings=precomputed_embeddings)

Feb 17 '23 13:02 EquinetPaul

Hi Paul, Thank you for the PR! Indeed I think being able to save and load the embeddings may be a useful part of this application. A couple things I'm thinking about, and I would also like to get @wjbmattingly's thoughts on this:

I wonder if np.save and np.load would be easier here rather than pickle. I think that np.save defaults to using pickle anyways and the embeddings are numpy arrays. Maybe there is a performance difference?
If we do this in either implementation, we should also probably put a parameter so that the user can name the embeddings file.
The newer scikit learn function name will be updated soon! Thanks for the reminder.

Feb 17 '23 18:02 joelsjlee

Hi, Oh yes, you are right about using numpy save/load since the output of the embedding is of type "numpy.ndarray".

import numpy as np

# Save
np.save(save_embeddings_file_name, doc_embeddings)

# Load
doc_embeddings = np.load(embeddings_file_name)
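
One detail to keep in mind with np.save: it appends ".npy" to the file name when the name does not already end with it, so the name passed to np.load has to include that suffix. A small round trip, using random data as a stand-in for real document embeddings and a temporary directory so nothing is left on disk:

```python
import os
import tempfile

import numpy as np

# Stand-in for real document embeddings: 4 documents, 8 dimensions.
doc_embeddings = np.random.rand(4, 8).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    requested = os.path.join(tmp_dir, "embeddings.save")
    np.save(requested, doc_embeddings)

    # np.save wrote "embeddings.save.npy", not "embeddings.save".
    written = requested + ".npy"
    saved_under_requested_name = os.path.exists(requested)
    saved_with_npy_suffix = os.path.exists(written)

    # Load using the name that was actually written.
    reloaded = np.load(written)
```

Using a default name that already ends in ".npy" would avoid the mismatch between the save and load names.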

Yes, the parameters to consider if we want to make the embedding parameterizable could be:

1. The embedding is passed as a parameter as a numpy.ndarray:

def LeetTopic(df: pd.DataFrame,
            ...
            embeddings = None,
            save_embeddings_file_name = "embeddings.save",
            ...
            ):

2. Or the filename of the embedding is passed as a parameter (and LeetTopic then needs to load it):

def LeetTopic(df: pd.DataFrame,
            ...
            embeddings_file_name = None,
            save_embeddings_file_name = "embeddings.save",
            ...
            ):

In any case:

The embedding is calculated if the variable embeddings:numpy.ndarray or embeddings_file_name:str is not passed as a parameter.
The embedding is saved to a file if the variable save_embeddings_file_name:str is passed as a parameter.
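
The two rules above could be sketched as a single helper that covers both options. Everything here is an assumption for illustration: resolve_embeddings and its encode argument are hypothetical names rather than existing LeetTopic code, and save_embeddings_file_name defaults to None so that nothing is written unless a name is given.

```python
import numpy as np


def resolve_embeddings(documents, encode, embeddings=None,
                       embeddings_file_name=None,
                       save_embeddings_file_name=None):
    """Return document embeddings, computing them only when none are supplied.

    Hypothetical helper: `encode` is any callable that maps a list of
    documents to an array-like with one row per document.
    """
    if embeddings is not None:
        # Option 1: the array itself was passed in.
        doc_embeddings = np.asarray(embeddings)
    elif embeddings_file_name is not None:
        # Option 2: a file name was passed in, so load the array from disk.
        doc_embeddings = np.load(embeddings_file_name)
    else:
        # Neither was passed: fall back to the default encoding step.
        doc_embeddings = np.asarray(encode(documents))

    if len(doc_embeddings) != len(documents):
        raise ValueError("expected one embedding row per document")

    if save_embeddings_file_name is not None:
        np.save(save_embeddings_file_name, doc_embeddings)
    return doc_embeddings
```

The row-count check is an extra safeguard not discussed above; it catches the case where a saved file belongs to a different DataFrame.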

Up to you :)

Feb 17 '23 18:02 EquinetPaul

save and use pre-computed embeddings, replace tf_idf vectorizer method