LeetTopic icon indicating copy to clipboard operation
LeetTopic copied to clipboard

save and use pre-computed embeddings, replace tf_idf vectorizer method

Open EquinetPaul opened this issue 3 years ago • 2 comments

  1. The LeetTopic class can take a parameter "embeddings" that can be used if you want to use a pre-computed embedding. If no embedding is passed as a parameter, the default encoding process is followed. Purpose: significantly improve the performance of LeetTopic, knowing that the document encoding is the most time-consuming part

  2. Save the embeddings to a pickle object file after calculating them (this functionality should maybe be passed as a parameter of the LeetTopic class like "save_embeddings=True")

  3. The "get_feature_names()" method of tfidf_vectorizer (scikit-learn) is deprecated (scikit-learn>=1.2) and should be replaced by "get_feature_names_out()".

  4. Simply removed an import line that appeared twice for SentenceTransformer.

Usage of pre-computed embeddings:

# Load embeddings
with open("embeddings.pickle", "rb") as fichier:
    precomputed_embeddings = pickle.load(fichier)

# LeetTopic
leet_df, topic_data = leet_topic.LeetTopic(df,
                                          document_field="descriptions",
                                          html_filename="demo.html",
                                          spacy_model="fr_core_news_md",
                                          embeddings = precomputed_embeddings )

EquinetPaul avatar Feb 17 '23 13:02 EquinetPaul

Hi Paul, Thank you for the PR! Indeed I think being able to save and load the embeddings may be a useful part of this application. A couple things I'm thinking about, and I would also like to get @wjbmattingly's thoughts on this:

  1. I wonder if np.save and np.load would be easier here rather than pickle. I think that np.save defaults to using pickle anyways and the embeddings are numpy arrays. Maybe there is a performance difference?
  2. If we do this in either implementation, we should also probably put a parameter so that the user can name the embeddings file.
  3. The newer scikit learn function name will be updated soon! Thanks for the reminder.

joelsjlee avatar Feb 17 '23 18:02 joelsjlee

Hi, Oh yes, you are right about using numpy save/load since the output of the embedding is of type "numpy.ndarray".

# Save 
np.save(save_embeddings_file_name , doc_embeddings)

# Load 
doc_embeddings= np.load(embeddings_file_name)

Yes, the parameters to consider if we want to make the embedding parameterizable could be:

  1. if the embedding is passed as a parameter as a numpy.ndarray.
def LeetTopic(df: pd.DataFrame,
            ...
            embeddings = None,
            save_embeddings_file_name = "embeddings.save",
            ...
            ):

or 2. If the filename of the embedding is passed as a parameter (and then it needs to load it)

def LeetTopic(df: pd.DataFrame,
            ...
            embeddings_file_name = None,
            save_embeddings_file_name = "embeddings.save",
            ...
            ):

In any case:

  • The embedding is calculated if the variable embeddings:numpy.ndarray or embeddings_file_name:str is not passed as a parameter.
  • The embedding is saved to a file if the variable save_embeddings_file_name:str is passed as a parameter.

Up to you :)

EquinetPaul avatar Feb 17 '23 18:02 EquinetPaul