Is inference on a single document / string possible?
I need to extract topics from news articles. I naively tried to run BERTopic on just one article but am getting the following error:

```
ValueError: Transform unavailable when model was fit with only a single data sample.
```

(I am working with the Sentence Transformers model 'sentence-transformers/paraphrase-MiniLM-L3-v2' and spaCy 3.2.)
As these are arbitrary news articles, I don't want to train a new model. Are there any "all-rounder" models available that could be used for this task?
Is inference with BERTopic possible without training?
> I need to extract topics from news articles. I naively tried to run BERTopic on just one article but am getting the following error: `ValueError: Transform unavailable when model was fit with only a single data sample.`
Could you share the entire log of the error? Right now it is difficult to see where the issue stems from. Also, could you share the full code you use to create your BERTopic model?
> Is inference with BERTopic possible without training?
No. The difficulty here is that there are easily millions of topics that could theoretically exist, and a model that detects all of them without any training is not possible within BERTopic. If you have some idea of the topics that might exist in your data, then something like zero-shot topic modeling might be worthwhile to look into. Similarly, zero-shot text classification might also work well for your purpose. Other than that, you could look at keyword extraction if you want to describe each document with a set of representative terms; methods like KeyBERT and YAKE work quite well for that purpose.
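For illustration, here is a minimal sketch of both suggestions, assuming the Hugging Face `transformers` and `keybert` packages are installed; the BART-MNLI model and the candidate labels are example choices, not anything prescribed by BERTopic:

```python
from transformers import pipeline
from keybert import KeyBERT

article = "Scientists are using CRISPR gene editing to treat inherited diseases ..."

# Zero-shot text classification: score the article against candidate topics.
# The label set here is a hypothetical example; pick labels that fit your domain.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(article, candidate_labels=["science", "politics", "sports", "business", "technology"], multi_label=True)
print(result["labels"][0], result["scores"][0])  # best-matching label and its score

# Keyword extraction with KeyBERT works on a single document out of the box.
kw_model = KeyBERT(model="sentence-transformers/paraphrase-MiniLM-L3-v2")
print(kw_model.extract_keywords(article, keyphrase_ngram_range=(1, 2), top_n=5))
```

Both approaches work on a single document, since neither needs to fit a clustering model on a corpus first.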
@MaartenGr thanks for your answer. Here is the full code and the stack trace:

```python
import os
import sys

sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

from findall.settings import (ENV, logger)


def extract_topics(text, model=SentenceTransformer(ENV('SENTENCE_TRANSFORMER')), language='en'):
    topic_model = BERTopic(language=language, embedding_model=model)
    topics, probs = topic_model.fit_transform([text])
    print(topic_model.get_topic_info())
    print(topic_model.get_topic(0))
    # get all words in topic 0
    words = [word for word, _ in topic_model.get_topic(0)]
    print('words', words)
    return words
```
```
Error in extract_topics: Transform unavailable when model was fit with only a single data sample. in article: https://www.nytimes.com/2022/09/12/learning/lesson-plans/explore-how-crispr-is-revolutionizing-science.html
Traceback (most recent call last):
  File "/home/fabmeyer/Dev/Python/analytics/analysis/TiingoAnalytics/TiingoNews_test_fabian_mongodb.py", line 107, in <module>
    article_dict['topics'] = extract_topics(article_dict['url_content'])
  File "/home/fabmeyer/Dev/Python/analytics/analysis/utils/analyse/ExtractTopics.py", line 13, in extract_topics
    topics, probs = topic_model.fit_transform([text])
  File "/home/fabmeyer/.local/share/virtualenvs/analytics-nko8goOs/lib/python3.9/site-packages/bertopic/_bertopic.py", line 289, in fit_transform
    umap_embeddings = self._reduce_dimensionality(embeddings, y)
  File "/home/fabmeyer/.local/share/virtualenvs/analytics-nko8goOs/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1368, in _reduce_dimensionality
    umap_embeddings = self.umap_model.transform(embeddings)
  File "/home/fabmeyer/.local/share/virtualenvs/analytics-nko8goOs/lib/python3.9/site-packages/umap/umap_.py", line 2803, in transform
    raise ValueError(
ValueError: Transform unavailable when model was fit with only a single data sample.
```
Yes, as training is not feasible at the moment, I need a pre-trained "all-rounder" model to solve this task. I will look into zero-shot text classification.
@fabmeyer Yes, it indeed seems that zero-shot text classification would work best for your use case. Based on the error you get, UMAP does not support fitting on a single data sample, since it would have no reference points to learn from.
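For completeness, a minimal sketch of how single-document inference does become possible once BERTopic has been fitted on a larger corpus; the 20 Newsgroups data below is just a stand-in for any reference collection of news articles:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Fit once on a reference corpus so that UMAP (and HDBSCAN) have enough
# samples to learn from; this is the training step that cannot be skipped.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))["data"]
topic_model = BERTopic(language="english").fit(docs)

# After fitting, transform() accepts a single new document.
topics, probs = topic_model.transform(["Scientists are using CRISPR gene editing to rewrite genomes ..."])
print(topics[0])  # predicted topic id (-1 means outlier)
```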
Due to inactivity, I'll be closing this issue for now. Feel free to reach out if you want to continue this discussion or re-open the issue!