Transformer Embedding not working in Indonesian
I don't know whether this is a bug or I just don't understand something yet, but when I use cahya/bert-base-indonesian-522M it throws an error:
embedding_model = pipeline("feature-extraction", model="cahya/bert-base-indonesian-522M")
topic_model = BERTopic(
embedding_model=embedding_model, # Step 1 - Extract embeddings
umap_model=umap_model, # Step 2 - Reduce dimensionality
hdbscan_model=hdbscan_model, # Step 3 - Cluster reduced embeddings
vectorizer_model=vectorizer_model, # Step 4 - Tokenize topics
ctfidf_model=ctfidf_model, # Step 5 - Extract topic words
diversity=0.5 # Step 6 - Diversify topic words
)
topic_model.fit_transform(df4)
topic_model.visualize_topics()
ValueError Traceback (most recent call last)
<ipython-input-17-91cc4b963168> in <module>
----> 1 topic_model.visualize_topics()
6 frames
/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py in _amax(a, axis, out, keepdims, initial, where)
38 def _amax(a, axis=None, out=None, keepdims=False,
39 initial=_NoValue, where=True):
---> 40 return umr_maximum(a, axis, None, out, keepdims, initial, where)
41
42 def _amin(a, axis=None, out=None, keepdims=False,
ValueError: zero-size array to reduction operation maximum which has no identity
But when I use SentenceTransformer, it works totally fine. Any clue why this happens? I also provided a Google Colab notebook for a better explanation. Thanks.
#Change the embedding to sentence
sentence_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
Environment info: Google Colab
I am not entirely sure why this is happening, as it seems to work well with other transformer-based models. As a workaround, it might be worthwhile to compute the embeddings beforehand using the pipeline and pass those to BERTopic. Do note, though, that the model you are using is not optimized for semantic similarity and that a SentenceTransformer model, although multilingual, might outperform it.
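To make the "compute the embeddings beforehand" suggestion concrete, here is a minimal sketch. It assumes that `docs` is a plain list of strings (a hypothetical name, not from the original post), that mean pooling over tokens is an acceptable way to turn the pipeline's per-token features into one vector per document, and that the installed transformers version accepts `truncation=True` when calling a feature-extraction pipeline. The helper names `mean_pool` and `embed_documents` are made up for illustration.

```python
def mean_pool(token_vectors):
    """Average a list of per-token vectors into a single document vector."""
    n_tokens = len(token_vectors)
    return [sum(dim) / n_tokens for dim in zip(*token_vectors)]


def embed_documents(docs, model_name="cahya/bert-base-indonesian-522M"):
    """Return one mean-pooled embedding per document as a 2D NumPy array.

    Imports are kept inside the function so that the pooling helper above
    can be used without transformers installed.
    """
    import numpy as np
    from transformers import pipeline

    extractor = pipeline("feature-extraction", model=model_name)
    vectors = []
    for doc in docs:
        # For a single string the pipeline returns nested lists shaped
        # [1, n_tokens, hidden_size]; truncation avoids errors on long texts
        # (assumption: this keyword is supported by your transformers version).
        features = extractor(doc, truncation=True)
        vectors.append(mean_pool(features[0]))
    return np.array(vectors)


# Usage (not run here, downloads the model):
# embeddings = embed_documents(docs)
# topic_model = BERTopic()
# topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```

Passing `embeddings=` to `fit_transform` means BERTopic skips its own embedding step, which should help narrow down whether the error comes from the embedding backend or from a later stage.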
What might also be a nifty trick is to use that model within SentenceTransformers:
embedding_model = SentenceTransformer("cahya/bert-base-indonesian-522M")
topic_model = BERTopic(embedding_model=embedding_model)
I just tested it quickly and it seems to be working well.
Hi, I am a little bit confused about "Do note though that the model you are using is not optimized for semantic similarity and that a SentenceTransformer model": is this talking about the BERT model cahya/bert-base-indonesian-522M, which we cannot use since it is not meant for feature extraction, or about paraphrase-multilingual-MiniLM-L12-v2?
Last question, about "a SentenceTransformer model, although multilingual, might outperform it": since not many transformers are trained on Indonesian, how can I find a supported model on Hugging Face or Sentence Transformers that can be used in BERTopic? Can we use any transformer model available there, or does a model need to meet specific requirements to run? Thanks for your hard work.
> Hi, I am a little bit confused about "Do note though that the model you are using is not optimized for semantic similarity and that a SentenceTransformer model": is this talking about the BERT model cahya/bert-base-indonesian-522M, which we cannot use since it is not meant for feature extraction, or about paraphrase-multilingual-MiniLM-L12-v2?
Indeed, the cahya/bert-base-indonesian-522M model was not optimized for semantic similarity and since we want to cluster the documents, it definitely helps if the model was focused on that. In contrast, the paraphrase-multilingual-MiniLM-L12-v2 model is optimized for that purpose and generally works rather well.
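If you want a rough feel for what "optimized for semantic similarity" means in practice, one option is to compare cosine similarities between sentence embeddings from each model. The sketch below is only an illustration: `cosine_similarity` is a hypothetical helper written here in plain Python, and the vectors you feed it are assumed to come from whichever embedding model you are testing.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Usage (not run here, requires sentence-transformers and a model download):
# model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# a, b, c = model.encode([paraphrase_1, paraphrase_2, unrelated_sentence])
# print(cosine_similarity(a, b), cosine_similarity(a, c))
```

With a model tuned for similarity, you would expect the two paraphrases to score clearly higher than the unrelated pair; if the scores are all close together, the clustering step has little to work with.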
> Last question, about "a SentenceTransformer model, although multilingual, might outperform it": since not many transformers are trained on Indonesian, how can I find a supported model on Hugging Face or Sentence Transformers that can be used in BERTopic? Can we use any transformer model available there, or does a model need to meet specific requirements to run? Thanks for your hard work.
I am not entirely sure but I believe the paraphrase-multilingual-MiniLM-L12-v2 also works for Indonesian texts, so you can use that instead.
Due to inactivity, I'll be closing this issue for now. Feel free to reach out if you want to continue this discussion or re-open the issue!