Transformer Embedding not working in Indonesian

Open dwissaaj opened this issue 3 years ago • 4 comments

I don't know whether this is a bug or I just don't understand it yet, but when I use cahya/bert-base-indonesian-522M, it throws an error:

from transformers import pipeline
from bertopic import BERTopic

embedding_model = pipeline("feature-extraction", model="cahya/bert-base-indonesian-522M")
topic_model = BERTopic(
  embedding_model=embedding_model,    # Step 1 - Extract embeddings
  umap_model=umap_model,              # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,        # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,  # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,          # Step 5 - Extract topic words
  diversity=0.5                       # Step 6 - Diversify topic words
)
topic_model.fit_transform(df4)
topic_model.visualize_topics()
ValueError                                Traceback (most recent call last)
<ipython-input-17-91cc4b963168> in <module>
----> 1 topic_model.visualize_topics()

6 frames
/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py in _amax(a, axis, out, keepdims, initial, where)
     38 def _amax(a, axis=None, out=None, keepdims=False,
     39           initial=_NoValue, where=True):
---> 40     return umr_maximum(a, axis, None, out, keepdims, initial, where)
     41 
     42 def _amin(a, axis=None, out=None, keepdims=False,

ValueError: zero-size array to reduction operation maximum which has no identity

However, when I use SentenceTransformer, it works fine. Any clue why this happens? I have also provided a Google Colab notebook for a better explanation. Thanks.

# Switch the embedding model to a SentenceTransformer
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

Environment info: Google Colab

dwissaaj avatar Sep 18 '22 08:09 dwissaaj

I am not entirely sure why this is happening. It seems to be working well with other transformer-based models. Instead, it might be worthwhile to compute the embeddings beforehand using the pipeline and pass those to BERTopic. Do note though that the model you are using is not optimized for semantic similarity and that a SentenceTransformer model, although multilingual, might outperform it.
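
A minimal sketch of precomputing the embeddings, under a few assumptions: the feature-extraction pipeline returns token-level vectors shaped [1, n_tokens, hidden_size] for a single string, and averaging them (mean pooling, special tokens included) is an acceptable way to get one vector per document. The helper names `mean_pool`, `embed_documents`, and `fit_with_precomputed` are made up for illustration and are not part of BERTopic or transformers.

```python
import numpy as np


def mean_pool(token_features):
    """Average the token vectors of one document into a single vector.

    Assumes `token_features` is a nested list shaped
    [1, n_tokens, hidden_size], as the feature-extraction pipeline
    is expected to return for one input string.
    """
    arr = np.asarray(token_features, dtype=np.float32)
    # Flatten the leading batch dimension, then average over tokens.
    return arr.reshape(-1, arr.shape[-1]).mean(axis=0)


def embed_documents(docs, extractor):
    """Build an (n_docs, hidden_size) array, one pooled vector per document."""
    return np.vstack([mean_pool(extractor(doc)) for doc in docs])


def fit_with_precomputed(docs):
    """Hypothetical helper: embed first, then hand the array to BERTopic."""
    # Imported lazily so the pooling helpers above work without these packages.
    from transformers import pipeline
    from bertopic import BERTopic

    extractor = pipeline(
        "feature-extraction", model="cahya/bert-base-indonesian-522M"
    )
    embeddings = embed_documents(docs, extractor)

    topic_model = BERTopic()
    # Passing `embeddings` makes BERTopic skip its own embedding step.
    topics, probs = topic_model.fit_transform(docs, embeddings)
    return topic_model, topics
```

Here `docs` is expected to be a plain list of strings. Documents longer than the model's maximum sequence length would need truncating or splitting first, which this sketch does not handle.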

MaartenGr avatar Sep 20 '22 07:09 MaartenGr

What might also be a nifty trick is to use that model within SentenceTransformers:

embedding_model = SentenceTransformer("cahya/bert-base-indonesian-522M")
topic_model = BERTopic(embedding_model=embedding_model)

I just tested it quickly and it seems to be working well.

MaartenGr avatar Sep 20 '22 07:09 MaartenGr

Hi, I am a little confused about "Do note though that the model you are using is not optimized for semantic similarity and that a SentenceTransformer model". Does this refer to the BERT model cahya/bert-base-indonesian-522M, which we cannot use because it is not meant for feature extraction, or to paraphrase-multilingual-MiniLM-L12-v2? My last question is about "SentenceTransformer model, although multilingual, might outperform it". Since not many transformers are trained on Indonesian, how can I find a supported model on Hugging Face or Sentence Transformers that can be used in BERTopic? Can we use any transformer model available there, or does a model need to meet specific requirements to run? Thanks for your hard work.

dwissaaj avatar Sep 21 '22 00:09 dwissaaj

> Hi, I am a little confused about "Do note though that the model you are using is not optimized for semantic similarity and that a SentenceTransformer model". Does this refer to the BERT model cahya/bert-base-indonesian-522M, which we cannot use because it is not meant for feature extraction, or to paraphrase-multilingual-MiniLM-L12-v2?

Indeed, the cahya/bert-base-indonesian-522M model was not optimized for semantic similarity, and since we want to cluster the documents, it definitely helps if the model was trained with that in mind. In contrast, the paraphrase-multilingual-MiniLM-L12-v2 model is optimized for that purpose and generally works rather well.

> My last question is about "SentenceTransformer model, although multilingual, might outperform it". Since not many transformers are trained on Indonesian, how can I find a supported model on Hugging Face or Sentence Transformers that can be used in BERTopic? Can we use any transformer model available there, or does a model need to meet specific requirements to run? Thanks for your hard work.

I am not entirely sure, but I believe paraphrase-multilingual-MiniLM-L12-v2 also works for Indonesian texts, so you can use that instead.

MaartenGr avatar Sep 21 '22 15:09 MaartenGr

Due to inactivity, I'll be closing this issue for now. Feel free to reach out if you want to continue this discussion or re-open the issue!

MaartenGr avatar Jan 09 '23 12:01 MaartenGr