Potential Bug for vectorizer_model

Open yuanjames opened this issue 1 year ago • 3 comments

Hi,

Again, thanks for your amazing work on BERTopic. I have used BERTopic in many research projects.

I recently noticed what may be a tricky bug with vectorizer_model. After checking the code, I found that when I pass a customised vectorizer_model into BERTopic, the n_gram_range defined in the BERTopic class is not passed on to it. Instead, we need to pass the arguments directly to the vectorizer_model when it is created.

yuanjames avatar Mar 10 '24 16:03 yuanjames

Thanks for sharing! This is actually not a bug but by design. The underlying idea is that users not familiar with the CountVectorizer can directly use the n_gram_range parameter. However, when you use vectorizer_model, it should override n_gram_range since you are creating your own custom vectorizer model. Other parameters related to that should have no effect.

In other words, either you use the n_gram_range parameter directly from BERTopic or via vectorizer_model but never both.
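
A minimal sketch of that precedence rule, in case it helps. The classes `SketchVectorizer` and `SketchTopicModel` are hypothetical stand-ins written for illustration only; they are not BERTopic's or scikit-learn's actual code, and they model nothing beyond the n-gram setting.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SketchVectorizer:
    """Hypothetical stand-in for a CountVectorizer; only ngram_range is modeled."""
    ngram_range: Tuple[int, int] = (1, 1)


class SketchTopicModel:
    """Hypothetical stand-in for the constructor behavior described above."""

    def __init__(
        self,
        n_gram_range: Tuple[int, int] = (1, 1),
        vectorizer_model: Optional[SketchVectorizer] = None,
    ) -> None:
        self.n_gram_range = n_gram_range
        if vectorizer_model is not None:
            # A custom vectorizer is used as-is; n_gram_range is not copied into it.
            self.vectorizer_model = vectorizer_model
        else:
            # Without a custom vectorizer, n_gram_range configures the default one.
            self.vectorizer_model = SketchVectorizer(ngram_range=n_gram_range)


# Option 1: rely on n_gram_range only.
simple = SketchTopicModel(n_gram_range=(1, 2))

# Option 2: set the range on the custom vectorizer itself.
custom = SketchTopicModel(vectorizer_model=SketchVectorizer(ngram_range=(1, 3)))

# Mixing both: the custom vectorizer's own setting is the one that applies.
mixed = SketchTopicModel(n_gram_range=(1, 2), vectorizer_model=SketchVectorizer())

print(simple.vectorizer_model.ngram_range)
print(custom.vectorizer_model.ngram_range)
print(mixed.vectorizer_model.ngram_range)
```

With the real library, the two options correspond to `BERTopic(n_gram_range=(1, 2))` and `BERTopic(vectorizer_model=CountVectorizer(ngram_range=(1, 2)))`.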

MaartenGr avatar Mar 10 '24 16:03 MaartenGr

Okay, I see, that's good. Thanks for your always instant replies. Cool design! You may consider adding a note about this to the docs. Thanks again.

yuanjames avatar Mar 10 '24 16:03 yuanjames

No problem! It's actually already there 😉

https://github.com/MaartenGr/BERTopic/blob/8985f26d4ee89b4c512ff9da22a61371c20605b8/bertopic/_bertopic.py#L155-L159

MaartenGr avatar Mar 10 '24 17:03 MaartenGr