replacing new line characters

Open ptear opened this issue 2 years ago • 1 comment

Hi Maarten,

I was wondering why the UN dataset and the Trump dataset follow different procedures for replacing \n characters (see https://github.com/MaartenGr/BERTopic_evaluation/blob/main/evaluation/data.py#L227).

I guess it has something to do with the UN documents being longer, since they come from debates rather than short-form tweets. But what is the benefit of marking new paragraphs with \p rather than just a space?
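
To make the comparison concrete, here is a minimal sketch of the two strategies as I understand them. It is an illustration only, not the code from data.py: the function names, the sample text, and the literal `\p` marker string are all assumptions.

```python
from typing import List

# Hypothetical helpers for illustration; these are NOT taken from data.py.


def newlines_to_spaces(doc: str) -> str:
    """Flatten a document: every newline becomes a single space."""
    return doc.replace("\n", " ")


def newlines_to_paragraph_markers(doc: str, marker: str = "\\p") -> str:
    """Replace every newline with a marker so paragraph boundaries survive."""
    return doc.replace("\n", marker)


def split_paragraphs(doc: str, marker: str = "\\p") -> List[str]:
    """Recover the paragraphs from a marked document, dropping empty pieces."""
    return [part.strip() for part in doc.split(marker) if part.strip()]


# Made-up sample text standing in for a long, multi-paragraph document.
speech = "First paragraph of a debate.\nSecond paragraph of a debate."

flat = newlines_to_spaces(speech)               # boundaries are lost
marked = newlines_to_paragraph_markers(speech)  # boundaries are kept
paragraphs = split_paragraphs(marked)           # can be split again later
```

With a space, the paragraph boundaries are gone for good. With a marker, a long document could still be split into paragraphs afterwards, which might matter for long debate transcripts. Whether the repository actually uses the marker that way is a guess on my part.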

Thanks for your efforts on BERTopic.

Aug 20 '23 15:08 ptear

It has been a while since I wrote that code, but I remember there were issues parsing that specific dataset, which required removing the \n characters. It may indeed also be related to the length of the documents, since sentence-transformers was used as the backend here.

I should note, though, that BERTopic has improved considerably since this was written. Combining BERTopic with the MMR, KeyBERTInspired, or PartOfSpeech representation models generally improves coherence scores quite a bit. So if you are looking to reproduce the results, it might be interesting to see what happens when you use one or more of these representation models.

Using a generative LLM is especially interesting and fun, but it does not allow for evaluation with coherence-like measures.

Aug 21 '23 07:08 MaartenGr