
Word cloud is broken

Open vitamins opened this issue 5 years ago • 1 comment

Describe the bug

The word cloud shows differently sized words, even though every word is unique and only occurs once.

Python

import pandas as pd
from dataprep.eda import plot

df = pd.DataFrame({'name': [str(i) for i in range(10)]})
rep = plot(df, 'name')

Expected behavior

Words with the same frequency should have the same size.

Screenshots

[Screenshot: wordCloud]

Desktop

  • OS: Windows 10
  • Platform: Windows Powershell
  • Platform Version [e.g. 1.0]
  • Python Version Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:37:02) [MSC v.1924 64 bit (AMD64)] on win32
  • Dataprep Version: dataprep-0.2.11-py3-none-any.whl

vitamins, Aug 26 '20 06:08

Hi @vitamins. Thanks for creating this issue! We use the wordcloud library to create the word cloud, and this behavior is apparently part of its layout algorithm: https://github.com/amueller/word_cloud/issues/285. The solution suggested there is to specify max_font_size, which does work:

[Screenshot: Screen Shot 2020-09-01 at 11 20 59 PM]

However, it is not easy to determine the optimal max_font_size for an arbitrary word cloud. The max_font_size used for the plot above does not work for longer strings:

[Screenshot: Screen Shot 2020-09-01 at 11 24 57 PM]

I think the optimal max_font_size is a function of the number of words and of the word lengths, which we could try to determine by trial and error. This could be particularly difficult because some characters are wider than others. Note that we also show the word frequency bar chart, which gives a more accurate reading of each word's frequency.

brandonlockhart, Sep 02 '20 06:09
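
For readers who want to try the max_font_size workaround mentioned above directly against the wordcloud library, here is a minimal sketch. It is not dataprep's implementation: the helper names (word_frequencies, build_capped_cloud, font_sizes_used), the 400x200 canvas, and the cap of 40 are all assumptions chosen for illustration, and the wordcloud package must be installed separately.

```python
from collections import Counter


def word_frequencies(values):
    """Count whitespace-separated tokens across an iterable of strings."""
    counts = Counter()
    for value in values:
        counts.update(str(value).split())
    return dict(counts)


def build_capped_cloud(frequencies, max_font_size=40):
    """Lay out a word cloud whose largest font is capped at max_font_size.

    Assumes the third-party `wordcloud` package is installed. The default
    of 40 is an arbitrary illustrative value, not a recommended setting.
    """
    from wordcloud import WordCloud  # imported lazily: optional dependency

    cloud = WordCloud(width=400, height=200, max_font_size=max_font_size)
    return cloud.generate_from_frequencies(frequencies)


def font_sizes_used(cloud):
    """Return the set of font sizes assigned in the fitted layout."""
    # Each layout_ entry is ((word, freq), font_size, position, orientation, color).
    return {entry[1] for entry in cloud.layout_}


# Same data as the bug report: ten unique words, each occurring once.
frequencies = word_frequencies(str(i) for i in range(10))
```

Calling font_sizes_used(build_capped_cloud(frequencies)) then shows whether the layout ended up with a single font size. Whether it does depends on the canvas size and the word lengths, which is the difficulty described in the comment above: a cap that works for short tokens may still force longer strings to shrink.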