Modified get_stop_words() to return a copy, preventing the stop-word list from being changed from outside.
Dear Alir3z4,
I used this repo for work at my previous company, and I found one issue with the function get_stop_words():
if we store the returned list in a variable and modify that variable, like:
en_stop_words = get_stop_words('en')
en_stop_words.append('harrypotter')
then the list returned by get_stop_words() is also changed:
'harrypotter' in get_stop_words('en') # True
This leads to wrong results when get_stop_words('en') is called repeatedly, because every later call sees the modification, like:
en_stop_words_again = get_stop_words('en')
'harrypotter' in en_stop_words_again # True
To work around this, the user can of course call copy.deepcopy(get_stop_words('en')), but users may not realize this is necessary.
Therefore I added a copy inside get_stop_words(), namely:
replacing:
return stop_words
by:
return stop_words[:]
and as a result:
en_stop_words = get_stop_words('en')
en_stop_words.append('harrypotter')
en_stop_words_again = get_stop_words('en')
'harrypotter' in en_stop_words # True
'harrypotter' in get_stop_words('en') # False
'harrypotter' in en_stop_words_again # False
I have also tested the performance before and after the change, see:
- before: https://github.com/yyanhan/python-stop-words/blob/example/test_before.ipynb
- after: https://github.com/yyanhan/python-stop-words/blob/example/test_after.ipynb
I hope this PR can make it better!
Best,
Han