scispacy
Clean up vocab creation
This script is now several steps removed from the original corpus. It might be better to replace it with a script that reads a large corpus and creates the vocabularies directly. At the moment we first create an intermediate file containing the word/doc counts, and then this script generates a vocabulary file that differs from it only in how it is filtered.
Originally posted by @DeNeutoy in https://github.com/allenai/scispacy/pull/295#discussion_r558664660
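
A rough sketch of what a direct corpus-to-vocabulary step could look like, with counting and filtering done in one pass and no intermediate counts file. This is illustrative only and is not the existing scispacy script: the function name `build_vocab`, the thresholds `min_word_count` and `min_doc_count`, the whitespace tokenization, and the one-document-per-string input are all assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def build_vocab(
    documents: Iterable[str],
    min_word_count: int = 2,
    min_doc_count: int = 1,
) -> List[Tuple[str, int, int]]:
    """Hypothetical helper: count and filter words straight from a corpus.

    Returns (word, word_count, doc_count) tuples, most frequent first.
    """
    word_counts: Counter = Counter()
    doc_counts: Counter = Counter()

    for document in documents:
        # Assumption: whitespace tokenization stands in for a real tokenizer.
        tokens = document.split()
        word_counts.update(tokens)
        # Each word contributes at most once per document to the doc count.
        doc_counts.update(set(tokens))

    vocab = [
        (word, count, doc_counts[word])
        for word, count in word_counts.items()
        if count >= min_word_count and doc_counts[word] >= min_doc_count
    ]
    # Sort by descending word count, breaking ties alphabetically.
    vocab.sort(key=lambda entry: (-entry[1], entry[0]))
    return vocab
```

For a large corpus, `documents` could be a generator that yields one document at a time from disk, so the whole corpus never has to be held in memory. The filtering thresholds then live next to the counting code rather than in a second script.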