Improve Preprocessing Speed
Preprocessing currently takes a long time for large datasets. One way to improve the speed is to use spaCy pipes, particularly for lemmatization: a pipe processes documents in batches and can spread the work across multiple processes, instead of running the full pipeline one document at a time. Preprocessing is a very useful class that can do a lot with simple argument configuration.
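For reference, here is a sketch of the setup the snippet below assumes; the model name and the stop-word and punctuation sources are placeholders I chose, not part of the proposal itself:

import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

spacy_nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline with a lemmatizer works
stop_words = STOP_WORDS                   # spaCy's built-in English stop-word list
punctuations = set(string.punctuation)    # ASCII punctuation characters
documents = ["First raw document.", "Second raw document."]  # placeholder corpus
processed_documents = []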
for doc in spacy_nlp.pipe(documents, batch_size=32, n_process=3, disable=["parser", "ner"]):
    # Lemmatize each token and lower-case it; pronouns keep their surface form,
    # since spaCy 2.x lemmatizes them to the placeholder "-PRON-"
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc]
    # Remove stop words and punctuation
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]
    processed_documents.append(tokens)
I'm happy to contribute code to make this change.
Hi! Lemmatization is definitely the biggest bottleneck in preprocessing. I didn't know about spaCy pipes; they seem like the right solution for us, since we already rely on spaCy for lemmatization.
If you want to contribute, feel free to open a pull request :) Thanks,
Silvia
Thanks - I'll work on this and submit a pull request.
Thank you! Let me know if you have any questions.
Silvia
How are we supposed to generate the vocabulary.tsx file in order to use the dataset = preprocessor.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt') method for preprocessing?
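For reference, here is the call in question written out in full; the import path and the no-argument constructor are my assumptions about the library's layout, not something confirmed in this thread:

# Hypothetical sketch; the module path and constructor arguments are assumptions
from preprocessing import Preprocessing

preprocessor = Preprocessing()
dataset = preprocessor.preprocess_dataset(
    documents_path=r'..\corpus.txt',
    labels_path=r'..\labels.txt',
)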