
Improve Preprocessing Speed

Open · aneesha opened this issue 4 years ago • 4 comments

Preprocessing currently takes a long time for large datasets. One way to improve speed is to use spaCy pipes, particularly for lemmatization. The Preprocessing class is very useful: it can do a lot with simple argument configuration. For example:

import string
import spacy

spacy_nlp = spacy.load("en_core_web_sm")
stop_words = spacy_nlp.Defaults.stop_words
punctuations = set(string.punctuation)
processed_documents = []
# documents is the list of raw text strings to preprocess
for doc in spacy_nlp.pipe(documents, batch_size=32, n_process=3, disable=["parser", "ner"]):
    # Lemmatize each token and lower-case it ("-PRON-" is the pronoun lemma in spaCy 2.x)
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc]
    # Remove stop words and punctuation
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]
    processed_documents.append(tokens)
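
For contrast, here is a sketch of the unbatched pattern that pipe replaces, assuming the current implementation calls the pipeline once per document (spacy_nlp and documents as above); each call pays the full pipeline overhead and runs in a single process:

# Slow baseline: one full pipeline invocation per document, no batching
processed_documents = []
for text in documents:
    doc = spacy_nlp(text)
    processed_documents.append([word.lemma_.lower().strip() for word in doc])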

I'm happy to contribute code to make this change.

aneesha avatar Aug 31 '21 21:08 aneesha

Hi! Lemmatization is definitely the biggest bottleneck in preprocessing. I didn't know about spaCy pipes. They seem like the right solution for us, since we already rely on spaCy for lemmatization.

If you want to contribute, feel free to open a pull request :) Thanks,

Silvia

silviatti avatar Sep 02 '21 15:09 silviatti

Thanks - I'll work on this and submit a pull request.

aneesha avatar Sep 03 '21 05:09 aneesha

Thank you! Let me know if you have any questions.

Silvia

silviatti avatar Sep 08 '21 11:09 silviatti

How are we supposed to generate the vocabulary.txt file in order to use the dataset = preprocessor.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt') method for preprocessing?
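
For reference, the OCTIS README suggests the vocabulary file is not written by hand: preprocess_dataset builds the vocabulary internally, and saving the returned dataset writes it to disk. A minimal sketch along those lines (the output folder name is a placeholder):

import string
from octis.preprocessing.preprocessing import Preprocessing

# Configuration mirrors the README example; adjust as needed
preprocessor = Preprocessing(lemmatize=True,
                             remove_punctuation=True,
                             punctuation=string.punctuation,
                             stopword_list='english')

# Builds the vocabulary internally while preprocessing the raw corpus
dataset = preprocessor.preprocess_dataset(documents_path=r'..\corpus.txt',
                                          labels_path=r'..\labels.txt')

# Saving writes corpus.tsv and vocabulary.txt into the folder
dataset.save('my_dataset_folder')  # placeholder folder name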

SaraAmd avatar Feb 01 '23 02:02 SaraAmd