Add multiprocessing

Open jdvala opened this issue 4 years ago • 4 comments

Given that cleaning text can sometimes be very time-consuming when the number of texts is huge, it would be really good if clean-text provided built-in multiprocessing support.

It could be really simple: you provide a flag, and there is an option to pass a list of texts instead of a single text.

What do you think?

jdvala, Nov 22 '21 07:11

I can help you do that if you agree.

jdvala, Nov 22 '21 08:11

Hey @jdvala, this is a good idea. I would suggest using Python's multiprocessing, e.g. with a pool. What's your opinion on this?
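For illustration, a pool-based approach could look roughly like the sketch below. It only uses the standard library's `multiprocessing.Pool`. The names `simple_clean` and `clean_with_pool` are hypothetical, and `simple_clean` is a toy stand-in for clean-text's `clean` function, not its real behavior.

```python
import re
from multiprocessing import Pool


def simple_clean(text: str) -> str:
    """Hypothetical stand-in for clean(): collapse whitespace and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_with_pool(texts, processes=2):
    """Clean each text in a worker process; results keep the input order."""
    with Pool(processes=processes) as pool:
        return pool.map(simple_clean, texts)


if __name__ == "__main__":
    # The __main__ guard matters on platforms that spawn worker processes.
    print(clean_with_pool(["  Hello   World ", "FOO\tbar"]))
```

The worker function has to be defined at module level so that it can be pickled and sent to the worker processes.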

jfilter, Nov 22 '21 17:11

Hi @jfilter, I have a few questions that I would like to discuss before starting to implement this. If we enable multiprocessing, we need a list of texts and not just a single text; currently the clean function only accepts str.

  • Does it make sense to have a completely separate function which calls the clean function?
  • Or do we make changes to the clean function?

I would recommend going for the first option, as people have gotten used to the current signature of the function and changing it would break that. So in my opinion we should have a clean_parallel function which calls the clean function.
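A separate wrapper along these lines might look like the following minimal sketch. The `clean` defined here is a toy stand-in with a made-up `lower` option so that the example runs on its own. The `n_workers` parameter and the chunk size of 64 are assumptions, not anything agreed on in this thread.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def clean(text: str, lower: bool = True) -> str:
    """Stand-in for the existing single-text clean(); not the real implementation."""
    text = " ".join(text.split())
    return text.lower() if lower else text


def clean_parallel(texts, n_workers=1, **clean_kwargs):
    """Hypothetical wrapper: apply clean() to every text, leaving clean() untouched."""
    worker = partial(clean, **clean_kwargs)
    if n_workers == 1:
        # Serial path avoids process start-up cost for small inputs.
        return [worker(t) for t in texts]
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        # chunksize batches texts per task to reduce inter-process overhead.
        return list(executor.map(worker, texts, chunksize=64))


if __name__ == "__main__":
    print(clean_parallel(["A  b", "C   D"], n_workers=2, lower=False))
```

With this shape, existing callers of `clean` are unaffected, and keyword arguments are simply forwarded to each call.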

Secondly, if a single text is large enough, then breaking it and parallelizing it also makes sense.
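Splitting a single large text could be sketched as below, assuming the cleaning rules never need to look across a line boundary (an assumption that may not hold for every option, e.g. whitespace normalization that merges lines). The helper names `split_into_chunks` and `clean_large_text` and the `max_chars` default are hypothetical.

```python
def split_into_chunks(text, max_chars=10_000):
    """Split text at line boundaries into chunks of roughly max_chars characters."""
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk once adding this line would exceed the limit.
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


def clean_large_text(text, clean_fn, map_fn=map, max_chars=10_000):
    """Clean chunks independently and rejoin them in order.

    map_fn can be the built-in map (serial) or a pool's map method (parallel).
    """
    chunks = split_into_chunks(text, max_chars)
    return "".join(map_fn(clean_fn, chunks))
```

Passing `pool.map` from a `multiprocessing.Pool` as `map_fn` would run the chunks in parallel, provided `clean_fn` is picklable.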

At this point I am unsure which of the two we should implement first.

jdvala, Jan 04 '22 15:01

Hey @jdvala, in my opinion, the clean function should also accept a list of texts and then return a list of processed texts.

Then, we need a new parameter, e.g. n_jobs, to specify the maximum number of parallel jobs. This is how joblib does it. We may also use joblib to do the multiprocessing. Or take a look at https://github.com/Slimmer-AI/mpire, since working with Python's multiprocessing feels clunky.
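A rough sketch of that signature follows, using only the standard library so it runs without extra dependencies; joblib or mpire could replace the executor. `_clean_one` is a toy stand-in for the real per-text logic, and treating `n_jobs=-1` as "all CPUs" mirrors joblib's convention but is an assumption here.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def _clean_one(text):
    """Stand-in for the per-text cleaning logic; not clean-text's real implementation."""
    return " ".join(text.split()).lower()


def clean(text, n_jobs=1):
    """Accept a str (returns a str) or a list of str (returns a list of str)."""
    if isinstance(text, str):
        return _clean_one(text)
    texts = list(text)
    if n_jobs == -1:
        # joblib-style convention (assumed here): -1 means use all CPUs.
        n_jobs = os.cpu_count() or 1
    if n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    if n_jobs == 1 or len(texts) < 2:
        return [_clean_one(t) for t in texts]
    with ProcessPoolExecutor(max_workers=n_jobs) as executor:
        return list(executor.map(_clean_one, texts))
```

Calling `clean("  Some Text ")` keeps working as before, while `clean(list_of_texts, n_jobs=4)` would spread the work over up to four processes.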

jfilter, Jan 05 '22 11:01