
remove use of parallel iterators except in batch methods

Open epwalsh opened this issue 5 years ago • 3 comments

This is an alternative to #306. It simply removes the use of parallel iterators except within the batch methods (encode_batch, decode_batch). The result would be that the non-batch versions of encode/decode would be safe to use before and after forking.
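The shape of the change can be sketched with std threads. This is a minimal, hypothetical stand-in, not the actual `tokenizers` code: `encode`/`encode_batch` here mirror the method names above, but the bodies are placeholders. The point is only that the non-batch path uses a plain sequential iterator (so it stays fork-safe), while parallelism is confined to the batch method (requires Rust 1.63+ for `thread::scope`):

```rust
use std::thread;

// Hypothetical stand-in for the non-batch path: a plain sequential
// iterator, with no thread pool touched, so it is safe around fork().
fn encode(input: &str) -> Vec<String> {
    input.split_whitespace().map(|t| t.to_lowercase()).collect()
}

// Hypothetical stand-in for the batch path: parallelism is confined
// here, with one scoped thread per input (the real code uses rayon).
fn encode_batch(inputs: &[&str]) -> Vec<Vec<String>> {
    thread::scope(|s| {
        let handles: Vec<_> = inputs
            .iter()
            .map(|input| s.spawn(move || encode(input)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    println!("{:?}", encode("Hello World"));
    println!("{:?}", encode_batch(&["Hello World", "Foo Bar"]));
}
```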

Surprisingly, this actually improves performance across the board in the encode benchmarks, by a huge margin in some cases. So this could be a good thing not just for Python safety, but for performance in general.

[Image: encode benchmark results]

epwalsh avatar Jun 17 '20 22:06 epwalsh

#187

epwalsh avatar Jun 17 '20 22:06 epwalsh

We should probably do some more benchmarks for this. This is indeed surprising, but I guess it is highly dependent on the different use cases, and might not reflect reality.

I was thinking about maybe having some way to limit the use of the parallel iterator to cases where there is enough work to process. Maybe using the same mechanism that we added in #311, while providing a minimum size to activate it, for example. What do you think?
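That idea could be sketched like this, using only std threads since this is just an illustration (the `MIN_PARALLEL_LEN` constant and `maybe_parallel_map` helper are hypothetical names, not part of the library, and the real cutoff would come from benchmarks):

```rust
use std::thread;

// Hypothetical cutoff below which threading overhead outweighs the gains;
// the real value would need to be determined by benchmarking.
const MIN_PARALLEL_LEN: usize = 8;

// Map `f` over `items`, going parallel only when there is enough work.
fn maybe_parallel_map<T, U, F>(items: Vec<T>, f: F) -> Vec<U>
where
    T: Send,
    U: Send,
    F: Fn(T) -> U + Sync,
{
    if items.len() < MIN_PARALLEL_LEN {
        // Small batch: plain sequential iterator, no thread overhead.
        items.into_iter().map(&f).collect()
    } else {
        // Large batch: fan out across scoped threads. A real implementation
        // would chunk the work rather than spawn one thread per item.
        let f = &f;
        thread::scope(|s| {
            let handles: Vec<_> = items
                .into_iter()
                .map(|item| s.spawn(move || f(item)))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        })
    }
}

fn main() {
    // 3 items stays sequential; 20 items takes the parallel path.
    println!("{:?}", maybe_parallel_map(vec![1, 2, 3], |x: i32| x + 1));
    println!("{:?}", maybe_parallel_map((0..20).collect(), |x: i32| x * 2));
}
```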

n1t0 avatar Jun 29 '20 16:06 n1t0

I agree. I guess we could add some benchmarks that vary the size of the input sequence to see if there's an obvious cutoff where parallelization helps.
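A benchmark sweep like that could look something like the following sketch. Everything here is hypothetical: `encode_one` is a synthetic stand-in for per-sequence work, the parallel path uses naive std scoped threads rather than rayon, and the batch sizes are arbitrary; the only goal is to show the shape of a sequential-vs-parallel crossover search:

```rust
use std::thread;
use std::time::Instant;

// Synthetic stand-in for the per-sequence work done by encode.
fn encode_one(seed: u64) -> u64 {
    (0..10_000u64).fold(seed, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
}

// Time the sequential and the naively parallel version for one batch size.
fn bench(size: usize) -> (f64, f64) {
    let inputs: Vec<u64> = (0..size as u64).collect();

    let start = Instant::now();
    let _seq: Vec<u64> = inputs.iter().map(|&n| encode_one(n)).collect();
    let seq_ms = start.elapsed().as_secs_f64() * 1e3;

    let start = Instant::now();
    let _par: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = inputs
            .iter()
            .map(|&n| s.spawn(move || encode_one(n)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    let par_ms = start.elapsed().as_secs_f64() * 1e3;

    (seq_ms, par_ms)
}

fn main() {
    // Sweep batch sizes to look for the crossover point where
    // parallelization starts to pay off.
    for &size in &[1, 8, 64, 256] {
        let (seq, par) = bench(size);
        println!("size={size:>4}  sequential={seq:.3}ms  parallel={par:.3}ms");
    }
}
```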

epwalsh avatar Jun 29 '20 17:06 epwalsh