# Batch processing does not speed up `en_core_web_trf`

## How to reproduce the behaviour
```python
import spacy

spacy.prefer_gpu()
nlp = spacy.load(
    "en_core_web_trf",
    disable=["tagger", "ner", "lemmatizer", "textcat"],
)

node = """Some really long string, 3000 characters"""
# simulating 96 pretty long docs (~75,000 characters each)
nodes = [node * 25] * 96
```
Then run each of the lines below separately and time it:
```python
# 1 min 7.5 s
[list(doc.sents) for doc in nlp.pipe(nodes, batch_size=96)]

# 1 min 7.3 s
[list(doc.sents) for doc in nlp.pipe(nodes, batch_size=32)]

# 1 min 8.2 s
[list(doc.sents) for doc in nlp.pipe(nodes, batch_size=1)]
```
Running the same thing with `en_core_web_lg` yields substantial gains from batching: the largest batch size runs in roughly a quarter of the time of `batch_size=1`.
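For reference, this is roughly how I collected the timings (a minimal sketch, not my exact script). The stand-in `lambda` below replaces the real `nlp.pipe` call so the snippet runs without a model or GPU; with spaCy loaded you would pass e.g. `lambda texts, bs: [list(d.sents) for d in nlp.pipe(texts, batch_size=bs)]` instead:

```python
import time

def time_batch_sizes(process, texts, batch_sizes):
    """Run process(texts, batch_size) for each batch size and
    return a dict mapping batch size -> elapsed seconds."""
    timings = {}
    for size in batch_sizes:
        start = time.perf_counter()
        process(texts, size)
        timings[size] = time.perf_counter() - start
    return timings

# Stand-in workload so the harness is runnable on its own;
# swap in the nlp.pipe call to reproduce the numbers above.
texts = ["some text"] * 96
timings = time_batch_sizes(lambda t, bs: [s.upper() for s in t], texts, [96, 32, 1])
print(timings)
```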
## Your Environment
Using a single RTX A6000
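To rule out a silent CPU fallback (which could also produce identical timings across batch sizes), I verified that the GPU is actually allocated; `spacy.prefer_gpu()` returns a bool rather than raising, so a quick check looks like:

```python
import spacy

# prefer_gpu() returns True only when a GPU was actually allocated;
# False means the pipeline would silently run on CPU.
gpu_active = spacy.prefer_gpu()
print("GPU active:", gpu_active)
```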
Output of `python -m spacy info --markdown`:
Info about spaCy
- spaCy version: 3.7.4
- Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Pipelines: en_core_web_lg (3.7.1), en_core_web_trf (3.7.3), en_core_web_sm (3.7.1), de_core_news_sm (3.7.0)
## Expected Behavior
My understanding from the documentation and this issue is that we should expect significant gains from batching, as we observe with `en_core_web_lg`. With `en_core_web_trf`, however, batching yields no measurable speedup.
I'm wondering whether this is a bug, or whether we simply should not expect batching to improve performance for a transformer + parser pipeline. Thanks for this awesome package, and in advance for your help!