Tokenizer uses a significant amount of memory compared to the input
## How to reproduce the behaviour
Download https://www.gutenberg.org/files/1342/1342-0.txt — Pride & Prejudice, about 0.8MB.
Then run:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
with open("./1342-0.txt") as f:
    book = f.read()

result = nlp.tokenizer(book)
```
Running this under the Fil memory profiler (`fil-profile run example.py`) shows that the tokenizer uses ~30MB of RAM to process the input file (the rightmost column in Fil's output). In other words, memory usage is 15-30× the original file size, with the range accounting for the uncertainty introduced by the array-doubling logic in `_realloc`.
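As a rough sanity check on that ratio, here is a back-of-envelope sketch. The token count and `sizeof(TokenC)` below are illustrative assumptions, not measured values:

```python
# Hypothetical back-of-envelope estimate; both numbers are assumptions.
file_size_mb = 0.8
n_tokens = 160_000   # assuming ~5 bytes of input text per token on average
tokenc_bytes = 112   # assumed sizeof(TokenC); the real value may differ

est_mb = n_tokens * tokenc_bytes / 1e6
print(f"estimated token storage: {est_mb:.1f} MB")  # ~17.9 MB
```

Even before any over-allocation from `_realloc` doubling, a large per-token struct alone accounts for a ~20× blow-up relative to the input text under these assumptions.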

## Your Environment
- spaCy version: 3.4.1
- Platform: Linux-5.18.10-76051810-generic-x86_64-with-glibc2.35
- Python version: 3.10.4
- Pipelines: en_core_web_sm (3.4.0)
## Some ideas on solving this
The memory usage appears to be dominated by the `tokens` array of `TokenC` structs. Shrinking `TokenC` is the straightforward approach:
- Reorder the fields in declining order of size, so alignment requirements don't add unnecessary padding. See https://lwn.net/Articles/335942/ for an example of padding increasing memory use.
- Some of the fields on `TokenC` could presumably be switched to smaller types, e.g. `uint32_t` (or perhaps even 16- or 8-bit in some cases) instead of `uint64_t`.
- My vague impression is that a `TokenC` can store different information about different types of tokens, i.e. it has fields that are used for one kind of token but not another, and vice versa. Switching to a union instead of one big struct would reduce memory usage from the sum of all variants to the max of all variants.
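The padding and union points can be illustrated with `ctypes`. The structs below are toy layouts, not the real `TokenC`:

```python
import ctypes

# Toy struct with poor field ordering: the 1-byte fields force padding
# before the 8-byte field and at the end (typical 64-bit alignment).
class Padded(ctypes.Structure):
    _fields_ = [
        ("a", ctypes.c_uint8),
        ("b", ctypes.c_uint64),
        ("c", ctypes.c_uint8),
        ("d", ctypes.c_uint32),
    ]

# Same fields sorted by declining size: no interior padding.
class Reordered(ctypes.Structure):
    _fields_ = [
        ("b", ctypes.c_uint64),
        ("d", ctypes.c_uint32),
        ("a", ctypes.c_uint8),
        ("c", ctypes.c_uint8),
    ]

print(ctypes.sizeof(Padded), ctypes.sizeof(Reordered))  # 24 16 on x86_64

# A union is as large as its biggest member, not the sum of all members.
class AllVariants(ctypes.Structure):
    _fields_ = [("small", ctypes.c_uint64), ("big", ctypes.c_uint64 * 4)]

class OneVariant(ctypes.Union):
    _fields_ = [("small", ctypes.c_uint64), ("big", ctypes.c_uint64 * 4)]

print(ctypes.sizeof(AllVariants), ctypes.sizeof(OneVariant))  # 40 32
```

Multiplied by hundreds of thousands of tokens, even a few bytes of padding per struct adds up.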
## Is this a significant problem?
I'm not sure. But I imagine people sometimes try to tokenize large documents, and e.g. a 100MB input would probably take ~3GB of RAM, which starts to add up. Users could split the documents up themselves before tokenizing, but that's a bit strange given what a tokenizer is for :grin:
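For what it's worth, a sketch of the splitting workaround (the chunk size and the blank-line split are arbitrary choices on my part, and the function takes any tokenizer callable, e.g. `nlp.tokenizer`):

```python
def iter_tokens_chunked(tokenize, text, chunk_chars=100_000):
    """Tokenize a large text in pieces to bound peak memory.

    `tokenize` is any callable returning an iterable of tokens.
    Splitting happens on blank lines so no token straddles a
    chunk boundary.
    """
    buf, size = [], 0
    for para in text.split("\n\n"):
        buf.append(para)
        size += len(para) + 2
        if size >= chunk_chars:
            yield from tokenize("\n\n".join(buf))
            buf, size = [], 0
    if buf:
        yield from tokenize("\n\n".join(buf))

# Hypothetical usage with spaCy:
#   for token in iter_tokens_chunked(nlp.tokenizer, book):
#       ...
```

Only one chunk's worth of `TokenC` structs is alive at a time, at the cost of losing a single `Doc` spanning the whole text.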
Thanks for the report and the suggestions! This is indeed something that we'd like to look into for the next major release of spaCy, where we have more latitude to make ABI-breaking changes to the library.