
Tokenizer uses a significant amount of memory compared to the input

Open · itamarst opened this issue 3 years ago · 1 comment

How to reproduce the behaviour

Download https://www.gutenberg.org/files/1342/1342-0.txt — Pride & Prejudice, about 0.8MB.

Then run:

import spacy

nlp = spacy.load("en_core_web_sm")
with open("./1342-0.txt") as f:
    book = f.read()
    result = nlp.tokenizer(book)

Running this under the Fil memory profiler (fil-profile run example.py) shows the tokenizer using 30MB of RAM to process the input file (the rightmost column). In other words, memory usage is 15-30× the original file size, allowing for the uncertainty introduced by the doubling logic in _realloc.

Screenshot of Fil output

Your Environment

  • spaCy version: 3.4.1
  • Platform: Linux-5.18.10-76051810-generic-x86_64-with-glibc2.35
  • Python version: 3.10.4
  • Pipelines: en_core_web_sm (3.4.0)

Some ideas on solving this

Basically, the memory usage seems to be tied to the tokens array of TokenC structs. Shrinking TokenC is the straightforward approach:

  • Change the order of the fields so they are in declining order of size, so that alignment requirements don't add unnecessary padding. See https://lwn.net/Articles/335942/ for an example of padding increasing memory use.
  • Some of the fields on TokenC could presumably be switched to smaller types, e.g. uint32_t (or perhaps even 16- or 8-bit in some cases) instead of uint64_t.
  • My vague impression is that a TokenC can store different information about different types of tokens, i.e. it has fields that are used for one kind of token but not another, and vice versa. Switching to a union instead of one big struct would reduce the memory usage from the sum of all variants to the max of all variants (see the sketch after this list).
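As a rough illustration of the reordering and union points, here is a ctypes sketch with toy structs (not spaCy's actual TokenC layout) showing how field order and a union change struct size on a typical 64-bit platform:

import ctypes

# Fields in a padding-unfriendly order: the 1-byte flag forces 7 bytes of
# padding before the 8-byte field, and 4 bytes of tail padding follow the
# 4-byte field to keep the struct 8-byte aligned.
class Padded(ctypes.Structure):
    _fields_ = [
        ("flag", ctypes.c_uint8),
        ("big", ctypes.c_uint64),
        ("small", ctypes.c_uint32),
    ]

# Same fields in declining order of size: no internal padding is needed.
class Reordered(ctypes.Structure):
    _fields_ = [
        ("big", ctypes.c_uint64),
        ("small", ctypes.c_uint32),
        ("flag", ctypes.c_uint8),
    ]

print(ctypes.sizeof(Padded))     # 24 bytes
print(ctypes.sizeof(Reordered))  # 16 bytes

# Union idea: if two kinds of token need disjoint sets of fields, a struct
# pays for the sum of both sets, while a union pays only for the larger one.
class BothKinds(ctypes.Structure):
    _fields_ = [
        ("kind_a_fields", ctypes.c_uint64 * 4),
        ("kind_b_fields", ctypes.c_uint64 * 6),
    ]

class EitherKind(ctypes.Union):
    _fields_ = [
        ("kind_a_fields", ctypes.c_uint64 * 4),
        ("kind_b_fields", ctypes.c_uint64 * 6),
    ]

print(ctypes.sizeof(BothKinds))   # 80 bytes (sum)
print(ctypes.sizeof(EitherKind))  # 48 bytes (max)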

Is this a significant problem?

I'm not sure. But I imagine people sometimes try to tokenize large documents, and e.g. a 100MB input would probably take around 3GB of RAM, which starts to add up. Users could split the documents up themselves before tokenizing (see the sketch below), but that feels a bit strange given that splitting text is exactly what a tokenizer does :grin:
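A minimal sketch of that workaround, assuming splitting on blank lines is acceptable for the document in question, and using Tokenizer.pipe so each chunk is tokenized separately:

import spacy

nlp = spacy.load("en_core_web_sm")
with open("./1342-0.txt") as f:
    book = f.read()

# Naive chunking on blank lines; tokenizing chunk by chunk keeps peak memory
# proportional to the largest chunk rather than to the whole book.
chunks = [chunk for chunk in book.split("\n\n") if chunk.strip()]
for doc in nlp.tokenizer.pipe(chunks):
    ...  # work with each small Doc here instead of holding one huge Doc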

itamarst · Aug 11 '22, 17:08

Thanks for the report and the suggestions! This is indeed something that we'd like to look into for the next major release of spaCy, where we have more latitude to make ABI-breaking changes to the library.

shadeMe · Aug 12 '22, 08:08