
Issue: docling-serve GPU Memory Leak

Open sir3mat opened this issue 11 months ago • 5 comments


Description: docling-serve exhibits steadily increasing GPU memory usage over time when processing a consistent stream of documents, suggesting a memory leak. This leads to potential OOM errors.

Expected: GPU memory should plateau after model loading and remain relatively stable.

Actual: GPU memory usage continuously increases, as seen via nvidia-smi.

Steps to Reproduce:

Set up docling-serve with GPU backend (e.g., pdf_backend: dlparse_v2) and OCR enabled (do_ocr: true).
Send repeated requests to /v1alpha/convert/source or /v1alpha/convert/file with PDF documents. Example curl commands are provided in the original, longer issue description.
Monitor GPU memory usage with nvidia-smi.
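The steps above can be sketched as a small reproduction script. The endpoint URL, port, and the option subset are assumptions based on this issue's text, not a verified docling-serve configuration; adjust them to your deployment:

```python
import json
import subprocess
import urllib.request

# Assumed local docling-serve endpoint; adjust host/port to your deployment.
DOCLING_URL = "http://localhost:5001/v1alpha/convert/source"

def build_payload(pdf_url: str) -> dict:
    """Request body mirroring a subset of the options from this issue."""
    return {
        "options": {
            "to_formats": ["md"],
            "do_ocr": True,
            "ocr_engine": "easyocr",
            "pdf_backend": "dlparse_v2",
            "table_mode": "accurate",
            "do_table_structure": True,
        },
        "http_sources": [{"url": pdf_url}],
    }

def parse_memory_used(nvidia_smi_csv: str) -> int:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`
    output: one integer (MiB) per GPU line; the first GPU is taken here."""
    return int(nvidia_smi_csv.strip().splitlines()[0])

def gpu_memory_used_mib() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_memory_used(out)

def run_repro(n_requests: int = 20) -> None:
    """Send the same conversion request repeatedly and log the memory mark."""
    for i in range(n_requests):
        req = urllib.request.Request(
            DOCLING_URL,
            data=json.dumps(
                build_payload("https://arxiv.org/pdf/2206.01062")
            ).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).read()
        print(f"request {i + 1}: {gpu_memory_used_mib()} MiB used")

# run_repro()  # requires a running docling-serve instance and an NVIDIA GPU
```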

How can this be avoided in GPU-limited environments? I have an H100 with a 12GB MIG partition, but after 4 PDFs it starts to throw OOM errors.

docling-serve request params: I use docling-serve with these params:

```python
params = {
    "from_formats": [
        "docx", "pptx", "html", "image", "pdf", "asciidoc", "md",
        "csv", "xlsx", "xml_uspto", "xml_jats", "json_docling",
    ],
    "to_formats": ["md"],
    "image_export_mode": "placeholder",
    "do_ocr": True,
    "force_ocr": False,
    "ocr_engine": "easyocr",
    "ocr_lang": None,
    "pdf_backend": "dlparse_v2",
    "table_mode": "accurate",
    "abort_on_error": False,
    "return_as_file": False,
    "do_table_structure": True,
    "include_images": True,
    "images_scale": 2.0,
}
```

sir3mat avatar Feb 28 '25 14:02 sir3mat

We run tests internally. The infrastructure is an Openshift cluster. Pod reservation specs:

  • single A10 GPU
  • cpu 9
  • ram 24Gb

The settings that were used:

```json
"options": {
    "from_formats": [
        "docx", "pptx", "html", "image", "pdf", "asciidoc",
        "md", "xlsx", "xml_uspto", "xml_jats", "json_docling"
    ],
    "to_formats": ["md"],
    "image_export_mode": "placeholder",
    "do_ocr": true,
    "force_ocr": false,
    "ocr_engine": "easyocr",
    "ocr_lang": [],
    "pdf_backend": "dlparse_v2",
    "table_mode": "accurate",
    "abort_on_error": false,
    "return_as_file": false,
    "do_table_structure": true,
    "include_images": true,
    "images_scale": 2
},
```

Tried to convert several documents individually:

  • https://arxiv.org/pdf/2206.01062
  • https://arxiv.org/pdf/2501.17887
  • https://arxiv.org/pdf/2411.19710

And then all three of them as a single payload.

The observed behavior, monitoring with nvidia-smi, is the following:

  • the memory mark goes up at the first request
  • if the same document is submitted again and again, the mark stays at the same level
  • when new documents are submitted, the mark can go up or stay where it was, depending on the document
  • when all three documents were submitted, the mark stayed at 3229MiB
  • the mark doesn't change while the system is idling

I'm not sure whether the memory mark in nvidia-smi reports actual consumption, or what was "booked" or merely "touched", specifically in the case of docling-serve.
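One relevant detail, assuming docling's layout/OCR models run on PyTorch: nvidia-smi reports the memory reserved by PyTorch's caching allocator, which is not returned to the driver when tensors are freed, so the mark can sit above actual live-tensor usage. The allocator behavior can be tuned via an environment variable; this is a hedged suggestion for experimentation, not a confirmed fix:

```shell
# PyTorch caching-allocator tuning; the serving process must be restarted
# to pick this up. max_split_size_mb limits block splitting, which can
# reduce fragmentation-driven growth of the reserved-memory mark that
# nvidia-smi attributes to the process.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```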

The payload was rather similar to each other and potentially didn't trigger the issue. @sir3mat could you share with us documents on which you observed this issue, of course as long as these documents are publicly available and accessible. Otherwise perhaps a description of the content could help, like are they heavy on images, tables, perhaps they need full OCR and etc.

vku-ibm avatar Mar 06 '25 12:03 vku-ibm

Additional findings. There is a definite trend of the GPU memory mark moving higher as more documents are submitted, most likely because some documents cause more memory to be reserved than others.

Additionally, when you change conversion options against the same instance, the instance caches the settings object, and there is currently no limit to the size of that cache. I've tried about 20 different permutations of the conversion options with the same document; while the GPU memory mark does go up more with some options than others, it doesn't seem to "accumulate" or depend on the size of the cache. What does get affected is the RAM consumption. If there were more options that produce more permutations of the models loaded on the GPU, we could potentially see a larger impact.
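To illustrate the concern (a hypothetical sketch of the caching pattern, not docling-serve's actual code): an unbounded per-options cache keeps every converter, and with it every model loaded for it, alive for the life of the process, while a bounded LRU cache evicts old entries so their memory can be reclaimed:

```python
from functools import lru_cache

def options_to_key(options: dict) -> tuple:
    # dicts aren't hashable; freeze into an order-insensitive tuple
    return tuple(sorted((k, str(v)) for k, v in options.items()))

# Bounded cache: the 5th distinct options permutation evicts the
# least-recently-used converter, so its memory can be reclaimed.
# An unbounded dict cache would instead grow with every permutation tried.
@lru_cache(maxsize=4)
def get_converter(options_key: tuple) -> str:
    # stand-in for building a converter and loading its models
    return f"converter-{len(options_key)}-options"
```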

So far, the highest mark I've got is 12691MiB, with the following settings and payload:

```json
"do_ocr": true,
"force_ocr": true,
"ocr_engine": "easyocr",
"ocr_lang": [],
"pdf_backend": "dlparse_v2",
"table_mode": "accurate",
"abort_on_error": false,
"return_as_file": false,
"do_table_structure": true,
"include_images": true,
"images_scale": 2

"http_sources": [
    {"url": "https://arxiv.org/pdf/2206.01062"},
    {"url": "https://arxiv.org/pdf/2501.17887"},
    {"url": "https://arxiv.org/pdf/2411.19710"},
    {"url": "https://arxiv.org/pdf/2409.18164"},
    {"url": "https://arxiv.org/pdf/2408.09869"},
    {"url": "https://arxiv.org/pdf/2406.19102"},
    {"url": "https://arxiv.org/pdf/2405.10725"},
    {"url": "https://arxiv.org/pdf/2305.14962"},
    {"url": "https://arxiv.org/pdf/2209.03648"},
    {"url": "https://arxiv.org/pdf/2206.00785"}
]
```

vku-ibm avatar Mar 06 '25 15:03 vku-ibm

  • single A10 GPU
  • cpu 9
  • ram 24Gb

By the way, may I ask how much time it takes to process a PDF with such resources?

archasek avatar Mar 06 '25 15:03 archasek

  • single A10 GPU
  • cpu 9
  • ram 24Gb

Note that these are just the reservation values. The actual usage is 1.5-2 GB of RAM.

dolfim-ibm avatar Mar 06 '25 15:03 dolfim-ibm

@vku-ibm Any updates on this?

viktorlarsson avatar May 22 '25 08:05 viktorlarsson