
[Document Intelligence] Stream response for files with large text content to prevent OOM event

Open kevinkupski opened this issue 1 year ago • 3 comments

Is your feature request related to a problem? Please describe. We see high memory usage in production (which leads to out-of-memory errors) when users upload files with a lot of textual content to our app, which uses Document Intelligence. For a test file with ~200,000 characters, about 240 MB are allocated in memory when calling poller.result(), but if we extract only the relevant content (strings), it comes to roughly 10 MB.

It looks like the relevant code for this is located here. Does anybody have an approach or idea to limit memory usage?

Describe the solution you'd like We'd like to reduce the data held in memory. It looks like the API does not provide this, but we'd like to stream the result from Document Intelligence and process it chunk by chunk – perhaps as JSON Lines or any other streamable data format.

Describe alternatives you've considered Alternatively, we only require the paragraphs field and could discard the rest of the response to reduce its size – like a select on the fields of the response. This would not scale as well as the streaming approach, but it might improve our current situation a bit.

kevinkupski avatar Oct 07 '24 16:10 kevinkupski

Thanks for the feedback, we’ll investigate asap.

xiangyan99 avatar Oct 07 '24 23:10 xiangyan99

There is currently no built-in mechanism to address this. A few possible workarounds:

  • Wrap the call to Document Intelligence in an Azure Function that returns only the desired content.
  • Instead of using the Python SDK, call the REST API directly, using a streaming JSON parser (e.g. json_stream) to extract only the desired content; a rough sketch follows below.
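
A minimal sketch of that second option might look like the following. It uses ijson (json_stream works along the same lines); the endpoint, key, model ID, and API version are placeholder assumptions to adapt to your resource, and the JSON paths assume the current response layout, so treat this as a sketch rather than a drop-in implementation:

```python
# Sketch: call the Document Intelligence REST API directly and stream the
# result body, so the full AnalyzeResult never has to sit in memory at once.
# ENDPOINT, KEY, MODEL_ID and API_VERSION are placeholders (assumptions).
import time
import requests
import ijson  # pip install ijson

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-key>"
MODEL_ID = "prebuilt-layout"
API_VERSION = "2024-02-29-preview"  # assumption; use the version you target

headers = {"Ocp-Apim-Subscription-Key": KEY}

# 1. Submit the document for analysis.
with open("large-document.pdf", "rb") as f:
    submit = requests.post(
        f"{ENDPOINT}/documentintelligence/documentModels/{MODEL_ID}:analyze",
        params={"api-version": API_VERSION},
        headers={**headers, "Content-Type": "application/octet-stream"},
        data=f,
    )
submit.raise_for_status()
result_url = submit.headers["Operation-Location"]  # URL where the result appears

# 2. Poll the operation and stream the response body instead of loading it.
while True:
    with requests.get(result_url, headers=headers, stream=True) as resp:
        resp.raise_for_status()
        resp.raw.decode_content = True
        events = ijson.parse(resp.raw)
        # "status" appears near the start of the JSON, so we can read it
        # without consuming the (potentially huge) analyzeResult section.
        status = next(v for prefix, _, v in events if prefix == "status")
        if status == "succeeded":
            # 3. Keep consuming the same stream and pull out only the
            #    paragraph text, discarding everything else as it goes by.
            for prefix, _, value in events:
                if prefix == "analyzeResult.paragraphs.item.content":
                    print(value)  # or append to your own lightweight structure
            break
        if status == "failed":
            raise RuntimeError("analysis failed")
    time.sleep(2)
```

Because the parser only materializes one paragraph at a time, peak memory is bounded by the largest single value rather than by the whole response.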

bojunehsu avatar Oct 21 '24 16:10 bojunehsu

@bojunehsu thank you for the feedback and the hint about json_stream. Will have a look into that. 👍

kevinkupski avatar Oct 21 '24 16:10 kevinkupski

I have encountered the same issue. @kevinkupski, have you been able to get the JSON stream parser working?

marekratho avatar Dec 05 '24 08:12 marekratho

@marekratho unfortunately the issue got deprioritized internally, so I have not yet tried to implement the workaround.

kevinkupski avatar Dec 06 '24 08:12 kevinkupski

> Instead of using the Python SDK, call the REST API directly, using a streaming JSON parser (e.g. json_stream) to extract only the desired content.

This is definitely a solution for now. In our case we use ijson + the Azure REST APIs, and memory usage is significantly reduced: peak usage dropped from 1.8 GB to 400 MB for a 500+ page PDF.

I think the problem with the Azure SDK is that the AnalyzeResult object contains many nested classes (such as figures, paragraphs, and tables), which are memory-intensive. It would be beneficial to have an option that directly returns the results as Python dictionaries instead of converting them into Azure-specific classes, which would reduce memory overhead.

shinxi avatar Dec 24 '24 10:12 shinxi

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @bojunehsu @vkurpad.

github-actions[bot] avatar Mar 27 '25 00:03 github-actions[bot]

@shinxi Thanks for the suggestion to use Python dictionaries. Unfortunately, that will not really address the problem, since representing JSON as dictionaries also has significant overhead. You can consider using Pydantic to deserialize the JSON into more compact Python data classes, but it may still be a problem if the initial JSON is large. Streaming may be the best option if you only need a subset of the JSON.
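
To illustrate the Pydantic idea, a minimal sketch (assuming Pydantic v2, and that only the paragraph text is needed; the field names mirror the paragraphs portion of the REST response and are otherwise hypothetical):

```python
# Sketch: deserialize only the fields you need into compact Pydantic models.
# Extra fields in the JSON are ignored by default, which keeps the resulting
# objects small — but the raw JSON string still has to fit in memory, so
# streaming remains the better option for very large results.
from pydantic import BaseModel

class Paragraph(BaseModel):
    content: str

class AnalyzeResultSlim(BaseModel):
    paragraphs: list[Paragraph] = []

class AnalyzeOperationSlim(BaseModel):
    status: str
    analyzeResult: AnalyzeResultSlim | None = None

# Example response body (normally the bytes/str returned by the REST call).
raw_json = '{"status": "succeeded", "analyzeResult": {"paragraphs": [{"content": "Hello"}]}}'

operation = AnalyzeOperationSlim.model_validate_json(raw_json)
if operation.analyzeResult:
    texts = [p.content for p in operation.analyzeResult.paragraphs]
    print(texts)
```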

bojunehsu avatar Apr 02 '25 20:04 bojunehsu