Form Recognizer (cf closed issue #24903). Run model on each page independently.

Open dev-td opened this issue 3 years ago • 1 comments

    > The current Invoice API assumes that the whole input document is a single invoice and does not do any document splitting. It selects the value with the highest confidence across all pages for each field. It is possible to use the "pages" parameter of the Invoice API to run the model on each page independently and implement some custom post-processing logic to merge the results (e.g. if InvoiceId is different on the next page, start a new invoice).
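
The merge step described in the quote could look roughly like the sketch below. It is only an illustration, not the service's or the SDK's method: the function name `group_pages_into_invoices` and the simplified per-page dict shape (`{"page": ..., "InvoiceId": ...}`) are assumptions, and the caller is expected to extract the InvoiceId from each per-page response first.

```python
from typing import Any, Dict, List, Optional


def group_pages_into_invoices(
    page_results: List[Dict[str, Any]],
) -> List[List[Dict[str, Any]]]:
    """Group consecutive per-page results that appear to share an InvoiceId.

    Each item is a hypothetical, simplified dict such as
    {"page": 1, "InvoiceId": "A"}; it is not the raw API response.
    """
    invoices: List[List[Dict[str, Any]]] = []
    previous_id: Optional[str] = None
    for page in page_results:
        invoice_id = page.get("InvoiceId")
        # Start a new invoice on the very first page, or when this page has
        # an InvoiceId that differs from the last one seen.
        starts_new = not invoices or (
            invoice_id is not None
            and previous_id is not None
            and invoice_id != previous_id
        )
        if starts_new:
            invoices.append([page])
            previous_id = invoice_id
        else:
            invoices[-1].append(page)
            if invoice_id is not None:
                previous_id = invoice_id
    return invoices


pages = [
    {"page": 1, "InvoiceId": "A"},
    {"page": 2, "InvoiceId": None},  # no id detected: stays with invoice "A"
    {"page": 3, "InvoiceId": "B"},
]
groups = group_pages_into_invoices(pages)
print([[p["page"] for p in group] for group in groups])
```

Pages without a detected InvoiceId are attached to the preceding invoice here; that is a design choice for the sketch, and a real pipeline might prefer to compare other fields as well.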

Hi @anatolip, @vkurpad, @catalinaperalta. I hope this hasn't been covered in another thread. Could you please let me know what value I should pass in the "pages" parameter to get two analyzed documents instead of one for the use case described above (i.e. a 2-page PDF where each page is a separate single-page document to be analyzed)? So far I've tried the following:

pages = "1,2"
pages = "1-1,2-2"

Unfortunately, len(results["analyzeResult"]["documents"]) still equals 1.

To reproduce this result, you can use the following code together with the attached PDF file.

import aiohttp
import asyncio
import io


async def analyze_docs(data):
   
    # Config
    modelId = "prebuilt-invoice"
    pages = "1,2" # pages = 1,1 or pages = 1-1,2-2 or other ?
    url = f"https://westeurope.api.cognitive.microsoft.com/formrecognizer/documentModels/{modelId}:analyze?api-version=2022-08-31&pages={pages}"
    headers = {
        "Content-Type": "application/octet-stream",
        "Ocp-Apim-Subscription-Key": "...",
    }

    # Post request
    async with aiohttp.ClientSession(headers=headers) as session:
        while True:
            response = await session.post(url=url, data=data)
            if response.status in [200, 202]:
                callback = response.headers["Operation-Location"]
                break
            await asyncio.sleep(1)  # Back off before retrying instead of busy-looping

    # Get results request
    headers = {"Ocp-Apim-Subscription-Key": "..."}
    async with aiohttp.ClientSession(headers=headers) as session:
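        while True:
            response = await session.get(callback)
            results = await response.json()
            status = results.get("status")
            if status == "succeeded":
                break
            if status == "failed":
                # Stop polling if the service reports a failed analysis
                raise RuntimeError(f"Analysis failed: {results}")
            await asyncio.sleep(1)  # Wait between polls instead of busy-looping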

    nb_docs = len(results["analyzeResult"]["documents"])  # Equals 1 instead of the expected 2.
    return nb_docs


async def main() -> None:

    path_to_file = "2_docs_to_analyse.pdf"

    with open(path_to_file,"rb") as f:
        stream = f.read()

    # Run the analysis and print the number of documents returned
    print(await analyze_docs(stream))


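if __name__ == "__main__":
    # The selector event loop policy only exists on Windows; skip it elsewhere.
    if hasattr(asyncio, "WindowsSelectorEventLoopPolicy"):
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    asyncio.run(main())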

2_docs_to_analyse.pdf

Originally posted by @dev-td in https://github.com/Azure/azure-sdk-for-python/issues/24903#issuecomment-1368061516

dev-td, Dec 30 '22 19:12

Thanks for reaching out @dev-td! @anatolip and @vkurpad please share any guidance you have for this case. I see that in the older issue #24903, @anatolip mentioned that pages might be able to be used for this purpose, is that correct? If so, is the example above the recommended way to do this?

In the meantime, as a workaround it might be best to send 2 requests @dev-td, one with pages="1" and another request with pages="2".
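
A minimal sketch of that workaround, assuming the same endpoint, model ID and api-version as in the snippet above: it only builds one analyze URL per page, and each URL would then be posted and polled exactly as in the original code. The helper name `build_per_page_urls` is hypothetical.

```python
from typing import List
from urllib.parse import urlencode


def build_per_page_urls(
    endpoint: str,
    model_id: str,
    page_count: int,
    api_version: str = "2022-08-31",
) -> List[str]:
    """Return one analyze URL per page, each restricted to a single page."""
    base = f"{endpoint.rstrip('/')}/formrecognizer/documentModels/{model_id}:analyze"
    urls = []
    for page in range(1, page_count + 1):
        # pages="1", pages="2", ... so that each request covers one page only
        query = urlencode({"api-version": api_version, "pages": str(page)})
        urls.append(f"{base}?{query}")
    return urls


urls = build_per_page_urls(
    "https://westeurope.api.cognitive.microsoft.com",
    "prebuilt-invoice",
    page_count=2,
)
for url in urls:
    print(url)
```

Sending the same file once per URL should yield one analyzed document per request, at the cost of one request per page.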

catalinaperalta, Jan 04 '23 00:01

Hi @catalinaperalta, I got an answer from the Product Group team:

“Currently the analyze document expects only a single document in the payload. If there are multiple documents in the file, only the first document is processed. The options are to split the PDF using open source tools (https://www.pdflabs.com/tools/pdftk-server) if you know the document ranges, or use the pages option in the request and send the same document with different page ranges in each request. The split and classify API is planned for February, this would help with files where the page ranges are not known, but will still currently require multiple requests. There will be an update in H2 of 2023 where this will be a single request solution”

dev-td, Jan 06 '23 16:01