Form Recognizer (cf closed issue #24903). Run model on each page independently.
> The current Invoice API assumes that the whole input document is a single invoice and does not do any document splitting. It selects the value with the highest confidence across all pages for each field. It is possible to use the "pages" parameter of the Invoice API to run the model on each page independently and implement some custom post-processing logic to merge the results (e.g. if InvoiceId is different on the next page, start a new invoice).
Hi @anatolip, @vkurpad, @catalinaperalta. I hope this hasn't been covered in another thread. Can you please let me know what value I should pass for the "pages" parameter in order to get two analyzed documents instead of one for the use case described above (i.e. a 2-page PDF where each page is a separate single-page document to be analyzed)? So far I've tried the following:
pages = "1,2"
pages = "1-1,2-2"
But unfortunately :
len(results["analyzeResult"]["documents"]) equals 1
In order to reproduce the result explained above you can use the following code as well as the attached pdf file.
import asyncio
import sys

import aiohttp


async def analyze_docs(data):
    # Config
    modelId = "prebuilt-invoice"
    pages = "1,2"  # pages = 1,1 or pages = 1-1,2-2 or other ?
    url = f"https://westeurope.api.cognitive.microsoft.com/formrecognizer/documentModels/{modelId}:analyze?api-version=2022-08-31&pages={pages}"
    headers = {
        "Content-Type": "application/octet-stream",
        "Ocp-Apim-Subscription-Key": "...",
    }
    # Post request: on success, the polling URL is returned in Operation-Location
    async with aiohttp.ClientSession(headers=headers) as session:
        while True:
            response = await session.post(url=url, data=data)
            if response.status in [200, 202]:
                callback = response.headers["Operation-Location"]
                break
            await asyncio.sleep(1)  # wait before retrying the submission
    # Get results request: poll until the operation reaches a terminal status
    headers = {"Ocp-Apim-Subscription-Key": "..."}
    async with aiohttp.ClientSession(headers=headers) as session:
        while True:
            response = await session.get(callback)
            results = await response.json()
            if results.get("status") == "succeeded":
                break
            if results.get("status") == "failed":
                raise RuntimeError(f"Analysis failed: {results}")
            await asyncio.sleep(1)  # avoid hammering the service while it runs
    nb_docs = len(results["analyzeResult"]["documents"])  # Equals 1 instead of 2 as expected.
    return nb_docs


async def main() -> None:
    path_to_file = "2_docs_to_analyse.pdf"
    with open(path_to_file, "rb") as f:
        stream = f.read()
    nb_docs = await analyze_docs(stream)
    print(nb_docs)


if __name__ == "__main__":
    # The selector event loop policy only exists (and is only needed) on Windows
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    asyncio.run(main())
Originally posted by @dev-td in https://github.com/Azure/azure-sdk-for-python/issues/24903#issuecomment-1368061516
Thanks for reaching out @dev-td! @anatolip and @vkurpad please share any guidance you have for this case. I see that in the older issue #24903, @anatolip mentioned that pages might be able to be used for this purpose, is that correct? If so, is the example above the recommended way to do this?
In the meantime, as a workaround it might be best to send 2 requests @dev-td, one with pages="1" and another request with pages="2".
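A minimal sketch of that workaround, assuming the number of pages is known in advance. `analyze_each_page` and `fake_analyze_pages` are hypothetical names, not part of any SDK: the first fans out one call per page, and the second is only a stand-in for a coroutine that would do the POST and the polling from the snippet earlier in this thread with the given `pages` value.

```python
import asyncio


async def analyze_each_page(data, page_count, analyze_pages):
    """Call `analyze_pages(data, pages)` once per page; results come back in page order.

    `analyze_pages` is a hypothetical coroutine supplied by the caller. It is
    expected to send the request with the given `pages` value and return the
    parsed JSON result.
    """
    tasks = [analyze_pages(data, str(page)) for page in range(1, page_count + 1)]
    return await asyncio.gather(*tasks)


async def fake_analyze_pages(data, pages):
    # Stand-in for the real HTTP calls, so that this sketch runs offline.
    # The returned shape is illustrative only.
    return {"requested_pages": pages, "analyzeResult": {"documents": [{}]}}


results = asyncio.run(analyze_each_page(b"%PDF-", 2, fake_analyze_pages))
print([r["requested_pages"] for r in results])
```

With the real request logic plugged in, each entry of `results` would be the analysis of a single page, so a 2-page file yields two results instead of one.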
Hi @catalinaperalta, I got an answer from the Product Group team:
“Currently the analyze document operation expects only a single document in the payload. If there are multiple documents in the file, only the first document is processed. The options are to split the PDF using open source tools (https://www.pdflabs.com/tools/pdftk-server) if you know the document ranges, or use the pages option in the request and send the same document with different page ranges in each request. The split and classify API is planned for February. This would help with files where the page ranges are not known, but will still currently require multiple requests. There will be an update in H2 of 2023 where this will be a single-request solution.”
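When the page ranges are not known, the post-processing idea quoted at the top of this thread (start a new invoice when InvoiceId changes) can be sketched as below. This is an illustration under assumptions, not the service's own logic: `group_pages_by_invoice` is a hypothetical helper, the per-page InvoiceId values are assumed to have been extracted already from one single-page request per page, and a page with no InvoiceId is assumed to continue the previous invoice.

```python
def group_pages_by_invoice(invoice_ids):
    """Group consecutive pages into invoices.

    `invoice_ids[i]` is the InvoiceId found on page i + 1, or None if the
    field was not detected on that page. Returns a list of
    (invoice_id, pages_value) tuples, where `pages_value` is a string that
    can be passed as the `pages` parameter of a follow-up request.
    """
    groups = []  # each entry: [invoice_id, first_page, last_page]
    for page, invoice_id in enumerate(invoice_ids, start=1):
        # Assumption: a page without an InvoiceId belongs to the current invoice
        same_invoice = groups and (invoice_id is None or invoice_id == groups[-1][0])
        if same_invoice:
            groups[-1][2] = page
        else:
            groups.append([invoice_id, page, page])
    return [
        (invoice_id, str(first) if first == last else f"{first}-{last}")
        for invoice_id, first, last in groups
    ]


print(group_pages_by_invoice(["A", "A", None, "B"]))
# → [('A', '1-3'), ('B', '4')]
```

Each resulting page range can then be sent in its own request, so the model picks the highest-confidence values within one invoice only.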