feat: Add an Azure OCR Converter that uses the `azure-ai-documentintelligence` library
The AzureConverter (in Haystack v1) and the AzureOCRConverter (in Haystack v2) use the azure-ai-formrecognizer package. A new package, azure-ai-documentintelligence, was released about 8 months ago and is meant to replace it. We should migrate to the new package since it offers new features and is the one Microsoft will continue to support going forward.
For example, the new package supports returning a file (using the prebuilt-layout model) in Markdown format. See details here. Microsoft added this explicitly to better support passing the OCR output to LLMs.
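A rough sketch of what the Markdown call could look like with the new package. The helper names `make_client` and `analyze_to_markdown` are hypothetical, and I'm assuming the `output_content_format` keyword and the `urlSource` JSON body as I understand the current SDK; the exact signature has shifted between preview releases, so check it against the installed version.

```python
from typing import Any


def make_client(endpoint: str, key: str) -> Any:
    """Build a DocumentIntelligenceClient.

    Requires `pip install azure-ai-documentintelligence`. The imports are
    done lazily so the rest of this sketch can be read without the SDK.
    """
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential

    return DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))


def analyze_to_markdown(client: Any, document_url: str) -> str:
    """Run the prebuilt-layout model and return the content as Markdown."""
    poller = client.begin_analyze_document(
        "prebuilt-layout",
        # JSON request body, passed positionally because the parameter
        # name is an assumption that differs between SDK versions.
        {"urlSource": document_url},
        output_content_format="markdown",
    )
    # The long-running operation resolves to a result whose `content`
    # holds the full document text in the requested format.
    return poller.result().content
```

Taking the client as an argument keeps the function easy to test with a stub and leaves credential handling to the caller.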
Here are other add-on capabilities: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-add-on-capabilities?view=doc-intel-4.0.0&tabs=rest-api#high-resolution-extraction
Using add-on capabilities (e.g. OCR High Resolution) costs more: https://azure.microsoft.com/en-au/pricing/details/ai-document-intelligence/
At the same time, we should bring in the changes that were made to improve the Haystack v1 Azure Converter (completed shortly after the v2 version was ported from Haystack v1). The changes were made here: https://github.com/deepset-ai/deepset-cloud-custom-nodes/pull/267
To make this easier overall, I'd advocate for creating a new component called AzureDocumentIntelligenceConverter that brings in both sets of changes; we can then deprecate the old one.
We could also think about adding this to the azure_ai_search core-integration instead of this being in haystack main.
cc @ju-gu, who requested we bump this because a client needs it
As this component is used quite a lot, it would be great if we could update it so that it uses the new package and can also convert PDF to text with inline tables in CSV format. We found that this works pretty well in RAG applications.
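To illustrate the inline CSV idea, here is a minimal sketch using only the standard library. The `Cell` dataclass and `table_to_csv` are hypothetical stand-ins: I'm assuming each OCR table cell exposes a row index, a column index, and its text, and this is not the actual converter code.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for one table cell returned by the OCR service."""
    row_index: int
    column_index: int
    content: str


def table_to_csv(cells: List[Cell], row_count: int, column_count: int) -> str:
    """Render a table's cells as CSV text that can be placed inline in the document."""
    # Start from an empty grid so missing cells become empty CSV fields.
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell.row_index][cell.column_index] = cell.content

    buffer = io.StringIO()
    # csv.writer quotes fields containing commas, quotes, or newlines.
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()


cells = [
    Cell(0, 0, "Name"),
    Cell(0, 1, "Price"),
    Cell(1, 0, "Widget, large"),
    Cell(1, 1, "4.50"),
]
print(table_to_csv(cells, row_count=2, column_count=2))
```

The resulting CSV string would replace the table's span in the extracted text, so the LLM sees the table at the position where it occurred in the PDF.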
@julian-risch +1 for this, needed for our table processing pipeline for clients
AzureOCRConverter could be moved to Haystack core integrations while we keep ChatGenerator in Haystack core.
Do we have a timeline to update this component? As it is frequently being used, I think it would be cool if we could take care of it soon.