---
title: "AzureDocumentIntelligenceConverter"
id: azuredocumentintelligenceconverter
slug: "/azuredocumentintelligenceconverter"
description: "`AzureDocumentIntelligenceConverter` converts files to Documents using Azure's Document Intelligence service with GitHub Flavored Markdown output for better LLM/RAG integration. It supports PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML."
---

# AzureDocumentIntelligenceConverter

`AzureDocumentIntelligenceConverter` converts files to Documents using Azure's Document Intelligence service with GitHub Flavored Markdown output for better LLM/RAG integration. It supports the following file formats: PDF (both searchable and image-only), JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Mandatory init variables** | `endpoint`: The endpoint URL of your Azure Document Intelligence resource <br /> <br />`api_key`: The API key for Azure authentication. Can be set with the `AZURE_DI_API_KEY` environment variable. |
| **Mandatory run variables** | `sources`: A list of file paths or ByteStream objects |
| **Output variables** | `documents`: A list of documents <br /> <br />`raw_azure_response`: A list of raw responses from Azure |
| **API reference** | [Azure Document Intelligence](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/azure_doc_intelligence) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/azure_doc_intelligence |

</div>

## Overview

`AzureDocumentIntelligenceConverter` takes a list of file paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects as input and uses Azure's Document Intelligence service to convert the files into a list of documents.
Optionally, metadata can be attached to the documents through the `meta` input parameter. You need an active Azure account and a Document Intelligence or Cognitive Services resource to use this integration. Follow the steps described in the Azure [documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api) to set up your resource.

The component reads the `AZURE_DI_API_KEY` environment variable by default. Otherwise, you can pass an `api_key` at initialization, as shown in the code examples below.

This component uses the `azure-ai-documentintelligence` package (v1.0.0+) and outputs GitHub Flavored Markdown, preserving document structure such as headings, tables, and lists. Tables are rendered as inline Markdown tables rather than being extracted as separate documents.

When you initialize the component, you can optionally set `model_id`, which refers to the model you want to use. Available options include:

- `"prebuilt-document"`: General document analysis (default)
- `"prebuilt-read"`: Fast OCR for text extraction
- `"prebuilt-layout"`: Enhanced layout analysis with better table and structure detection
- Custom model IDs from your Azure resource

Refer to the [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature) for a full list of available models.

:::info
This component replaces the legacy [`AzureOCRDocumentConverter`](azureocrdocumentconverter.mdx), which uses the older `azure-ai-formrecognizer` package. `AzureDocumentIntelligenceConverter` uses the newer `azure-ai-documentintelligence` SDK and produces Markdown output instead of plain text, making it better suited for LLM and RAG applications.
:::

:::note
This component returns Markdown content.
Avoid piping it through `DocumentCleaner()` with its default settings, because `remove_extra_whitespaces=True` and `remove_empty_lines=True` can collapse line breaks and flatten headings, tables, and lists. Connect the converter directly to your next component, or disable those options if you need custom cleanup.
:::

## Usage

You need to install the `azure-doc-intelligence-haystack` integration to use `AzureDocumentIntelligenceConverter`:

```shell
pip install azure-doc-intelligence-haystack
```

### On its own

```python
from pathlib import Path

from haystack_integrations.components.converters.azure_doc_intelligence import (
    AzureDocumentIntelligenceConverter,
)
from haystack.utils import Secret

converter = AzureDocumentIntelligenceConverter(
    endpoint="https://YOUR_RESOURCE.cognitiveservices.azure.com/",
    api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
)

result = converter.run(sources=[Path("my_file.pdf")])
documents = result["documents"]
```

### In a pipeline

```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from haystack_integrations.components.converters.azure_doc_intelligence import (
    AzureDocumentIntelligenceConverter,
)

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component(
    "converter",
    AzureDocumentIntelligenceConverter(
        endpoint="https://YOUR_RESOURCE.cognitiveservices.azure.com/",
        api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
    ),
)
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")

file_names = ["my_file.pdf"]
pipeline.run({"converter": {"sources": file_names}})
```
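
Because the converter emits Markdown with headings preserved, a downstream splitter can also work on document structure rather than sentence boundaries. Below is a minimal, dependency-free sketch of this idea: the helper `split_on_h2` is purely illustrative (it is not part of the Azure integration or Haystack), and it chunks a converted document's Markdown content at each level-2 heading.

```python
import re


def split_on_h2(markdown: str) -> list[str]:
    """Split Markdown text into chunks, one per '## ' heading.

    Illustrative helper, not part of the integration: any content
    before the first level-2 heading becomes its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # Start a new chunk at each level-2 heading.
        if re.match(r"^## ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


markdown = "# Title\n\nIntro text.\n\n## Section A\n\nBody A.\n\n## Section B\n\nBody B."
for chunk in split_on_h2(markdown):
    print(repr(chunk))
```

You could apply a function like this to each `doc.content` in the converter's output before writing to a document store, so that each chunk stays aligned with a section of the source file.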