---
title: "AzureDocumentIntelligenceConverter"
id: azuredocumentintelligenceconverter
slug: "/azuredocumentintelligenceconverter"
description: "`AzureDocumentIntelligenceConverter` converts files to Documents using Azure's Document Intelligence service with GitHub Flavored Markdown output for better LLM/RAG integration. It supports PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML."
---

# AzureDocumentIntelligenceConverter

`AzureDocumentIntelligenceConverter` converts files to Documents using Azure's Document Intelligence service with GitHub Flavored Markdown output for better LLM/RAG integration. It supports the following file formats: PDF (both searchable and image-only), JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Mandatory init variables** | `endpoint`: The endpoint URL of your Azure Document Intelligence resource <br /> <br />`api_key`: The API key for Azure authentication. Can be set with the `AZURE_DI_API_KEY` environment variable. |
| **Mandatory run variables** | `sources`: A list of file paths or ByteStream objects |
| **Output variables** | `documents`: A list of documents <br /> <br />`raw_azure_response`: A list of raw responses from Azure |
| **API reference** | [Azure Document Intelligence](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/azure_doc_intelligence) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/azure_doc_intelligence |

</div>

## Overview

`AzureDocumentIntelligenceConverter` takes a list of file paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects as input and uses Azure's Document Intelligence service to convert the files into a list of documents.
Optionally, metadata can be attached to the documents through the `meta` input parameter. You need an active Azure account and a Document Intelligence or Cognitive Services resource to use this integration. Follow the steps described in the Azure [documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api) to set up your resource.

The component reads the `AZURE_DI_API_KEY` environment variable by default. Otherwise, you can pass an `api_key` at initialization, as shown in the code examples below.

This component uses the `azure-ai-documentintelligence` package (v1.0.0+) and outputs GitHub Flavored Markdown, preserving document structure such as headings, tables, and lists. Tables are rendered as inline Markdown tables rather than being extracted as separate documents.

When you initialize the component, you can optionally set `model_id`, which refers to the model you want to use. Available options include:

- `"prebuilt-document"`: General document analysis (default)
- `"prebuilt-read"`: Fast OCR for text extraction
- `"prebuilt-layout"`: Enhanced layout analysis with better table and structure detection
- Custom model IDs from your Azure resource

Refer to the [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature) for a full list of available models.

:::info
This component replaces the legacy [`AzureOCRDocumentConverter`](azureocrdocumentconverter.mdx), which uses the older `azure-ai-formrecognizer` package. `AzureDocumentIntelligenceConverter` uses the newer `azure-ai-documentintelligence` SDK and produces Markdown output instead of plain text, making it better suited for LLM and RAG applications.
:::

:::note
This component returns Markdown content.
Avoid piping it through `DocumentCleaner()` with its default settings, because `remove_extra_whitespaces=True` and `remove_empty_lines=True` can collapse line breaks and flatten headings, tables, and lists. Connect the converter directly to your next component, or disable those options if you need custom cleanup.
:::

## Usage

You need to install the `azure-doc-intelligence-haystack` integration to use `AzureDocumentIntelligenceConverter`:

```shell
pip install azure-doc-intelligence-haystack
```

### On its own

```python
from pathlib import Path

from haystack_integrations.components.converters.azure_doc_intelligence import (
    AzureDocumentIntelligenceConverter,
)
from haystack.utils import Secret

converter = AzureDocumentIntelligenceConverter(
    endpoint="https://YOUR_RESOURCE.cognitiveservices.azure.com/",
    api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
)

result = converter.run(sources=[Path("my_file.pdf")])
documents = result["documents"]
```

### In a pipeline

```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from haystack_integrations.components.converters.azure_doc_intelligence import (
    AzureDocumentIntelligenceConverter,
)

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component(
    "converter",
    AzureDocumentIntelligenceConverter(
        endpoint="https://YOUR_RESOURCE.cognitiveservices.azure.com/",
        api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
    ),
)
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")

file_names = ["my_file.pdf"]
pipeline.run({"converter": {"sources": file_names}})
```
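
Because the converter emits Markdown with headings preserved, a downstream splitter can also work on document structure rather than sentence boundaries. Below is a minimal, dependency-free sketch of this idea: the helper `split_on_h2` is purely illustrative (it is not part of the Azure integration or Haystack), and it chunks a converted document's Markdown content at each level-2 heading.

```python
import re


def split_on_h2(markdown: str) -> list[str]:
    """Split Markdown text into chunks, one per '## ' heading.

    Illustrative helper, not part of the integration: any content
    before the first level-2 heading becomes its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # Start a new chunk at each level-2 heading.
        if re.match(r"^## ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


markdown = "# Title\n\nIntro text.\n\n## Section A\n\nBody A.\n\n## Section B\n\nBody B."
for chunk in split_on_h2(markdown):
    print(repr(chunk))
```

You could apply a function like this to each `doc.content` in the converter's output before writing to a document store, so that each chunk stays aligned with a section of the source file.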