---
title: "Azure Document Intelligence"
id: integrations-azure_doc_intelligence
description: "Azure Document Intelligence integration for Haystack"
slug: "/integrations-azure_doc_intelligence"
---

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter"></a>

## Module haystack\_integrations.components.converters.azure\_doc\_intelligence.converter

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter"></a>

### AzureDocumentIntelligenceConverter

Converts files to Documents using Azure's Document Intelligence service.

This component uses the azure-ai-documentintelligence package (v1.0.0+) and outputs
GitHub Flavored Markdown for better integration with LLM/RAG applications.

Supported file formats: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, HTML.

Key features:
- Markdown output with preserved structure (headings, tables, lists)
- Inline table integration (tables rendered as markdown tables)
- Improved layout analysis and reading order
- Support for section headings

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For setup instructions, see
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

```python
import os
from haystack_integrations.components.converters.azure_doc_intelligence import (
    AzureDocumentIntelligenceConverter,
)
from haystack.utils import Secret

converter = AzureDocumentIntelligenceConverter(
    endpoint=os.environ["AZURE_DI_ENDPOINT"],
    api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
)

results = converter.run(sources=["invoice.pdf", "contract.docx"])
documents = results["documents"]

# Documents contain markdown with inline tables
print(documents[0].content)
```

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter.__init__"></a>

#### AzureDocumentIntelligenceConverter.\_\_init\_\_

```python
def __init__(endpoint: str,
             *,
             api_key: Secret = Secret.from_env_var("AZURE_DI_API_KEY"),
             model_id: str = "prebuilt-document",
             store_full_path: bool = False)
```

Creates an AzureDocumentIntelligenceConverter component.

**Arguments**:

- `endpoint`: The endpoint URL of your Azure Document Intelligence resource.
Example: "https://YOUR_RESOURCE.cognitiveservices.azure.com/"
- `api_key`: API key for Azure authentication. By default, it is loaded with `Secret.from_env_var()`
from the `AZURE_DI_API_KEY` environment variable.
- `model_id`: Azure model to use for analysis. Options:
  - "prebuilt-document": General document analysis (default)
  - "prebuilt-read": Fast OCR for text extraction
  - "prebuilt-layout": Enhanced layout analysis with better table/structure detection
  - Custom model IDs from your Azure resource
- `store_full_path`: If True, stores the complete file path in the metadata.
If False, stores only the filename (default).
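
As a rough illustration of what `store_full_path` controls, the sketch below mimics the filename-versus-full-path choice in plain Python. It is not the converter's code: the helper name `file_path_for_meta` is hypothetical, and the idea that the value ends up under a `file_path` metadata key is an assumption that this reference does not state.

```python
from pathlib import PurePosixPath


def file_path_for_meta(path: str, store_full_path: bool = False) -> str:
    """Return the value a converter might record for a source path.

    Hypothetical helper for illustration only; not part of the integration.
    """
    if store_full_path:
        # Keep the path exactly as it was passed in.
        return path
    # Default behaviour described above: keep only the filename.
    return PurePosixPath(path).name


print(file_path_for_meta("/data/in/invoice.pdf"))        # invoice.pdf
print(file_path_for_meta("/data/in/invoice.pdf", True))  # /data/in/invoice.pdf
```

With the real component, the same choice is made once, at construction time, by passing `store_full_path=True` or leaving the default.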

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter.warm_up"></a>

#### AzureDocumentIntelligenceConverter.warm\_up

```python
def warm_up()
```

Initializes the Azure Document Intelligence client.

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter.run"></a>

#### AzureDocumentIntelligenceConverter.run

```python
@component.output_types(documents=list[Document],
                        raw_azure_response=list[dict])
def run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document] | list[dict]]
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will be
zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: List of created Documents
- `raw_azure_response`: List of raw Azure responses used to create the Documents
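
One way to inspect the returned dictionary is sketched below. The helper `summarize_results` is hypothetical, and the example runs against stand-in objects rather than real Haystack `Document` instances, so no Azure call is needed. It assumes only that each document exposes `content` and `meta`, and that the source file is recorded under a `file_path` metadata key, which this reference does not confirm.

```python
from types import SimpleNamespace
from typing import Any


def summarize_results(results: dict[str, Any]) -> dict[str, Any]:
    """Build a small summary of a `run()` result (hypothetical helper)."""
    documents = results["documents"]
    raw_responses = results["raw_azure_response"]
    return {
        "document_count": len(documents),
        "raw_response_count": len(raw_responses),
        "documents": [
            {
                # `file_path` is an assumed metadata key; adjust as needed.
                "file_path": doc.meta.get("file_path"),
                "characters": len(doc.content or ""),
            }
            for doc in documents
        ],
    }


# Stand-in for the dictionary returned by `converter.run(...)`.
fake_results = {
    "documents": [
        SimpleNamespace(
            content="# Invoice\n\n| Item | Price |",
            meta={"file_path": "invoice.pdf"},
        )
    ],
    "raw_azure_response": [{}],
}

print(summarize_results(fake_results))
```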

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter.to_dict"></a>

#### AzureDocumentIntelligenceConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter.from_dict"></a>

#### AzureDocumentIntelligenceConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str,
                              Any]) -> "AzureDocumentIntelligenceConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.