---
title: "Azure Document Intelligence"
id: integrations-azure_doc_intelligence
description: "Azure Document Intelligence integration for Haystack"
slug: "/integrations-azure_doc_intelligence"
---

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter"></a>

## Module haystack\_integrations.components.converters.azure\_doc\_intelligence.converter

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter"></a>

### AzureDocumentIntelligenceConverter

Converts files to Documents using Azure's Document Intelligence service.

This component uses the azure-ai-documentintelligence package (v1.0.0+) and outputs
GitHub Flavored Markdown for better integration with LLM/RAG applications.

Supported file formats: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, HTML.

Key features:
- Markdown output with preserved structure (headings, tables, lists)
- Inline table integration (tables rendered as markdown tables)
- Improved layout analysis and reading order
- Support for section headings

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For setup instructions, see
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

```python
import os
from haystack_integrations.components.converters.azure_doc_intelligence import (
    AzureDocumentIntelligenceConverter,
)
from haystack.utils import Secret

converter = AzureDocumentIntelligenceConverter(
    endpoint=os.environ["AZURE_DI_ENDPOINT"],
    api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
)

results = converter.run(sources=["invoice.pdf", "contract.docx"])
documents = results["documents"]

# Documents contain markdown with inline tables
print(documents[0].content)
```

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter.__init__"></a>

#### AzureDocumentIntelligenceConverter.\_\_init\_\_

```python
def __init__(endpoint: str,
             *,
             api_key: Secret = Secret.from_env_var("AZURE_DI_API_KEY"),
             model_id: str = "prebuilt-document",
             store_full_path: bool = False)
```

Creates an AzureDocumentIntelligenceConverter component.

**Arguments**:

- `endpoint`: The endpoint URL of your Azure Document Intelligence resource.
Example: "https://YOUR_RESOURCE.cognitiveservices.azure.com/"
- `api_key`: API key for Azure authentication. Can use Secret.from_env_var()
to load from AZURE_DI_API_KEY environment variable.
- `model_id`: Azure model to use for analysis. Options:
  - "prebuilt-document": General document analysis (default)
  - "prebuilt-read": Fast OCR for text extraction
  - "prebuilt-layout": Enhanced layout analysis with better table/structure detection
  - Custom model IDs from your Azure resource
- `store_full_path`: If True, stores complete file path in metadata.
If False, stores only the filename (default).
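
The constructor arguments above could be assembled as in the following sketch. The helper names `build_converter_kwargs` and `make_converter`, the HTTPS check, and the choice of `prebuilt-layout` are assumptions made for illustration and are not part of the integration; only the parameter names come from the documented signature. The converter import sits inside `make_converter` so that the rest of the block runs without the package installed.

```python
from typing import Any


def build_converter_kwargs(
    endpoint: str,
    model_id: str = "prebuilt-document",
    store_full_path: bool = False,
) -> dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for the converter."""
    # Assumption: this sketch insists on HTTPS; the integration may not enforce it.
    if not endpoint.startswith("https://"):
        raise ValueError(f"expected an https:// endpoint, got {endpoint!r}")
    return {
        "endpoint": endpoint,
        "model_id": model_id,
        "store_full_path": store_full_path,
    }


def make_converter(endpoint: str, **options: Any):
    """Hypothetical helper: build the converter, leaving api_key at its documented default."""
    # Imported lazily so build_converter_kwargs works without the integration installed.
    from haystack_integrations.components.converters.azure_doc_intelligence import (
        AzureDocumentIntelligenceConverter,
    )

    return AzureDocumentIntelligenceConverter(**build_converter_kwargs(endpoint, **options))


# Pick the layout model for documents with many tables (an illustrative choice).
kwargs = build_converter_kwargs(
    "https://YOUR_RESOURCE.cognitiveservices.azure.com/",
    model_id="prebuilt-layout",
)
print(kwargs["model_id"])
```

Calling `make_converter(endpoint, model_id="prebuilt-read")` would then construct the component itself, provided the integration package is installed.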

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter.warm_up"></a>

#### AzureDocumentIntelligenceConverter.warm\_up

```python
def warm_up()
```

Initializes the Azure Document Intelligence client.

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter.run"></a>

#### AzureDocumentIntelligenceConverter.run

```python
@component.output_types(documents=list[Document],
                        raw_azure_response=list[dict])
def run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document] | list[dict]]
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will be
zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: List of created Documents
- `raw_azure_response`: List of raw Azure responses used to create the Documents

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter.to_dict"></a>

#### AzureDocumentIntelligenceConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.
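
The `meta` rules described for `run` can be illustrated with a small stand-alone sketch. `normalize_meta` is a hypothetical helper written for this example, not the integration's own code; it only mirrors the documented behavior: a single dictionary is applied to every source, and a list must have one entry per source because the two lists are zipped.

```python
from typing import Any, Union

Meta = Union[dict[str, Any], list[dict[str, Any]], None]


def normalize_meta(meta: Meta, sources_count: int) -> list[dict[str, Any]]:
    """Hypothetical helper: expand `meta` into one dictionary per source."""
    if meta is None:
        # No metadata supplied: every source gets an empty dictionary.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # A list is zipped with the sources, so the lengths must match.
        raise ValueError(
            f"meta has {len(meta)} entries but there are {sources_count} sources"
        )
    return [dict(entry) for entry in meta]


sources = ["invoice.pdf", "contract.docx"]
shared = normalize_meta({"department": "finance"}, len(sources))
per_file = normalize_meta([{"type": "invoice"}, {"type": "contract"}], len(sources))

# Pair each source with its own metadata, as zipping the two lists would.
for source, source_meta in zip(sources, per_file):
    print(source, source_meta)
```

With the real component, the same two shapes would be passed directly as the `meta` argument of `run`, for example `converter.run(sources=sources, meta={"department": "finance"})`.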

<a id="haystack_integrations.components.converters.azure_doc_intelligence.converter.AzureDocumentIntelligenceConverter.from_dict"></a>

#### AzureDocumentIntelligenceConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AzureDocumentIntelligenceConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.
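
As a rough sketch of how `to_dict` and `from_dict` fit together, the snippet below round-trips a component through its dictionary form. `round_trip` and `demo` are hypothetical names introduced for this example. `demo` is defined but not called, because it needs the integration package installed; it also assumes the default `api_key` (an environment-variable `Secret`) is serializable, which is how Haystack secrets of that kind usually behave.

```python
from typing import Any


def round_trip(component: Any) -> Any:
    """Hypothetical helper: serialize with to_dict() and rebuild with from_dict()."""
    data = component.to_dict()
    # from_dict is a classmethod, so call it on the component's own class.
    return type(component).from_dict(data)


def demo() -> None:
    """Not called here: requires the integration package to be installed."""
    from haystack_integrations.components.converters.azure_doc_intelligence import (
        AzureDocumentIntelligenceConverter,
    )

    converter = AzureDocumentIntelligenceConverter(
        endpoint="https://YOUR_RESOURCE.cognitiveservices.azure.com/",
        model_id="prebuilt-layout",
    )
    restored = round_trip(converter)
    # Compare the serialized forms of the original and the restored component.
    print(restored.to_dict() == converter.to_dict())
```

The same dictionary form is what allows the component to be saved and reloaded as part of a serialized pipeline definition.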