---
title: "MistralOCRDocumentConverter"
id: mistralocrdocumentconverter
slug: "/mistralocrdocumentconverter"
description: "`MistralOCRDocumentConverter` extracts text from documents using Mistral's OCR API, with optional structured annotations for both individual image regions and full documents. It supports various input formats including local files, URLs, and Mistral file IDs."
---

# MistralOCRDocumentConverter

`MistralOCRDocumentConverter` extracts text from documents using Mistral's OCR API, with optional structured annotations for both individual image regions and full documents. It supports various input formats including local files, URLs, and Mistral file IDs.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Mandatory init variables** | `api_key`: The Mistral API key. Can be set with the `MISTRAL_API_KEY` environment variable. |
| **Mandatory run variables** | `sources`: A list of document sources (file paths, ByteStreams, URLs, or Mistral chunks) |
| **Output variables** | `documents`: A list of documents <br /> <br />`raw_mistral_response`: A list of raw OCR responses from the Mistral API |
| **API reference** | [Mistral](/reference/integrations-mistral) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/mistral |

</div>

## Overview

The `MistralOCRDocumentConverter` takes a list of document sources and uses Mistral's OCR API to extract text from images and PDFs.
It supports multiple input formats:

- **Local files**: File paths (`str` or `Path`) or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects
- **Remote resources**: Document URLs and image URLs, passed using Mistral's `DocumentURLChunk` and `ImageURLChunk`
- **Mistral storage**: File IDs, passed using Mistral's `FileChunk` for files previously uploaded to Mistral

The component returns one Haystack [`Document`](../../concepts/data-classes.mdx#document) per source, with all pages concatenated using form feed characters (`\f`) as separators. This format ensures compatibility with Haystack's [`DocumentSplitter`](../preprocessors/documentsplitter.mdx) for accurate page-wise splitting and overlap handling. The content is returned in markdown format, with images represented as image tags.

By default, the component uses the `MISTRAL_API_KEY` environment variable for authentication. You can also pass an `api_key` at initialization. Local files are automatically uploaded to Mistral's storage for processing and deleted afterward (configurable with `cleanup_uploaded_files`).

When you initialize the component, you can optionally specify which pages to process, set limits on image extraction, configure minimum image sizes, or include base64-encoded images in the response. The default model is `"mistral-ocr-2505"`. See the [Mistral models documentation](https://docs.mistral.ai/getting-started/models/models_overview/) for available models.

### Structured Annotations

A unique feature of `MistralOCRDocumentConverter` is its support for structured annotations defined with Pydantic schemas:

- **Bounding box annotations** (`bbox_annotation_schema`): Annotate individual image regions with structured data (for example, image type, description, summary). These annotations are inserted inline after the corresponding image tags in the markdown content.
- **Document annotations** (`document_annotation_schema`): Annotate the full document with structured data (for example, language, chapter titles, URLs). These annotations are unpacked into the document's metadata with a `source_` prefix (for example, `source_language`, `source_chapter_titles`).

When annotation schemas are provided, the OCR model first extracts text and structure, then a Vision LLM analyzes the content and generates structured annotations according to your defined Pydantic schemas. Note that document annotation is limited to a maximum of 8 pages. For more details, see the [Mistral documentation on annotations](https://docs.mistral.ai/capabilities/document_ai/annotations/).

:::note
This component returns markdown content. Avoid piping it through `DocumentCleaner` with its default settings, because `remove_extra_whitespaces=True` and `remove_empty_lines=True` can collapse line breaks and flatten headings, tables, and image tags. For page-aware chunking, connect the converter directly to `DocumentSplitter`, or disable those options if you need custom cleanup.
:::

## Usage

You need to install the `mistral-haystack` integration to use `MistralOCRDocumentConverter`:

```shell
pip install mistral-haystack
```

### On its own

Basic usage with a local file:

```python
from pathlib import Path
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import (
    MistralOCRDocumentConverter,
)

converter = MistralOCRDocumentConverter(
    api_key=Secret.from_env_var("MISTRAL_API_KEY"),
    model="mistral-ocr-2505",
)

result = converter.run(sources=[Path("my_document.pdf")])
documents = result["documents"]
```

Processing multiple sources with different types:

```python
from pathlib import Path
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import (
    MistralOCRDocumentConverter,
)
from mistralai.models import DocumentURLChunk, ImageURLChunk

converter = MistralOCRDocumentConverter(
    api_key=Secret.from_env_var("MISTRAL_API_KEY"),
    model="mistral-ocr-2505",
)

sources = [
    Path("local_document.pdf"),
    DocumentURLChunk(document_url="https://example.com/document.pdf"),
    ImageURLChunk(image_url="https://example.com/receipt.jpg"),
]

result = converter.run(sources=sources)
documents = result["documents"]  # List of 3 Documents
raw_responses = result["raw_mistral_response"]  # List of 3 raw responses
```

Using structured annotations:

```python
from pathlib import Path
from typing import List
from pydantic import BaseModel, Field
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import (
    MistralOCRDocumentConverter,
)
from mistralai.models import DocumentURLChunk


# Define schema for image region annotations
class ImageAnnotation(BaseModel):
    image_type: str = Field(..., description="The type of image content")
    short_description: str = Field(
        ...,
        description="Short natural-language description",
    )
    summary: str = Field(..., description="Detailed summary of the image content")


# Define schema for document-level annotations
class DocumentAnnotation(BaseModel):
    language: str = Field(..., description="Primary language of the document")
    chapter_titles: List[str] = Field(
        ...,
        description="Detected chapter or section titles",
    )
    urls: List[str] = Field(..., description="URLs found in the text")


converter = MistralOCRDocumentConverter(
    api_key=Secret.from_env_var("MISTRAL_API_KEY"),
    model="mistral-ocr-2505",
)

sources = [DocumentURLChunk(document_url="https://example.com/report.pdf")]
result = converter.run(
    sources=sources,
    bbox_annotation_schema=ImageAnnotation,
    document_annotation_schema=DocumentAnnotation,
)

documents = result["documents"]
# Document metadata will include:
# - source_language: extracted from DocumentAnnotation
# - source_chapter_titles: extracted from DocumentAnnotation
# - source_urls: extracted from DocumentAnnotation
# Document content will include inline image annotations
```

### In a pipeline

Here's an example of an indexing pipeline that processes PDFs with OCR and writes them to a Document Store:

```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import (
    MistralOCRDocumentConverter,
)

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component(
    "converter",
    MistralOCRDocumentConverter(
        api_key=Secret.from_env_var("MISTRAL_API_KEY"),
        model="mistral-ocr-2505",
    ),
)
pipeline.add_component("splitter", DocumentSplitter(split_by="page", split_length=1))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")

file_paths = ["invoice.pdf", "receipt.jpg", "contract.pdf"]
pipeline.run({"converter": {"sources": file_paths}})
```
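Because the converter joins pages with form feed characters, the `DocumentSplitter` in the pipeline above can recover individual pages by splitting on `\f`. A minimal illustration of the convention (the markdown content below is invented, standing in for real OCR output):

```python
# Pages in a converted Document are separated by form feed characters ("\f");
# DocumentSplitter(split_by="page") relies on exactly this separator.
content = "# Page one\nIntro text.\f# Page two\nA table.\f# Page three\nSummary."

pages = content.split("\f")
print(len(pages))  # 3
print(pages[1])    # "# Page two\nA table."
```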