---
title: "MistralOCRDocumentConverter"
id: mistralocrdocumentconverter
slug: "/mistralocrdocumentconverter"
description: "`MistralOCRDocumentConverter` extracts text from documents using Mistral's OCR API, with optional structured annotations for both individual image regions and full documents. It supports various input formats including local files, URLs, and Mistral file IDs."
---

# MistralOCRDocumentConverter

`MistralOCRDocumentConverter` extracts text from documents using Mistral's OCR API, with optional structured annotations for both individual image regions and full documents. It supports various input formats including local files, URLs, and Mistral file IDs.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Mandatory init variables** | `api_key`: The Mistral API key. Can be set with the `MISTRAL_API_KEY` environment variable. |
| **Mandatory run variables** | `sources`: A list of document sources (file paths, ByteStreams, URLs, or Mistral chunks) |
| **Output variables** | `documents`: A list of documents <br /> <br />`raw_mistral_response`: A list of raw OCR responses from the Mistral API |
| **API reference** | [Mistral](/reference/integrations-mistral) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/mistral |

</div>

## Overview

The `MistralOCRDocumentConverter` takes a list of document sources and uses Mistral's OCR API to extract text from images and PDFs.
It supports multiple input formats:

- **Local files**: File paths (`str` or `Path`) or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects
- **Remote resources**: Document URLs and image URLs, passed using Mistral's `DocumentURLChunk` and `ImageURLChunk`
- **Mistral storage**: File IDs, passed using Mistral's `FileChunk` for files previously uploaded to Mistral

The component returns one Haystack [`Document`](../../concepts/data-classes.mdx#document) per source, with all pages concatenated using form feed characters (`\f`) as separators. This format ensures compatibility with Haystack's [`DocumentSplitter`](../preprocessors/documentsplitter.mdx) for accurate page-wise splitting and overlap handling. The content is returned in markdown format, with images represented as image tags.

By default, the component uses the `MISTRAL_API_KEY` environment variable for authentication. You can also pass an `api_key` at initialization. Local files are automatically uploaded to Mistral's storage for processing and deleted afterward (configurable with `cleanup_uploaded_files`).

When you initialize the component, you can optionally specify which pages to process, set limits on image extraction, configure minimum image sizes, or include base64-encoded images in the response. The default model is `"mistral-ocr-2505"`. See the [Mistral models documentation](https://docs.mistral.ai/getting-started/models/models_overview/) for available models.

### Structured Annotations

A unique feature of `MistralOCRDocumentConverter` is its support for structured annotations defined with Pydantic schemas:

- **Bounding box annotations** (`bbox_annotation_schema`): Annotate individual image regions with structured data (for example, image type, description, summary). These annotations are inserted inline after the corresponding image tags in the markdown content.
- **Document annotations** (`document_annotation_schema`): Annotate the full document with structured data (for example, language, chapter titles, URLs). These annotations are unpacked into the document's metadata with a `source_` prefix (for example, `source_language`, `source_chapter_titles`).

When annotation schemas are provided, the OCR model first extracts text and structure, then a Vision LLM analyzes the content and generates structured annotations according to your defined Pydantic schemas. Note that document annotation is limited to a maximum of 8 pages. For more details, see the [Mistral documentation on annotations](https://docs.mistral.ai/capabilities/document_ai/annotations/).

:::note
This component returns markdown content. Avoid piping it through `DocumentCleaner` with its default settings, because `remove_extra_whitespaces=True` and `remove_empty_lines=True` can collapse line breaks and flatten headings, tables, and image tags. For page-aware chunking, connect the converter directly to `DocumentSplitter`, or disable those options if you need custom cleanup.
:::

## Usage

You need to install the `mistral-haystack` integration to use `MistralOCRDocumentConverter`:

```shell
pip install mistral-haystack
```

### On its own

Basic usage with a local file:

```python
from pathlib import Path
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import (
    MistralOCRDocumentConverter,
)

converter = MistralOCRDocumentConverter(
    api_key=Secret.from_env_var("MISTRAL_API_KEY"),
    model="mistral-ocr-2505",
)

result = converter.run(sources=[Path("my_document.pdf")])
documents = result["documents"]
```

Processing multiple sources with different types:

```python
from pathlib import Path
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import (
    MistralOCRDocumentConverter,
)
from mistralai.models import DocumentURLChunk, ImageURLChunk

converter = MistralOCRDocumentConverter(
    api_key=Secret.from_env_var("MISTRAL_API_KEY"),
    model="mistral-ocr-2505",
)

sources = [
    Path("local_document.pdf"),
    DocumentURLChunk(document_url="https://example.com/document.pdf"),
    ImageURLChunk(image_url="https://example.com/receipt.jpg"),
]

result = converter.run(sources=sources)
documents = result["documents"]  # List of 3 Documents
raw_responses = result["raw_mistral_response"]  # List of 3 raw responses
```

Using structured annotations:

```python
from pathlib import Path
from typing import List
from pydantic import BaseModel, Field
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import (
    MistralOCRDocumentConverter,
)
from mistralai.models import DocumentURLChunk


# Define schema for image region annotations
class ImageAnnotation(BaseModel):
    image_type: str = Field(..., description="The type of image content")
    short_description: str = Field(
        ...,
        description="Short natural-language description",
    )
    summary: str = Field(..., description="Detailed summary of the image content")


# Define schema for document-level annotations
class DocumentAnnotation(BaseModel):
    language: str = Field(..., description="Primary language of the document")
    chapter_titles: List[str] = Field(
        ...,
        description="Detected chapter or section titles",
    )
    urls: List[str] = Field(..., description="URLs found in the text")


converter = MistralOCRDocumentConverter(
    api_key=Secret.from_env_var("MISTRAL_API_KEY"),
    model="mistral-ocr-2505",
)

sources = [DocumentURLChunk(document_url="https://example.com/report.pdf")]
result = converter.run(
    sources=sources,
    bbox_annotation_schema=ImageAnnotation,
    document_annotation_schema=DocumentAnnotation,
)

documents = result["documents"]
# Document metadata will include:
# - source_language: extracted from DocumentAnnotation
# - source_chapter_titles: extracted from DocumentAnnotation
# - source_urls: extracted from DocumentAnnotation
# Document content will include inline image annotations
```

### In a pipeline

Here's an example of an indexing pipeline that processes PDFs with OCR and writes them to a Document Store:

```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from haystack_integrations.components.converters.mistral import (
    MistralOCRDocumentConverter,
)

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component(
    "converter",
    MistralOCRDocumentConverter(
        api_key=Secret.from_env_var("MISTRAL_API_KEY"),
        model="mistral-ocr-2505",
    ),
)
pipeline.add_component("splitter", DocumentSplitter(split_by="page", split_length=1))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")

file_paths = ["invoice.pdf", "receipt.jpg", "contract.pdf"]
pipeline.run({"converter": {"sources": file_paths}})
```
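Because the converter joins pages with form feed characters, the `DocumentSplitter` in the pipeline above can recover individual pages by splitting on `\f`. A minimal illustration of the convention (the markdown content below is invented, standing in for real OCR output):

```python
# Pages in a converted Document are separated by form feed characters ("\f");
# DocumentSplitter(split_by="page") relies on exactly this separator.
content = "# Page one\nIntro text.\f# Page two\nA table.\f# Page three\nSummary."

pages = content.split("\f")
print(len(pages))  # 3
print(pages[1])    # "# Page two\nA table."
```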