---
title: "Converters"
id: converters-api
description: "Various converters to transform data from one format to another."
slug: "/converters-api"
---

## azure

### AzureOCRDocumentConverter

Converts files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

```python
import os
from datetime import datetime
from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(
    endpoint=os.environ["CORE_AZURE_CS_ENDPOINT"],
    api_key=Secret.from_env_var("CORE_AZURE_CS_API_KEY"),
)
results = converter.run(
    sources=["test/test_files/pdf/react_paper.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    endpoint: str,
    api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
    model_id: str = "prebuilt-read",
    preceding_context_len: int = 3,
    following_context_len: int = 3,
    merge_multiple_column_headers: bool = True,
    page_layout: Literal["natural", "single_column"] = "natural",
    threshold_y: float | None = 0.05,
    store_full_path: bool = False,
) -> None
```

Creates an AzureOCRDocumentConverter component.

**Parameters:**

- **endpoint** (<code>str</code>) – The endpoint of your Azure resource.
- **api_key** (<code>Secret</code>) – The API key of your Azure resource.
- **model_id** (<code>str</code>) – The ID of the model you want to use. For a list of available models, see the
  [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
- **preceding_context_len** (<code>int</code>) – Number of lines before a table to include as preceding context
  (this will be added to the metadata).
- **following_context_len** (<code>int</code>) – Number of lines after a table to include as subsequent context
  (this will be added to the metadata).
- **merge_multiple_column_headers** (<code>bool</code>) – If `True`, merges multiple column header rows into a single row.
- **page_layout** (<code>Literal['natural', 'single_column']</code>) – The type of reading order to follow. Possible options:
  - `natural`: Uses the natural reading order determined by Azure.
  - `single_column`: Groups all lines with the same height on the page based on a threshold
    determined by `threshold_y`.
- **threshold_y** (<code>float | None</code>) – Only relevant if `page_layout` is set to `single_column`.
  The threshold, in inches, to determine if two recognized PDF elements are grouped into a
  single line. This is crucial for section headers or numbers which may be spatially separated
  from the remaining text on the horizontal axis.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will be
  zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: List of created Documents
    - `raw_azure_response`: List of raw Azure responses used to create the Documents

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOCRDocumentConverter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>AzureOCRDocumentConverter</code> – The deserialized component.

## csv

### CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime
from haystack.components.converters.csv import CSVToDocument

converter = CSVToDocument()
results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'col1,col2\nrow1,row1\nrow2,row2\n'
```

#### __init__

```python
__init__(
    encoding: str = "utf-8",
    store_full_path: bool = False,
    *,
    conversion_mode: Literal["file", "row"] = "file",
    delimiter: str = ",",
    quotechar: str = '"'
) -> None
```

Creates a CSVToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the CSV files to convert.
  If the encoding is specified in the metadata of a source ByteStream,
  it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
- **conversion_mode** (<code>Literal['file', 'row']</code>) –
  - "file" (default): one Document per CSV file whose content is the raw CSV text.
  - "row": convert each CSV row to its own Document (requires `content_column` in `run()`).
- **delimiter** (<code>str</code>) – CSV delimiter used when parsing in row mode (passed to `csv.DictReader`).
- **quotechar** (<code>str</code>) – CSV quote character used when parsing in row mode (passed to `csv.DictReader`).

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    content_column: str | None = None,
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, Any]
```

Converts CSV files to a Document (file mode) or to one Document per row (row mode).
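In row mode, each file is parsed with `csv.DictReader` and each row's Document content comes from the column named by `content_column`. A minimal, stdlib-only sketch of that behavior (the sample CSV and column names here are illustrative assumptions, not the component's actual code):

```python
import csv
import io

# Hedged sketch of "row" conversion mode: each CSV row becomes one
# document whose content is taken from content_column ("body" here).
raw_csv = "title,body\nFirst,Hello\nSecond,World\n"
reader = csv.DictReader(io.StringIO(raw_csv), delimiter=",", quotechar='"')
rows = list(reader)

content_column = "body"
contents = [row[content_column] for row in rows]
print(contents)
# ['Hello', 'World']

# The remaining columns would typically end up in the document metadata.
metas = [{k: v for k, v in row.items() if k != content_column} for row in rows]
print(metas)
# [{'title': 'First'}, {'title': 'Second'}]
```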

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **content_column** (<code>str | None</code>) – **Required when** `conversion_mode="row"`.
  The column name whose values become `Document.content` for each row.
  The column must exist in the CSV header.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created documents

## docx

### DOCXMetadata

Describes the metadata of a DOCX file.

**Parameters:**

- **author** (<code>str</code>) – The author
- **category** (<code>str</code>) – The category
- **comments** (<code>str</code>) – The comments
- **content_status** (<code>str</code>) – The content status
- **created** (<code>str | None</code>) – The creation date (ISO formatted string)
- **identifier** (<code>str</code>) – The identifier
- **keywords** (<code>str</code>) – Available keywords
- **language** (<code>str</code>) – The language of the document
- **last_modified_by** (<code>str</code>) – User who last modified the document
- **last_printed** (<code>str | None</code>) – The last printed date (ISO formatted string)
- **modified** (<code>str | None</code>) – The last modification date (ISO formatted string)
- **revision** (<code>int</code>) – The revision number
- **subject** (<code>str</code>) – The subject
- **title** (<code>str</code>) – The title
- **version** (<code>str</code>) – The version

### DOCXTableFormat

Bases: <code>Enum</code>

Supported formats for storing DOCX tabular data in a Document.

#### from_str

```python
from_str(string: str) -> DOCXTableFormat
```

Convert a string to a DOCXTableFormat enum.

### DOCXLinkFormat

Bases: <code>Enum</code>

Supported formats for storing DOCX link information in a Document.

#### from_str

```python
from_str(string: str) -> DOCXLinkFormat
```

Convert a string to a DOCXLinkFormat enum.

### DOCXToDocument

Converts DOCX files to Documents.

Uses the `python-docx` library to convert the DOCX file to a document.
This component does not preserve page breaks in the original document.

Usage example:

```python
from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'
```

#### __init__

```python
__init__(
    table_format: str | DOCXTableFormat = DOCXTableFormat.CSV,
    link_format: str | DOCXLinkFormat = DOCXLinkFormat.NONE,
    store_full_path: bool = False,
) -> None
```

Create a DOCXToDocument component.

**Parameters:**

- **table_format** (<code>str | DOCXTableFormat</code>) – The format for table output. Can be either DOCXTableFormat.MARKDOWN,
  DOCXTableFormat.CSV, "markdown", or "csv".
- **link_format** (<code>str | DOCXLinkFormat</code>) – The format for link output. Can be either:
  DOCXLinkFormat.MARKDOWN or "markdown" to get `[text](address)`,
  DOCXLinkFormat.PLAIN or "plain" to get `text (address)`,
  DOCXLinkFormat.NONE or "none" to get text without links.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DOCXToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>DOCXToDocument</code> – The deserialized component.
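The `to_dict`/`from_dict` pair above round-trips the component's configuration. A stdlib-only sketch of that pattern (the `type`/`init_parameters` field names are assumptions for illustration, not the exact output of `DOCXToDocument.to_dict`):

```python
# Hedged sketch of the to_dict/from_dict round trip used by
# serializable components; key names here are illustrative.
def to_dict(component_type: str, init_parameters: dict) -> dict:
    return {"type": component_type, "init_parameters": dict(init_parameters)}

def from_dict(data: dict) -> dict:
    # A real from_dict re-instantiates the component class;
    # here we just recover the constructor arguments.
    return dict(data["init_parameters"])

data = to_dict("DOCXToDocument", {"table_format": "csv", "store_full_path": False})
params = from_dict(data)
print(params["table_format"])
# csv
```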

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts DOCX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created Documents

## file_to_file_content

### FileToFileContent

Converts files to FileContent objects to be included in ChatMessage objects.

### Usage example

```python
from haystack.components.converters import FileToFileContent

converter = FileToFileContent()

sources = ["document.pdf", "video.mp4"]

file_contents = converter.run(sources=sources)["file_contents"]
print(file_contents)

# [FileContent(base64_data='...',
#              mime_type='application/pdf',
#              filename='document.pdf',
#              extra={}),
#  ...]
```

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    extra: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[FileContent]]
```

Converts files to FileContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **extra** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional extra information to attach to the FileContent
  objects. Can be used to store provider-specific information.
  To avoid serialization issues, values should be JSON serializable.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the extra of all produced FileContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.

**Returns:**

- <code>dict\[str, list\[FileContent\]\]</code> – A dictionary with the following keys:
    - `file_contents`: A list of FileContent objects.

## html

### HTMLToDocument

Converts an HTML file to a Document.

Usage example:

```python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'
```

#### __init__

```python
__init__(
    extraction_kwargs: dict[str, Any] | None = None,
    store_full_path: bool = False,
) -> None
```

Create an HTMLToDocument component.

**Parameters:**

- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary containing keyword arguments to customize the extraction process. These
  are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see
  the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HTMLToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>HTMLToDocument</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    extraction_kwargs: dict[str, Any] | None = None,
) -> dict[str, Any]
```

Converts a list of HTML files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of HTML file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to customize the extraction process.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created Documents

## image/document_to_image

### DocumentToImageContent

Converts documents sourced from PDF and image files into ImageContents.

This component processes a list of documents and extracts visual content from supported file formats, converting
them into ImageContents that can be used for multimodal AI tasks. It handles both direct image files and PDF
documents by extracting specific pages as images.

Documents are expected to have metadata containing:

- The `file_path_meta_field` key with a valid file path that exists when combined with `root_path`
- A supported image format (MIME type must be one of the supported image types)
- For PDF files, a `page_number` key specifying which page to extract

### Usage example

```python
from haystack import Document
from haystack.components.converters.image.document_to_image import DocumentToImageContent

converter = DocumentToImageContent(
    file_path_meta_field="file_path",
    root_path="/data/files",
    detail="high",
    size=(800, 600)
)

documents = [
    Document(content="Optional description of image.jpg", meta={"file_path": "image.jpg"}),
    Document(content="Text content of page 1 of doc.pdf", meta={"file_path": "doc.pdf", "page_number": 1})
]

result = converter.run(documents)
image_contents = result["image_contents"]
# [ImageContent(
#      base64_image='/9j/4A...', mime_type='image/jpeg', detail='high', meta={'file_path': 'image.jpg'}
#  ),
#  ImageContent(
#      base64_image='/9j/4A...', mime_type='image/jpeg', detail='high',
#      meta={'page_number': 1, 'file_path': 'doc.pdf'}
#  )]
```

#### __init__

```python
__init__(
    *,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> None
```

Initialize the DocumentToImageContent component.

**Parameters:**

- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(documents: list[Document]) -> dict[str, list[ImageContent | None]]
```

Convert documents with image or PDF sources into ImageContent objects.

This method processes the input documents, extracting images from supported file formats and converting them
into ImageContent objects.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to process. Each document should have metadata containing at minimum
  the `file_path_meta_field` key. PDF documents additionally require a `page_number` key to specify which
  page to convert.
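The metadata requirements above can be summarized in a small stdlib sketch (the checks and error messages are illustrative assumptions, not the component's actual code):

```python
def validate_doc_meta(meta: dict, file_path_meta_field: str = "file_path") -> str:
    # Every document needs the configured file-path field.
    if file_path_meta_field not in meta:
        raise ValueError(f"missing '{file_path_meta_field}' in document metadata")
    path = meta[file_path_meta_field]
    # PDFs additionally need a page_number to know which page to render.
    if path.lower().endswith(".pdf") and "page_number" not in meta:
        raise ValueError("PDF documents require a 'page_number' metadata key")
    return path

print(validate_doc_meta({"file_path": "doc.pdf", "page_number": 1}))
# doc.pdf
```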

**Returns:**

- <code>dict\[str, list\[ImageContent | None\]\]</code> – Dictionary containing one key:
    - `image_contents`: ImageContents created from the processed documents. These contain base64-encoded image
      data and metadata. The order corresponds to the order of the input documents.

**Raises:**

- <code>ValueError</code> – If any document is missing the required metadata keys, has an invalid file path, or has an unsupported
  MIME type. The error message will specify which document and what information is missing or incorrect.

## image/file_to_document

### ImageFileToDocument

Converts image file references into empty Document objects with associated metadata.

This component is useful in pipelines where image file paths need to be wrapped in `Document` objects to be
processed by downstream components such as the `SentenceTransformersImageDocumentEmbedder`.

It does **not** extract any content from the image files; instead, it creates `Document` objects with `None` as
their content and attaches metadata such as file path and any user-provided values.

### Usage example

```python
from haystack.components.converters.image import ImageFileToDocument

converter = ImageFileToDocument()

sources = ["image.jpg", "another_image.png"]

result = converter.run(sources=sources)
documents = result["documents"]

print(documents)

# [Document(id=..., meta: {'file_path': 'image.jpg'}),
#  Document(id=..., meta: {'file_path': 'another_image.png'})]
```

#### __init__

```python
__init__(*, store_full_path: bool = False) -> None
```

Initialize the ImageFileToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
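A small stdlib sketch of the `store_full_path` behavior described above (illustrative only; the real component attaches more metadata than this):

```python
import os

def file_path_meta(path: str, store_full_path: bool = False) -> dict:
    # With the default store_full_path=False, only the file name is kept.
    return {"file_path": path if store_full_path else os.path.basename(path)}

print(file_path_meta("/data/images/cat.jpg"))
# {'file_path': 'cat.jpg'}
print(file_path_meta("/data/images/cat.jpg", store_full_path=True))
# {'file_path': '/data/images/cat.jpg'}
```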

#### run

```python
run(
    *,
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document]]
```

Convert image files into empty Document objects with metadata.

This method accepts image file references (as file paths or ByteStreams) and creates `Document` objects
without content. These documents are enriched with metadata derived from the input source and optional
user-provided metadata.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they are zipped together.
  For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary containing:
    - `documents`: A list of `Document` objects with empty content and associated metadata.

## image/file_to_image

### ImageFileToImageContent

Converts image files to ImageContent objects.

### Usage example

```python
from haystack.components.converters.image import ImageFileToImageContent

converter = ImageFileToImageContent()

sources = ["image.jpg", "another_image.png"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
#               mime_type='image/jpeg',
#               detail=None,
#               meta={'file_path': 'image.jpg'}),
#  ...]
```

#### __init__

```python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> None
```

Create the ImageFileToImageContent component.

**Parameters:**

- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI).
One of "auto", "high", or "low". 750 This will be passed to the created ImageContent objects. 751 If not provided, the detail level will be the one set in the constructor. 752 - **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while 753 maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial 754 when working with models that have resolution constraints or when transmitting images to remote services. 755 If not provided, the size value will be the one set in the constructor. 756 757 **Returns:** 758 759 - <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys: 760 - `image_contents`: A list of ImageContent objects. 761 762 ## image/pdf_to_image 763 764 ### PDFToImageContent 765 766 Converts PDF files to ImageContent objects. 767 768 ### Usage example 769 770 ```python 771 from haystack.components.converters.image import PDFToImageContent 772 773 converter = PDFToImageContent() 774 775 sources = ["file.pdf", "another_file.pdf"] 776 777 image_contents = converter.run(sources=sources)["image_contents"] 778 print(image_contents) 779 780 # [ImageContent(base64_image='...', 781 # mime_type='application/pdf', 782 # detail=None, 783 # meta={'file_path': 'file.pdf', 'page_number': 1}), 784 # ...] 785 ``` 786 787 #### __init__ 788 789 ```python 790 __init__( 791 *, 792 detail: Literal["auto", "high", "low"] | None = None, 793 size: tuple[int, int] | None = None, 794 page_range: list[str | int] | None = None 795 ) -> None 796 ``` 797 798 Create the PDFToImageContent component. 799 800 **Parameters:** 801 802 - **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low". 803 This will be passed to the created ImageContent objects. 
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages)
  will be skipped with a warning. For example, `page_range=[1, 3]` will convert only the first and third
  pages of the document. It also accepts printable range strings, for example, `['1-3', '5', '8', '10-12']`
  will convert pages 1, 2, 3, 5, 8, 10, 11, and 12.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
  If not provided, the detail level will be the one set in the constructor.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
  If not provided, the size value will be the one set in the constructor.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages)
  will be skipped with a warning. For example, `page_range=[1, 3]` will convert only the first and third
  pages of the document. It also accepts printable range strings, for example, `['1-3', '5', '8', '10-12']`
  will convert pages 1, 2, 3, 5, 8, 10, 11, and 12.
  If not provided, the page_range value will be the one set in the constructor.

**Returns:**

- <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys:
    - `image_contents`: A list of ImageContent objects.

## json

### JSONConverter

Converts one or more JSON files into a text document.
860 861 ### Usage examples 862 863 ```python 864 import json 865 866 from haystack.components.converters import JSONConverter 867 from haystack.dataclasses import ByteStream 868 869 source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"})) 870 871 converter = JSONConverter(content_key="text") 872 results = converter.run(sources=[source]) 873 documents = results["documents"] 874 print(documents[0].content) 875 # 'This is the content of my document' 876 ``` 877 878 Optionally, you can also provide a `jq_schema` string to filter the JSON source files and `extra_meta_fields` 879 to extract from the filtered data: 880 881 ```python 882 import json 883 884 from haystack.components.converters import JSONConverter 885 from haystack.dataclasses import ByteStream 886 887 data = { 888 "laureates": [ 889 { 890 "firstname": "Enrico", 891 "surname": "Fermi", 892 "motivation": "for his demonstrations of the existence of new radioactive elements produced " 893 "by neutron irradiation, and for his related discovery of nuclear reactions brought about by" 894 " slow neutrons", 895 }, 896 { 897 "firstname": "Rita", 898 "surname": "Levi-Montalcini", 899 "motivation": "for their discoveries of growth factors", 900 }, 901 ], 902 } 903 source = ByteStream.from_string(json.dumps(data)) 904 converter = JSONConverter( 905 jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"} 906 ) 907 908 results = converter.run(sources=[source]) 909 documents = results["documents"] 910 print(documents[0].content) 911 # 'for his demonstrations of the existence of new radioactive elements produced by 912 # neutron irradiation, and for his related discovery of nuclear reactions brought 913 # about by slow neutrons' 914 915 print(documents[0].meta) 916 # {'firstname': 'Enrico', 'surname': 'Fermi'} 917 918 print(documents[1].content) 919 # 'for their discoveries of growth factors' 920 921 print(documents[1].meta) 922 # {'firstname': 
'Rita', 'surname': 'Levi-Montalcini'} 923 ``` 924 925 #### __init__ 926 927 ```python 928 __init__( 929 jq_schema: str | None = None, 930 content_key: str | None = None, 931 extra_meta_fields: set[str] | Literal["*"] | None = None, 932 store_full_path: bool = False, 933 ) -> None 934 ``` 935 936 Creates a JSONConverter component. 937 938 An optional `jq_schema` can be provided to extract nested data in the JSON source files. 939 See the [official jq documentation](https://jqlang.github.io/jq/) for more info on the filters syntax. 940 If `jq_schema` is not set, whole JSON source files will be used to extract content. 941 942 Optionally, you can provide a `content_key` to specify which key in the extracted object must 943 be set as the document's content. 944 945 If both `jq_schema` and `content_key` are set, the component will search for the `content_key` in 946 the JSON object extracted by `jq_schema`. If the extracted data is not a JSON object, it will be skipped. 947 948 If only `jq_schema` is set, the extracted data must be a scalar value. If it's a JSON object or array, 949 it will be skipped. 950 951 If only `content_key` is set, the source JSON file must be a JSON object, else it will be skipped. 952 953 `extra_meta_fields` can either be set to a set of strings or a literal `"*"` string. 954 If it's a set of strings, it must specify fields in the extracted objects that must be set in 955 the extracted documents. If a field is not found, the meta value will be `None`. 956 If set to `"*"`, all fields that are not `content_key` found in the filtered JSON object will 957 be saved as metadata. 958 959 Initialization will fail if neither `jq_schema` nor `content_key` are set. 960 961 **Parameters:** 962 963 - **jq_schema** (<code>str | None</code>) – Optional jq filter string to extract content. 964 If not specified, whole JSON object will be used to extract information. 965 - **content_key** (<code>str | None</code>) – Optional key to extract document content. 
966 If `jq_schema` is specified, the `content_key` will be extracted from that object. 967 - **extra_meta_fields** (<code>set\[str\] | Literal['\*'] | None</code>) – An optional set of meta keys to extract from the content. 968 If `jq_schema` is specified, all keys will be extracted from that object. 969 - **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document. 970 If False, only the file name is stored. 971 972 #### to_dict 973 974 ```python 975 to_dict() -> dict[str, Any] 976 ``` 977 978 Serializes the component to a dictionary. 979 980 **Returns:** 981 982 - <code>dict\[str, Any\]</code> – Dictionary with serialized data. 983 984 #### from_dict 985 986 ```python 987 from_dict(data: dict[str, Any]) -> JSONConverter 988 ``` 989 990 Deserializes the component from a dictionary. 991 992 **Parameters:** 993 994 - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from. 995 996 **Returns:** 997 998 - <code>JSONConverter</code> – Deserialized component. 999 1000 #### run 1001 1002 ```python 1003 run( 1004 sources: list[str | Path | ByteStream], 1005 meta: dict[str, Any] | list[dict[str, Any]] | None = None, 1006 ) -> dict[str, Any] 1007 ``` 1008 1009 Converts a list of JSON files to documents. 1010 1011 **Parameters:** 1012 1013 - **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of file paths or ByteStream objects. 1014 - **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents. 1015 This value can be either a list of dictionaries or a single dictionary. 1016 If it's a single dictionary, its content is added to the metadata of all produced documents. 1017 If it's a list, the length of the list must match the number of sources. 1018 If `sources` contain ByteStream objects, their `meta` will be added to the output documents. 
1019 1020 **Returns:** 1021 1022 - <code>dict\[str, Any\]</code> – A dictionary with the following keys: 1023 - `documents`: A list of created documents. 1024 1025 ## markdown 1026 1027 ### MarkdownToDocument 1028 1029 Converts a Markdown file into a text Document. 1030 1031 Usage example: 1032 1033 ```python 1034 from haystack.components.converters import MarkdownToDocument 1035 from datetime import datetime 1036 1037 converter = MarkdownToDocument() 1038 results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()}) 1039 documents = results["documents"] 1040 print(documents[0].content) 1041 # 'This is a text from the markdown file.' 1042 ``` 1043 1044 #### __init__ 1045 1046 ```python 1047 __init__( 1048 table_to_single_line: bool = False, 1049 progress_bar: bool = True, 1050 store_full_path: bool = False, 1051 ) -> None 1052 ``` 1053 1054 Create a MarkdownToDocument component. 1055 1056 **Parameters:** 1057 1058 - **table_to_single_line** (<code>bool</code>) – If True converts table contents into a single line. 1059 - **progress_bar** (<code>bool</code>) – If True shows a progress bar when running. 1060 - **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document. 1061 If False, only the file name is stored. 1062 1063 #### run 1064 1065 ```python 1066 run( 1067 sources: list[str | Path | ByteStream], 1068 meta: dict[str, Any] | list[dict[str, Any]] | None = None, 1069 ) -> dict[str, Any] 1070 ``` 1071 1072 Converts a list of Markdown files to Documents. 1073 1074 **Parameters:** 1075 1076 - **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects. 1077 - **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents. 1078 This value can be either a list of dictionaries or a single dictionary. 
1079 If it's a single dictionary, its content is added to the metadata of all produced Documents. 1080 If it's a list, the length of the list must match the number of sources, because the two lists will 1081 be zipped. 1082 If `sources` contains ByteStream objects, their `meta` will be added to the output Documents. 1083 1084 **Returns:** 1085 1086 - <code>dict\[str, Any\]</code> – A dictionary with the following keys: 1087 - `documents`: List of created Documents 1088 1089 ## msg 1090 1091 ### MSGToDocument 1092 1093 Converts Microsoft Outlook .msg files into Haystack Documents. 1094 1095 This component extracts email metadata (such as sender, recipients, CC, BCC, subject) and body content from .msg 1096 files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg 1097 file are extracted as ByteStream objects. 1098 1099 ### Example Usage 1100 1101 ```python 1102 from haystack.components.converters.msg import MSGToDocument 1103 from datetime import datetime 1104 1105 converter = MSGToDocument() 1106 results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()}) 1107 documents = results["documents"] 1108 attachments = results["attachments"] 1109 print(documents[0].content) 1110 ``` 1111 1112 #### __init__ 1113 1114 ```python 1115 __init__(store_full_path: bool = False) -> None 1116 ``` 1117 1118 Creates a MSGToDocument component. 1119 1120 **Parameters:** 1121 1122 - **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document. 1123 If False, only the file name is stored. 1124 1125 #### run 1126 1127 ```python 1128 run( 1129 sources: list[str | Path | ByteStream], 1130 meta: dict[str, Any] | list[dict[str, Any]] | None = None, 1131 ) -> dict[str, list[Document] | list[ByteStream]] 1132 ``` 1133 1134 Converts MSG files to Documents. 

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, list\[Document\] | list\[ByteStream\]\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents.
  - `attachments`: Created ByteStream objects from file attachments.

## multi_file_converter

### MultiFileConverter

A file converter that handles conversion of multiple file types.

The MultiFileConverter handles the following file types:

- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX

Usage example:

```python
from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
```

#### __init__

```python
__init__(encoding: str = 'utf-8', json_content_key: str = 'content') -> None
```

Initialize the MultiFileConverter.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding to use when reading files.
- **json_content_key** (<code>str</code>) – The key to use for the content field of a Document when converting JSON files.
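The component routes each source to the converter that matches its file type and merges the results. As a rough illustration of that dispatch idea, here is a minimal pure-Python sketch; the converter names in the mapping are stand-ins for the dedicated components documented on this page, and the real component routes on MIME type rather than raw extensions:

```python
# Illustrative sketch of extension-based dispatch, not the actual
# MultiFileConverter implementation.
from pathlib import Path

# Hypothetical mapping from file extension to the converter that handles it.
CONVERTER_BY_EXTENSION = {
    ".txt": "TextFileToDocument",
    ".md": "MarkdownToDocument",
    ".pdf": "PyPDFToDocument",
    ".json": "JSONConverter",
    ".docx": "DOCXToDocument",
}

def route(sources):
    """Group sources by the converter that would handle them."""
    routed: dict[str, list[str]] = {}
    for source in sources:
        converter = CONVERTER_BY_EXTENSION.get(Path(source).suffix.lower())
        if converter is not None:
            routed.setdefault(converter, []).append(source)
    return routed

print(route(["test.txt", "test.pdf"]))
# {'TextFileToDocument': ['test.txt'], 'PyPDFToDocument': ['test.pdf']}
```

Internally, the component wires the dedicated converters together in a pipeline and joins their outputs into a single list of Documents.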

## openapi_functions

### OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must respect OpenAPI specification 3.0.0 or higher.
It can be specified in JSON or YAML format.
Each function must have:

- a unique operationId
- a description
- a requestBody and/or parameters
- a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the [official documentation](https://github.com/OAI/OpenAPI-Specification).
For more details on OpenAI function calling, see the [official documentation](https://platform.openai.com/docs/guides/function-calling).

Usage example:

```python
from haystack.components.converters import OpenAPIServiceToFunctions
from haystack.dataclasses.byte_stream import ByteStream

converter = OpenAPIServiceToFunctions()
spec = ByteStream.from_string(
    '{"openapi":"3.0.0","info":{"title":"API","version":"1.0.0"},"paths":{"/search":{"get":{"operationId":"search","summary":"Search","parameters":[{"name":"q","in":"query","required":true,"schema":{"type":"string"}}]}}}}'
)
result = converter.run(sources=[spec])
assert result["functions"]
```

#### __init__

```python
__init__() -> None
```

Create an OpenAPIServiceToFunctions component.

#### run

```python
run(sources: list[str | Path | ByteStream]) -> dict[str, Any]
```

Converts OpenAPI definitions into OpenAI function calling format.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - functions: Function definitions in JSON object format
  - openapi_specs: OpenAPI specs in JSON/YAML object format with resolved references

**Raises:**

- <code>RuntimeError</code> – If the OpenAPI definitions cannot be downloaded or processed.
- <code>ValueError</code> – If the source type is not recognized or no functions are found in the OpenAPI definitions.

## output_adapter

### OutputAdaptationException

Bases: <code>Exception</code>

Exception raised when there is an error during output adaptation.

### OutputAdapter

Adapts output of a Component using Jinja templates.

Usage example:

```python
from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"
```

#### __init__

```python
__init__(
    template: str,
    output_type: TypeAlias,
    custom_filters: dict[str, Callable] | None = None,
    unsafe: bool = False,
) -> None
```

Create an OutputAdapter component.

**Parameters:**

- **template** (<code>str</code>) – A Jinja template that defines how to adapt the input data.
The variables in the template define the input of this instance.
For example, with this template:

```
{{ documents[0].content }}
```

the component input will be `documents`.

- **output_type** (<code>TypeAlias</code>) – The type of output this instance will return.
- **custom_filters** (<code>dict\[str, Callable\] | None</code>) – A dictionary of custom Jinja filters used in the template.
- **unsafe** (<code>bool</code>) – Enable execution of arbitrary code in the Jinja template.
This should only be used if you trust the source of the template, as it can lead to remote code execution.

#### run

```python
run(**kwargs: Any) -> dict[str, Any]
```

Renders the Jinja template with the provided inputs.

**Parameters:**

- **kwargs** (<code>Any</code>) – Must contain all variables used in the `template` string.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `output`: Rendered Jinja template.

**Raises:**

- <code>OutputAdaptationException</code> – If template rendering fails.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OutputAdapter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>OutputAdapter</code> – The deserialized component.

## pdfminer

### PDFMinerToDocument

Converts PDF files to Documents.

Uses `pdfminer` compatible converters to convert PDF files to Documents.
For more information, see the [pdfminer documentation](https://pdfminersix.readthedocs.io/en/latest/).

Usage example:

```python
from datetime import datetime

from haystack.components.converters.pdfminer import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    line_overlap: float = 0.5,
    char_margin: float = 2.0,
    line_margin: float = 0.5,
    word_margin: float = 0.1,
    boxes_flow: float | None = 0.5,
    detect_vertical: bool = True,
    all_texts: bool = False,
    store_full_path: bool = False,
) -> None
```

Create a PDFMinerToDocument component.

**Parameters:**

- **line_overlap** (<code>float</code>) – This parameter determines whether two characters are considered to be on
the same line based on the amount of overlap between them.
The overlap is calculated relative to the minimum height of both characters.
- **char_margin** (<code>float</code>) – Determines whether two characters are part of the same line based on the distance between them.
If the distance is less than the margin specified, the characters are considered to be on the same line.
The margin is calculated relative to the width of the character.
- **word_margin** (<code>float</code>) – Determines whether two characters on the same line are part of the same word
based on the distance between them. If the distance is greater than the margin specified,
an intermediate space will be added between them to make the text more readable.
The margin is calculated relative to the width of the character.
- **line_margin** (<code>float</code>) – This parameter determines whether two lines are part of the same paragraph based on
the distance between them.
If the distance is less than the margin specified,
the lines are considered to be part of the same paragraph.
The margin is calculated relative to the height of a line.
- **boxes_flow** (<code>float | None</code>) – This parameter determines the importance of horizontal and vertical position when
determining the order of text boxes. A value between -1.0 and +1.0 can be set,
with -1.0 indicating that only horizontal position matters and +1.0 indicating
that only vertical position matters. Setting the value to `None` will disable advanced
layout analysis, and text boxes will be ordered based on the position of their bottom left corner.
- **detect_vertical** (<code>bool</code>) – This parameter determines whether vertical text should be considered during layout analysis.
- **all_texts** (<code>bool</code>) – Whether layout analysis should be performed on text in figures.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### detect_undecoded_cid_characters

```python
detect_undecoded_cid_characters(text: str) -> dict[str, Any]
```

Look for character sequences of CID, i.e.: characters that haven't been properly decoded from their CID format.

This is useful to detect if the text extractor is not able to extract the text correctly, e.g. if the PDF uses
non-standard fonts.

A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like
searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor
needs. If that map is not available, the text extractor cannot decode the CID characters and will return them
as is.

See: https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output

**Parameters:**

- **text** (<code>str</code>) – The text to check for undecoded CID characters.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary containing detection results.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts PDF files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of PDF file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents

## pptx

### PPTXToDocument

Converts PPTX files to Documents.

Usage example:

```python
from datetime import datetime

from haystack.components.converters.pptx import PPTXToDocument

converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'
1489 ``` 1490 1491 #### __init__ 1492 1493 ```python 1494 __init__( 1495 store_full_path: bool = False, 1496 link_format: Literal["markdown", "plain", "none"] = "none", 1497 ) -> None 1498 ``` 1499 1500 Create a PPTXToDocument component. 1501 1502 **Parameters:** 1503 1504 - **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document. 1505 If False, only the file name is stored. 1506 - **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options: 1507 - `"markdown"`: `[text](url)` 1508 - `"plain"`: `text (url)` 1509 - `"none"`: Only the text is extracted, link addresses are ignored. 1510 1511 #### to_dict 1512 1513 ```python 1514 to_dict() -> dict[str, Any] 1515 ``` 1516 1517 Serializes the component to a dictionary. 1518 1519 **Returns:** 1520 1521 - <code>dict\[str, Any\]</code> – Dictionary with serialized data. 1522 1523 #### run 1524 1525 ```python 1526 run( 1527 sources: list[str | Path | ByteStream], 1528 meta: dict[str, Any] | list[dict[str, Any]] | None = None, 1529 ) -> dict[str, Any] 1530 ``` 1531 1532 Converts PPTX files to Documents. 1533 1534 **Parameters:** 1535 1536 - **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects. 1537 - **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents. 1538 This value can be either a list of dictionaries or a single dictionary. 1539 If it's a single dictionary, its content is added to the metadata of all produced Documents. 1540 If it's a list, the length of the list must match the number of sources, because the two lists will 1541 be zipped. 1542 If `sources` contains ByteStream objects, their `meta` will be added to the output Documents. 

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents

## pypdf

### PyPDFExtractionMode

Bases: <code>Enum</code>

The mode to use for extracting text from a PDF.

#### from_str

```python
from_str(string: str) -> PyPDFExtractionMode
```

Convert a string to a PyPDFExtractionMode enum.

### PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses the PyPDF library.
You can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    *,
    extraction_mode: str | PyPDFExtractionMode = PyPDFExtractionMode.PLAIN,
    plain_mode_orientations: tuple = (0, 90, 180, 270),
    plain_mode_space_width: float = 200.0,
    layout_mode_space_vertically: bool = True,
    layout_mode_scale_weight: float = 1.25,
    layout_mode_strip_rotated: bool = True,
    layout_mode_font_height_weight: float = 1.0,
    store_full_path: bool = False
) -> None
```

Create a PyPDFToDocument component.

**Parameters:**

- **extraction_mode** (<code>str | PyPDFExtractionMode</code>) – The mode to use for extracting text from a PDF.
Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
- **plain_mode_orientations** (<code>tuple</code>) – Tuple of orientations to look for when extracting text from a PDF in plain mode.
1607 Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`. 1608 - **plain_mode_space_width** (<code>float</code>) – Forces default space width if not extracted from font. 1609 Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`. 1610 - **layout_mode_space_vertically** (<code>bool</code>) – Whether to include blank lines inferred from y distance + font height. 1611 Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`. 1612 - **layout_mode_scale_weight** (<code>float</code>) – Multiplier for string length when calculating weighted average character width. 1613 Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`. 1614 - **layout_mode_strip_rotated** (<code>bool</code>) – Layout mode does not support rotated text. Set to `False` to include rotated text anyway. 1615 If rotated text is discovered, layout will be degraded and a warning will be logged. 1616 Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`. 1617 - **layout_mode_font_height_weight** (<code>float</code>) – Multiplier for font height when calculating blank line height. 1618 Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`. 1619 - **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document. 1620 If False, only the file name is stored. 1621 1622 #### to_dict 1623 1624 ```python 1625 to_dict() -> dict[str, Any] 1626 ``` 1627 1628 Serializes the component to a dictionary. 1629 1630 **Returns:** 1631 1632 - <code>dict\[str, Any\]</code> – Dictionary with serialized data. 1633 1634 #### from_dict 1635 1636 ```python 1637 from_dict(data: dict[str, Any]) -> PyPDFToDocument 1638 ``` 1639 1640 Deserializes the component from a dictionary. 1641 1642 **Parameters:** 1643 1644 - **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data. 1645 1646 **Returns:** 1647 1648 - <code>PyPDFToDocument</code> – Deserialized component. 
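As with the other converters, `to_dict`/`from_dict` round-trip the component through a dictionary of its constructor arguments. A minimal pure-Python sketch of that pattern (an illustrative stand-in with hypothetical names, not Haystack's actual implementation, which delegates to shared serialization utilities):

```python
# Illustrative stand-in for the component serialization round trip.
class ToyConverter:
    def __init__(self, extraction_mode: str = "plain", store_full_path: bool = False):
        self.extraction_mode = extraction_mode
        self.store_full_path = store_full_path

    def to_dict(self) -> dict:
        # Serialized form: import path of the class plus its init parameters.
        return {
            "type": f"{type(self).__module__}.{type(self).__qualname__}",
            "init_parameters": {
                "extraction_mode": self.extraction_mode,
                "store_full_path": self.store_full_path,
            },
        }

    @classmethod
    def from_dict(cls, data: dict) -> "ToyConverter":
        # Rebuild the component from the stored init parameters.
        return cls(**data["init_parameters"])

restored = ToyConverter.from_dict(ToyConverter(extraction_mode="layout").to_dict())
assert restored.extraction_mode == "layout"
```

This round trip is what allows a pipeline containing the component to be saved to YAML and reloaded with the same configuration.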
1649 1650 #### run 1651 1652 ```python 1653 run( 1654 sources: list[str | Path | ByteStream], 1655 meta: dict[str, Any] | list[dict[str, Any]] | None = None, 1656 ) -> dict[str, list[Document]] 1657 ``` 1658 1659 Converts PDF files to documents. 1660 1661 **Parameters:** 1662 1663 - **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert. 1664 - **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents. 1665 This value can be a list of dictionaries or a single dictionary. 1666 If it's a single dictionary, its content is added to the metadata of all produced documents. 1667 If it's a list, its length must match the number of sources, as they are zipped together. 1668 For ByteStream objects, their `meta` is added to the output documents. 1669 1670 **Returns:** 1671 1672 - <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys: 1673 - `documents`: A list of converted documents. 1674 1675 ## tika 1676 1677 ### XHTMLParser 1678 1679 Bases: <code>HTMLParser</code> 1680 1681 Custom parser to extract pages from Tika XHTML content. 1682 1683 #### handle_starttag 1684 1685 ```python 1686 handle_starttag(tag: str, attrs: list[tuple[str, str | None]]) -> None 1687 ``` 1688 1689 Identify the start of a page div. 1690 1691 #### handle_endtag 1692 1693 ```python 1694 handle_endtag(tag: str) -> None 1695 ``` 1696 1697 Identify the end of a page div. 1698 1699 #### handle_data 1700 1701 ```python 1702 handle_data(data: str) -> None 1703 ``` 1704 1705 Populate the page content. 1706 1707 ### TikaDocumentConverter 1708 1709 Converts files of different types to Documents. 1710 1711 This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore, 1712 requires a running Tika server. 
For more options on running Tika,
see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

Usage example:

```python
from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'
```

#### __init__

```python
__init__(
    tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False
) -> None
```

Create a TikaDocumentConverter component.

**Parameters:**

- **tika_url** (<code>str</code>) – Tika server URL.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: A list of converted Documents.

## txt

### TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'
```

#### __init__

```python
__init__(encoding: str = 'utf-8', store_full_path: bool = False) -> None
```

Creates a TextFileToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the text files to convert.
If the encoding is specified in the metadata of a source ByteStream,
it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts text files to documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of text file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they're zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: A list of converted documents.

## xlsx

### XLSXToDocument

Converts XLSX (Excel) files into Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is
created for each sheet. The content of the Document is the table, which can be saved in CSV or Markdown format.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.xlsx import XLSXToDocument

converter = XLSXToDocument()
results = converter.run(sources=["sample.xlsx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# ",A,B\n1,col_a,col_b\n2,1.5,test\n"
```

#### __init__

```python
__init__(
    table_format: Literal["csv", "markdown"] = "csv",
    sheet_name: str | int | list[str | int] | None = None,
    read_excel_kwargs: dict[str, Any] | None = None,
    table_format_kwargs: dict[str, Any] | None = None,
    *,
    link_format: Literal["markdown", "plain", "none"] = "none",
    store_full_path: bool = False
) -> None
```

Creates an XLSXToDocument component.

**Parameters:**

- **table_format** (<code>Literal['csv', 'markdown']</code>) – The format to convert the Excel file to.
- **sheet_name** (<code>str | int | list\[str | int\] | None</code>) – The name of the sheet to read. If None, all sheets are read.
- **read_excel_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional arguments to pass to `pandas.read_excel`.
See https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel
- **table_format_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to pass to the table format function.
    - If `table_format` is "csv", these arguments are passed to `pandas.DataFrame.to_csv`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
    - If `table_format` is "markdown", these arguments are passed to `pandas.DataFrame.to_markdown`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
- **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options:
    - `"markdown"`: `[text](url)`
    - `"plain"`: `text (url)`
    - `"none"`: Only the text is extracted; link addresses are ignored.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts XLSX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they are zipped together.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: A list of converted documents.
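
Every `run` method above applies the same `meta` rules: a single dictionary is copied into the metadata of every produced document, while a list is zipped one-to-one with `sources`. A minimal stdlib sketch of that logic (the helper name `normalize_meta` is hypothetical and for illustration only; it is not part of the Haystack API):

```python
def normalize_meta(sources, meta=None):
    """Return one metadata dict per source, mirroring the documented rules.

    Hypothetical helper for illustration; not part of the Haystack API.
    """
    if meta is None:
        # No metadata given: each document starts with an empty dict.
        return [{} for _ in sources]
    if isinstance(meta, dict):
        # A single dict is copied into the metadata of every document.
        return [dict(meta) for _ in sources]
    if len(meta) != len(sources):
        # A list must line up one-to-one with the sources before zipping.
        raise ValueError("The length of the metadata list must match the number of sources.")
    return [dict(m) for m in meta]


print(normalize_meta(["a.txt", "b.txt"], {"lang": "en"}))
# [{'lang': 'en'}, {'lang': 'en'}]
```

For ByteStream sources, each stream's own `meta` is additionally merged into the corresponding document's metadata, as noted in each `run` reference above.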