---
title: "Converters"
id: converters-api
description: "Various converters to transform data from one format to another."
slug: "/converters-api"
---

## azure

### AzureOCRDocumentConverter

Converts files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

```python
import os
from datetime import datetime
from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(
    endpoint=os.environ["CORE_AZURE_CS_ENDPOINT"],
    api_key=Secret.from_env_var("CORE_AZURE_CS_API_KEY"),
)
results = converter.run(
    sources=["test/test_files/pdf/react_paper.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    endpoint: str,
    api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
    model_id: str = "prebuilt-read",
    preceding_context_len: int = 3,
    following_context_len: int = 3,
    merge_multiple_column_headers: bool = True,
    page_layout: Literal["natural", "single_column"] = "natural",
    threshold_y: float | None = 0.05,
    store_full_path: bool = False,
)
```

Creates an AzureOCRDocumentConverter component.

**Parameters:**

- **endpoint** (<code>str</code>) – The endpoint of your Azure resource.
- **api_key** (<code>Secret</code>) – The API key of your Azure resource.
- **model_id** (<code>str</code>) – The ID of the model you want to use. For a list of available models, see the
  [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
- **preceding_context_len** (<code>int</code>) – Number of lines before a table to include as preceding context
  (this will be added to the metadata).
- **following_context_len** (<code>int</code>) – Number of lines after a table to include as subsequent context
  (this will be added to the metadata).
- **merge_multiple_column_headers** (<code>bool</code>) – If `True`, merges multiple column header rows into a single row.
- **page_layout** (<code>Literal['natural', 'single_column']</code>) – The type of reading order to follow. Possible options:
    - `natural`: Uses the natural reading order determined by Azure.
    - `single_column`: Groups all lines with the same height on the page based on a threshold
      determined by `threshold_y`.
- **threshold_y** (<code>float | None</code>) – Only relevant if `page_layout` is set to `single_column`.
  The threshold, in inches, to determine if two recognized PDF elements are grouped into a
  single line. This is crucial for section headers or numbers which may be spatially separated
  from the remaining text on the horizontal axis.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will be
  zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- – A dictionary with the following keys:
    - `documents`: List of created Documents
    - `raw_azure_response`: List of raw Azure responses used to create the Documents

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOCRDocumentConverter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>AzureOCRDocumentConverter</code> – The deserialized component.

## csv

### CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.
### Usage example

```python
from datetime import datetime

from haystack.components.converters.csv import CSVToDocument

converter = CSVToDocument()
results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'col1,col2\nrow1,row1\nrow2,row2\n'
```

#### __init__

```python
__init__(
    encoding: str = "utf-8",
    store_full_path: bool = False,
    *,
    conversion_mode: Literal["file", "row"] = "file",
    delimiter: str = ",",
    quotechar: str = '"'
)
```

Creates a CSVToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the CSV files to convert.
  If the encoding is specified in the metadata of a source ByteStream,
  it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
- **conversion_mode** (<code>Literal['file', 'row']</code>) –
    - "file" (default): one Document per CSV file whose content is the raw CSV text.
    - "row": convert each CSV row to its own Document (requires `content_column` in `run()`).
- **delimiter** (<code>str</code>) – CSV delimiter used when parsing in row mode (passed to `csv.DictReader`).
- **quotechar** (<code>str</code>) – CSV quote character used when parsing in row mode (passed to `csv.DictReader`).

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    content_column: str | None = None,
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
)
```

Converts CSV files to a Document (file mode) or to one Document per row (row mode).

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **content_column** (<code>str | None</code>) – **Required when** `conversion_mode="row"`.
  The column name whose values become `Document.content` for each row.
  The column must exist in the CSV header.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- – A dictionary with the following keys:
    - `documents`: Created documents

## docx

### DOCXMetadata

Describes the metadata of a DOCX file.

**Parameters:**

- **author** (<code>str</code>) – The author
- **category** (<code>str</code>) – The category
- **comments** (<code>str</code>) – The comments
- **content_status** (<code>str</code>) – The content status
- **created** (<code>str | None</code>) – The creation date (ISO formatted string)
- **identifier** (<code>str</code>) – The identifier
- **keywords** (<code>str</code>) – Available keywords
- **language** (<code>str</code>) – The language of the document
- **last_modified_by** (<code>str</code>) – User who last modified the document
- **last_printed** (<code>str | None</code>) – The last printed date (ISO formatted string)
- **modified** (<code>str | None</code>) – The last modification date (ISO formatted string)
- **revision** (<code>int</code>) – The revision number
- **subject** (<code>str</code>) – The subject
- **title** (<code>str</code>) – The title
- **version** (<code>str</code>) – The version

### DOCXTableFormat

Bases: <code>Enum</code>
Supported formats for storing DOCX tabular data in a Document.

#### from_str

```python
from_str(string: str) -> DOCXTableFormat
```

Convert a string to a DOCXTableFormat enum.

### DOCXLinkFormat

Bases: <code>Enum</code>

Supported formats for storing DOCX link information in a Document.

#### from_str

```python
from_str(string: str) -> DOCXLinkFormat
```

Convert a string to a DOCXLinkFormat enum.

### DOCXToDocument

Converts DOCX files to Documents.

Uses the `python-docx` library to convert the DOCX file to a document.
This component does not preserve page breaks in the original document.

Usage example:

```python
from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'
```

#### __init__

```python
__init__(
    table_format: str | DOCXTableFormat = DOCXTableFormat.CSV,
    link_format: str | DOCXLinkFormat = DOCXLinkFormat.NONE,
    store_full_path: bool = False,
)
```

Create a DOCXToDocument component.

**Parameters:**

- **table_format** (<code>str | DOCXTableFormat</code>) – The format for table output. Can be either DOCXTableFormat.MARKDOWN,
  DOCXTableFormat.CSV, "markdown", or "csv".
- **link_format** (<code>str | DOCXLinkFormat</code>) – The format for link output. Can be either:
  DOCXLinkFormat.MARKDOWN or "markdown" to get `[text](address)`,
  DOCXLinkFormat.PLAIN or "plain" to get `text (address)`,
  DOCXLinkFormat.NONE or "none" to get text without links.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DOCXToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>DOCXToDocument</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts DOCX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- – A dictionary with the following keys:
    - `documents`: Created Documents

## file_to_file_content

### FileToFileContent

Converts files to FileContent objects to be included in ChatMessage objects.
### Usage example

```python
from haystack.components.converters import FileToFileContent

converter = FileToFileContent()

sources = ["document.pdf", "video.mp4"]

file_contents = converter.run(sources=sources)["file_contents"]
print(file_contents)

# [FileContent(base64_data='...',
#              mime_type='application/pdf',
#              filename='document.pdf',
#              extra={}),
#  ...]
```

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    extra: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[FileContent]]
```

Converts files to FileContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **extra** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional extra information to attach to the FileContent objects.
  Can be used to store provider-specific information.
  To avoid serialization issues, values should be JSON serializable.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the extra of all produced FileContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.

**Returns:**

- <code>dict\[str, list\[FileContent\]\]</code> – A dictionary with the following keys:
    - `file_contents`: A list of FileContent objects.

## html

### HTMLToDocument

Converts an HTML file to a Document.

Usage example:

```python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'
```

#### __init__

```python
__init__(
    extraction_kwargs: dict[str, Any] | None = None,
    store_full_path: bool = False,
)
```

Create an HTMLToDocument component.

**Parameters:**

- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary containing keyword arguments to customize the extraction process. These
  are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see
  the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HTMLToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>HTMLToDocument</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    extraction_kwargs: dict[str, Any] | None = None,
)
```

Converts a list of HTML files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of HTML file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to customize the extraction process.

**Returns:**

- – A dictionary with the following keys:
    - `documents`: Created Documents

## image/document_to_image

### DocumentToImageContent

Converts documents sourced from PDF and image files into ImageContents.

This component processes a list of documents and extracts visual content from supported file formats, converting
them into ImageContents that can be used for multimodal AI tasks. It handles both direct image files and PDF
documents by extracting specific pages as images.

Documents are expected to have metadata containing:

- The `file_path_meta_field` key with a valid file path that exists when combined with `root_path`
- A supported image format (MIME type must be one of the supported image types)
- For PDF files, a `page_number` key specifying which page to extract

### Usage example

```python
from haystack import Document
from haystack.components.converters.image.document_to_image import DocumentToImageContent

converter = DocumentToImageContent(
    file_path_meta_field="file_path",
    root_path="/data/files",
    detail="high",
    size=(800, 600)
)

documents = [
    Document(content="Optional description of image.jpg", meta={"file_path": "image.jpg"}),
    Document(content="Text content of page 1 of doc.pdf", meta={"file_path": "doc.pdf", "page_number": 1})
]

result = converter.run(documents)
image_contents = result["image_contents"]
# [ImageContent(
#     base64_image='/9j/4A...', mime_type='image/jpeg', detail='high', meta={'file_path': 'image.jpg'}
# ),
# ImageContent(
#     base64_image='/9j/4A...', mime_type='image/jpeg', detail='high',
#     meta={'page_number': 1, 'file_path': 'doc.pdf'}
# )]
```

#### __init__

```python
__init__(
    *,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
)
```

Initialize the DocumentToImageContent component.

**Parameters:**

- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(documents: list[Document]) -> dict[str, list[ImageContent | None]]
```

Convert documents with image or PDF sources into ImageContent objects.

This method processes the input documents, extracting images from supported file formats and converting them
into ImageContent objects.
**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to process. Each document should have metadata containing at minimum
  the key specified by `file_path_meta_field`. PDF documents additionally require a `page_number` key to specify which
  page to convert.

**Returns:**

- <code>dict\[str, list\[ImageContent | None\]\]</code> – Dictionary containing one key:
    - "image_contents": ImageContents created from the processed documents. These contain base64-encoded image
      data and metadata. The order corresponds to the order of the input documents.

**Raises:**

- <code>ValueError</code> – If any document is missing the required metadata keys, has an invalid file path, or has an unsupported
  MIME type. The error message will specify which document and what information is missing or incorrect.

## image/file_to_document

### ImageFileToDocument

Converts image file references into empty Document objects with associated metadata.

This component is useful in pipelines where image file paths need to be wrapped in `Document` objects to be
processed by downstream components such as the `SentenceTransformersImageDocumentEmbedder`.

It does **not** extract any content from the image files; instead it creates `Document` objects with `None` as
their content and attaches metadata such as file path and any user-provided values.
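The wrapping described above can be sketched with plain dictionaries standing in for Haystack `Document` objects (a simplified illustration; the real component also merges ByteStream metadata and user-provided `meta`):

```python
from pathlib import Path
from typing import Any


def image_files_to_docs(sources: list[str], store_full_path: bool = False) -> list[dict[str, Any]]:
    """Wrap each image path in a content-less 'document' dict with file metadata."""
    docs = []
    for src in sources:
        # Mirror the store_full_path flag: keep the whole path or just the file name.
        file_path = src if store_full_path else Path(src).name
        docs.append({"content": None, "meta": {"file_path": file_path}})
    return docs


print(image_files_to_docs(["/data/image.jpg"]))
# [{'content': None, 'meta': {'file_path': 'image.jpg'}}]
```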
### Usage example

```python
from haystack.components.converters.image import ImageFileToDocument

converter = ImageFileToDocument()

sources = ["image.jpg", "another_image.png"]

result = converter.run(sources=sources)
documents = result["documents"]

print(documents)

# [Document(id=..., meta: {'file_path': 'image.jpg'}),
#  Document(id=..., meta: {'file_path': 'another_image.png'})]
```

#### __init__

```python
__init__(*, store_full_path: bool = False)
```

Initialize the ImageFileToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    *,
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document]]
```

Convert image files into empty Document objects with metadata.

This method accepts image file references (as file paths or ByteStreams) and creates `Document` objects
without content. These documents are enriched with metadata derived from the input source and optional
user-provided metadata.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they are zipped together.
  For ByteStream objects, their `meta` is added to the output documents.
**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary containing:
    - `documents`: A list of `Document` objects with empty content and associated metadata.

## image/file_to_image

### ImageFileToImageContent

Converts image files to ImageContent objects.

### Usage example

```python
from haystack.components.converters.image import ImageFileToImageContent

converter = ImageFileToImageContent()

sources = ["image.jpg", "another_image.png"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
#               mime_type='image/jpeg',
#               detail=None,
#               meta={'file_path': 'image.jpg'}),
#  ...]
```

#### __init__

```python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
)
```

Create the ImageFileToImageContent component.

**Parameters:**

- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
  If not provided, the detail level will be the one set in the constructor.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
  If not provided, the size value will be the one set in the constructor.

**Returns:**

- <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys:
    - `image_contents`: A list of ImageContent objects.

## image/pdf_to_image

### PDFToImageContent

Converts PDF files to ImageContent objects.
### Usage example

```python
from haystack.components.converters.image import PDFToImageContent

converter = PDFToImageContent()

sources = ["file.pdf", "another_file.pdf"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
#               mime_type='application/pdf',
#               detail=None,
#               meta={'file_path': 'file.pdf', 'page_number': 1}),
#  ...]
```

#### __init__

```python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
)
```

Create the PDFToImageContent component.

**Parameters:**

- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages)
  will be skipped with a warning. For example, page_range=[1, 3] will convert only the first and third
  pages of the document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12']
  will convert pages 1, 2, 3, 5, 8, 10, 11, 12.
#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
  If not provided, the detail level will be the one set in the constructor.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
  If not provided, the size value will be the one set in the constructor.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted.
Pages outside the valid range (1 to number of pages) 844 will be skipped with a warning. For example, page_range=[1, 3] will convert only the first and third 845 pages of the document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] 846 will convert pages 1, 2, 3, 5, 8, 10, 11, 12. 847 If not provided, the page_range value will be the one set in the constructor. 848 849 **Returns:** 850 851 - <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys: 852 - `image_contents`: A list of ImageContent objects. 853 854 ## json 855 856 ### JSONConverter 857 858 Converts one or more JSON files into a text document. 859 860 ### Usage examples 861 862 ```python 863 import json 864 865 from haystack.components.converters import JSONConverter 866 from haystack.dataclasses import ByteStream 867 868 source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"})) 869 870 converter = JSONConverter(content_key="text") 871 results = converter.run(sources=[source]) 872 documents = results["documents"] 873 print(documents[0].content) 874 # 'This is the content of my document' 875 ``` 876 877 Optionally, you can also provide a `jq_schema` string to filter the JSON source files and `extra_meta_fields` 878 to extract from the filtered data: 879 880 ```python 881 import json 882 883 from haystack.components.converters import JSONConverter 884 from haystack.dataclasses import ByteStream 885 886 data = { 887 "laureates": [ 888 { 889 "firstname": "Enrico", 890 "surname": "Fermi", 891 "motivation": "for his demonstrations of the existence of new radioactive elements produced " 892 "by neutron irradiation, and for his related discovery of nuclear reactions brought about by" 893 " slow neutrons", 894 }, 895 { 896 "firstname": "Rita", 897 "surname": "Levi-Montalcini", 898 "motivation": "for their discoveries of growth factors", 899 }, 900 ], 901 } 902 source = ByteStream.from_string(json.dumps(data)) 903 converter 
= JSONConverter( 904 jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"} 905 ) 906 907 results = converter.run(sources=[source]) 908 documents = results["documents"] 909 print(documents[0].content) 910 # 'for his demonstrations of the existence of new radioactive elements produced by 911 # neutron irradiation, and for his related discovery of nuclear reactions brought 912 # about by slow neutrons' 913 914 print(documents[0].meta) 915 # {'firstname': 'Enrico', 'surname': 'Fermi'} 916 917 print(documents[1].content) 918 # 'for their discoveries of growth factors' 919 920 print(documents[1].meta) 921 # {'firstname': 'Rita', 'surname': 'Levi-Montalcini'} 922 ``` 923 924 #### __init__ 925 926 ```python 927 __init__( 928 jq_schema: str | None = None, 929 content_key: str | None = None, 930 extra_meta_fields: set[str] | Literal["*"] | None = None, 931 store_full_path: bool = False, 932 ) 933 ``` 934 935 Creates a JSONConverter component. 936 937 An optional `jq_schema` can be provided to extract nested data in the JSON source files. 938 See the [official jq documentation](https://jqlang.github.io/jq/) for more info on the filters syntax. 939 If `jq_schema` is not set, whole JSON source files will be used to extract content. 940 941 Optionally, you can provide a `content_key` to specify which key in the extracted object must 942 be set as the document's content. 943 944 If both `jq_schema` and `content_key` are set, the component will search for the `content_key` in 945 the JSON object extracted by `jq_schema`. If the extracted data is not a JSON object, it will be skipped. 946 947 If only `jq_schema` is set, the extracted data must be a scalar value. If it's a JSON object or array, 948 it will be skipped. 949 950 If only `content_key` is set, the source JSON file must be a JSON object, else it will be skipped. 951 952 `extra_meta_fields` can either be set to a set of strings or a literal `"*"` string. 
If it's a set of strings, it must specify fields in the extracted objects that must be set in
the extracted documents. If a field is not found, the meta value will be `None`.
If set to `"*"`, all fields that are not `content_key` found in the filtered JSON object will
be saved as metadata.

Initialization will fail if neither `jq_schema` nor `content_key` is set.

**Parameters:**

- **jq_schema** (<code>str | None</code>) – Optional jq filter string to extract content.
  If not specified, the whole JSON object will be used to extract information.
- **content_key** (<code>str | None</code>) – Optional key to extract document content.
  If `jq_schema` is specified, the `content_key` will be extracted from that object.
- **extra_meta_fields** (<code>set\[str\] | Literal['\*'] | None</code>) – An optional set of meta keys to extract from the content.
  If `jq_schema` is specified, all keys will be extracted from that object.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> JSONConverter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>JSONConverter</code> – Deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts a list of JSON files to documents.
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, the length of the list must match the number of sources.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: A list of created documents.

## markdown

### MarkdownToDocument

Converts a Markdown file into a text Document.

Usage example:

```python
from datetime import datetime

from haystack.components.converters import MarkdownToDocument

converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'
```

#### __init__

```python
__init__(
    table_to_single_line: bool = False,
    progress_bar: bool = True,
    store_full_path: bool = False,
)
```

Create a MarkdownToDocument component.

**Parameters:**

- **table_to_single_line** (<code>bool</code>) – If True, converts table contents into a single line.
- **progress_bar** (<code>bool</code>) – If True, shows a progress bar when running.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts a list of Markdown files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: List of created Documents

## msg

### MSGToDocument

Converts Microsoft Outlook .msg files into Haystack Documents.

This component extracts email metadata (such as sender, recipients, CC, BCC, subject) and body content from .msg
files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg
file are extracted as ByteStream objects.

### Example Usage

```python
from datetime import datetime

from haystack.components.converters.msg import MSGToDocument

converter = MSGToDocument()
results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
attachments = results["attachments"]
print(documents[0].content)
```

#### __init__

```python
__init__(store_full_path: bool = False) -> None
```

Creates an MSGToDocument component.
**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document] | list[ByteStream]]
```

Converts MSG files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, list\[Document\] | list\[ByteStream\]\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents.
  - `attachments`: Created ByteStream objects from file attachments.

## multi_file_converter

### MultiFileConverter

A file converter that handles conversion of multiple file types.
The MultiFileConverter handles the following file types:

- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX

Usage example:

```python
from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
```

#### __init__

```python
__init__(encoding: str = 'utf-8', json_content_key: str = 'content') -> None
```

Initialize the MultiFileConverter.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding to use when reading files.
- **json_content_key** (<code>str</code>) – The key to use as the content field in a document when converting JSON files.

## openapi_functions

### OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must respect OpenAPI specification 3.0.0 or higher.
It can be specified in JSON or YAML format.
Each function must have:

- a unique operationId
- a description
- a requestBody and/or parameters
- a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the [official documentation](https://github.com/OAI/OpenAPI-Specification).
For more details on OpenAI function calling, see the [official documentation](https://platform.openai.com/docs/guides/function-calling).

Usage example:

```python
from haystack.components.converters import OpenAPIServiceToFunctions

converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]
```

#### __init__

```python
__init__()
```

Create an OpenAPIServiceToFunctions component.
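For illustration, each entry in the component's `functions` output follows the OpenAI function-calling schema. A sketch with a hypothetical `getWeather` operation (the name and fields here are illustrative, not taken from a real definition):

```python
# Hypothetical example of one converted function definition:
function_definition = {
    "name": "getWeather",  # taken from the operation's unique operationId
    "description": "Get the current weather for a city",
    "parameters": {  # JSON Schema built from the requestBody and/or parameters
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
```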
#### run

```python
run(sources: list[str | Path | ByteStream]) -> dict[str, Any]
```

Converts OpenAPI definitions into OpenAI function calling format.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `functions`: Function definitions in JSON object format
  - `openapi_specs`: OpenAPI specs in JSON/YAML object format with resolved references

**Raises:**

- <code>RuntimeError</code> – If the OpenAPI definitions cannot be downloaded or processed.
- <code>ValueError</code> – If the source type is not recognized or no functions are found in the OpenAPI definitions.

## output_adapter

### OutputAdaptationException

Bases: <code>Exception</code>

Exception raised when there is an error during output adaptation.

### OutputAdapter

Adapts output of a Component using Jinja templates.

Usage example:

```python
from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"
```

#### __init__

```python
__init__(
    template: str,
    output_type: TypeAlias,
    custom_filters: dict[str, Callable] | None = None,
    unsafe: bool = False,
) -> None
```

Create an OutputAdapter component.

**Parameters:**

- **template** (<code>str</code>) – A Jinja template that defines how to adapt the input data.
  The variables in the template define the input of this instance.
  For example, with this template:

  ```
  {{ documents[0].content }}
  ```

  the Component input will be `documents`.

- **output_type** (<code>TypeAlias</code>) – The type of output this instance will return.
- **custom_filters** (<code>dict\[str, Callable\] | None</code>) – A dictionary of custom Jinja filters used in the template.
- **unsafe** (<code>bool</code>) – Enable execution of arbitrary code in the Jinja template.
  This should only be used if you trust the source of the template, as it can lead to remote code execution.

#### run

```python
run(**kwargs)
```

Renders the Jinja template with the provided inputs.

**Parameters:**

- **kwargs** – Must contain all variables used in the `template` string.

**Returns:**

- – A dictionary with the following keys:
  - `output`: Rendered Jinja template.

**Raises:**

- <code>OutputAdaptationException</code> – If template rendering fails.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OutputAdapter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>OutputAdapter</code> – The deserialized component.

## pdfminer

### PDFMinerToDocument

Converts PDF files to Documents.

Uses `pdfminer` compatible converters to convert PDF files to Documents.
https://pdfminersix.readthedocs.io/en/latest/

Usage example:

```python
from datetime import datetime

from haystack.components.converters.pdfminer import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    line_overlap: float = 0.5,
    char_margin: float = 2.0,
    line_margin: float = 0.5,
    word_margin: float = 0.1,
    boxes_flow: float | None = 0.5,
    detect_vertical: bool = True,
    all_texts: bool = False,
    store_full_path: bool = False,
) -> None
```

Create a PDFMinerToDocument component.

**Parameters:**

- **line_overlap** (<code>float</code>) – This parameter determines whether two characters are considered to be on
  the same line based on the amount of overlap between them.
  The overlap is calculated relative to the minimum height of both characters.
- **char_margin** (<code>float</code>) – Determines whether two characters are part of the same line based on the distance between them.
  If the distance is less than the margin specified, the characters are considered to be on the same line.
  The margin is calculated relative to the width of the character.
- **word_margin** (<code>float</code>) – Determines whether two characters on the same line are part of the same word
  based on the distance between them. If the distance is greater than the margin specified,
  an intermediate space will be added between them to make the text more readable.
  The margin is calculated relative to the width of the character.
- **line_margin** (<code>float</code>) – This parameter determines whether two lines are part of the same paragraph based on
  the distance between them.
  If the distance is less than the margin specified,
  the lines are considered to be part of the same paragraph.
  The margin is calculated relative to the height of a line.
- **boxes_flow** (<code>float | None</code>) – This parameter determines the importance of horizontal and vertical position when
  determining the order of text boxes. A value between -1.0 and +1.0 can be set,
  with -1.0 indicating that only horizontal position matters and +1.0 indicating
  that only vertical position matters. Setting the value to `None` will disable advanced
  layout analysis, and text boxes will be ordered based on the position of their bottom left corner.
- **detect_vertical** (<code>bool</code>) – This parameter determines whether vertical text should be considered during layout analysis.
- **all_texts** (<code>bool</code>) – If layout analysis should be performed on text in figures.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### detect_undecoded_cid_characters

```python
detect_undecoded_cid_characters(text: str) -> dict[str, Any]
```

Look for CID character sequences, that is, characters that haven't been properly decoded from their CID format.

This is useful to detect if the text extractor is not able to extract the text correctly, e.g. if the PDF uses
non-standard fonts.

A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like
searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor
needs. If that map is not available, the text extractor cannot decode the CID characters and will return them
as is.
See: https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output

**Parameters:**

- **text** (<code>str</code>) – The text to check for undecoded CID characters.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary containing detection results.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts PDF files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of PDF file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: Created Documents

## pptx

### PPTXToDocument

Converts PPTX files to Documents.

Usage example:

```python
from datetime import datetime

from haystack.components.converters.pptx import PPTXToDocument

converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'
```

#### __init__

```python
__init__(
    store_full_path: bool = False,
    link_format: Literal["markdown", "plain", "none"] = "none",
)
```

Create a PPTXToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
- **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options:
  - `"markdown"`: `[text](url)`
  - `"plain"`: `text (url)`
  - `"none"`: Only the text is extracted, link addresses are ignored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts PPTX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
**Returns:**

- – A dictionary with the following keys:
  - `documents`: Created Documents

## pypdf

### PyPDFExtractionMode

Bases: <code>Enum</code>

The mode to use for extracting text from a PDF.

#### from_str

```python
from_str(string: str) -> PyPDFExtractionMode
```

Convert a string to a PyPDFExtractionMode enum.

### PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses the PyPDF library.
You can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    *,
    extraction_mode: str | PyPDFExtractionMode = PyPDFExtractionMode.PLAIN,
    plain_mode_orientations: tuple = (0, 90, 180, 270),
    plain_mode_space_width: float = 200.0,
    layout_mode_space_vertically: bool = True,
    layout_mode_scale_weight: float = 1.25,
    layout_mode_strip_rotated: bool = True,
    layout_mode_font_height_weight: float = 1.0,
    store_full_path: bool = False
)
```

Create a PyPDFToDocument component.

**Parameters:**

- **extraction_mode** (<code>str | PyPDFExtractionMode</code>) – The mode to use for extracting text from a PDF.
  Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
- **plain_mode_orientations** (<code>tuple</code>) – Tuple of orientations to look for when extracting text from a PDF in plain mode.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- **plain_mode_space_width** (<code>float</code>) – Forces default space width if not extracted from font.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- **layout_mode_space_vertically** (<code>bool</code>) – Whether to include blank lines inferred from y distance + font height.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_scale_weight** (<code>float</code>) – Multiplier for string length when calculating weighted average character width.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_strip_rotated** (<code>bool</code>) – Layout mode does not support rotated text. Set to `False` to include rotated text anyway.
  If rotated text is discovered, layout will be degraded and a warning will be logged.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_font_height_weight** (<code>float</code>) – Multiplier for font height when calculating blank line height.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict()
```

Serializes the component to a dictionary.

**Returns:**

- – Dictionary with serialized data.

#### from_dict

```python
from_dict(data)
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** – Dictionary with serialized data.

**Returns:**

- – Deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts PDF files to documents.
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they are zipped together.
  For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: A list of converted documents.

## tika

### XHTMLParser

Bases: <code>HTMLParser</code>

Custom parser to extract pages from Tika XHTML content.

#### handle_starttag

```python
handle_starttag(tag: str, attrs: list[tuple])
```

Identify the start of a page div.

#### handle_endtag

```python
handle_endtag(tag: str)
```

Identify the end of a page div.

#### handle_data

```python
handle_data(data: str)
```

Populate the page content.

### TikaDocumentConverter

Converts files of different types to Documents.

This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
requires a running Tika server.
For more options on running Tika,
see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).
Usage example:

```python
from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'
```

#### __init__

```python
__init__(
    tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False
)
```

Create a TikaDocumentConverter component.

**Parameters:**

- **tika_url** (<code>str</code>) – Tika server URL.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
**Returns:**

- – A dictionary with the following keys:
  - `documents`: Created Documents

## txt

### TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'
```

#### __init__

```python
__init__(encoding: str = 'utf-8', store_full_path: bool = False)
```

Creates a TextFileToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the text files to convert.
  If the encoding is specified in the metadata of a source ByteStream,
  it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts text files to documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of text file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they are zipped together.
  For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: A list of converted documents.

## xlsx

### XLSXToDocument

Converts XLSX (Excel) files into Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is
created for each sheet. The content of the Document is the table, which can be saved in CSV or Markdown format.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.xlsx import XLSXToDocument

converter = XLSXToDocument()
results = converter.run(sources=["sample.xlsx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# ",A,B
# 1,col_a,col_b
# 2,1.5,test
# "
```

#### __init__

```python
__init__(
    table_format: Literal["csv", "markdown"] = "csv",
    sheet_name: str | int | list[str | int] | None = None,
    read_excel_kwargs: dict[str, Any] | None = None,
    table_format_kwargs: dict[str, Any] | None = None,
    *,
    link_format: Literal["markdown", "plain", "none"] = "none",
    store_full_path: bool = False
)
```

Creates an XLSXToDocument component.

**Parameters:**

- **table_format** (<code>Literal['csv', 'markdown']</code>) – The format to convert the Excel file to.
- **sheet_name** (<code>str | int | list\[str | int\] | None</code>) – The name of the sheet to read. If None, all sheets are read.
- **read_excel_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional arguments to pass to `pandas.read_excel`.
  See https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel
- **table_format_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to pass to the table format function.
  - If `table_format` is "csv", these arguments are passed to `pandas.DataFrame.to_csv`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
  - If `table_format` is "markdown", these arguments are passed to `pandas.DataFrame.to_markdown`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
- **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options:
  - `"markdown"`: `[text](url)`
  - `"plain"`: `text (url)`
  - `"none"`: Only the text is extracted; link addresses are ignored.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts an XLSX file to a Document.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: Created documents
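All of the `run` methods above apply the same rule to `meta`: a single dictionary is copied onto every produced document, while a list is zipped one-to-one with `sources`. The following plain-Python sketch illustrates that pairing rule only — `pair_meta` is a hypothetical helper, not part of the Haystack API:

```python
def pair_meta(sources, meta):
    """Sketch of the shared meta-handling rule described above."""
    if meta is None:
        # No metadata: each source gets an empty dict.
        return [{} for _ in sources]
    if isinstance(meta, dict):
        # A single dict applies to every source.
        return [dict(meta) for _ in sources]
    # A list must line up with the sources, element by element.
    if len(meta) != len(sources):
        raise ValueError("The length of the metadata list must match the number of sources.")
    return [dict(m) for m in meta]


sources = ["a.txt", "b.txt", "c.txt"]  # hypothetical file names
print(pair_meta(sources, {"lang": "en"}))
# [{'lang': 'en'}, {'lang': 'en'}, {'lang': 'en'}]
print(pair_meta(sources, [{"id": 1}, {"id": 2}, {"id": 3}]))
# [{'id': 1}, {'id': 2}, {'id': 3}]
```

This is why a metadata list of the wrong length raises an error: there would be no unambiguous way to assign the leftover entries.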