---
title: "Converters"
id: converters-api
description: "Various converters to transform data from one format to another."
slug: "/converters-api"
---

## azure

### AzureOCRDocumentConverter

Converts files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see the
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

<!-- test-ignore -->

```python
import os
from datetime import datetime
from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(
    endpoint=os.environ["CORE_AZURE_CS_ENDPOINT"],
    api_key=Secret.from_env_var("CORE_AZURE_CS_API_KEY"),
)
results = converter.run(
    sources=["test/test_files/pdf/react_paper.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    endpoint: str,
    api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
    model_id: str = "prebuilt-read",
    preceding_context_len: int = 3,
    following_context_len: int = 3,
    merge_multiple_column_headers: bool = True,
    page_layout: Literal["natural", "single_column"] = "natural",
    threshold_y: float | None = 0.05,
    store_full_path: bool = False,
) -> None
```

Creates an AzureOCRDocumentConverter component.

**Parameters:**

- **endpoint** (<code>str</code>) – The endpoint of your Azure resource.
- **api_key** (<code>Secret</code>) – The API key of your Azure resource.
- **model_id** (<code>str</code>) – The ID of the model you want to use. For a list of available models, see the
  [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
- **preceding_context_len** (<code>int</code>) – Number of lines before a table to include as preceding context
  (this will be added to the metadata).
- **following_context_len** (<code>int</code>) – Number of lines after a table to include as subsequent context
  (this will be added to the metadata).
- **merge_multiple_column_headers** (<code>bool</code>) – If `True`, merges multiple column header rows into a single row.
- **page_layout** (<code>Literal['natural', 'single_column']</code>) – The type of reading order to follow. Possible options:
    - `natural`: Uses the natural reading order determined by Azure.
    - `single_column`: Groups all lines with the same height on the page based on a threshold
      determined by `threshold_y`.
- **threshold_y** (<code>float | None</code>) – Only relevant if `page_layout` is set to `single_column`.
  The threshold, in inches, to determine if two recognized PDF elements are grouped into a
  single line. This is crucial for section headers or numbers which may be spatially separated
  from the remaining text on the horizontal axis.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will be
  zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: List of created Documents
    - `raw_azure_response`: List of raw Azure responses used to create the Documents

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOCRDocumentConverter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>AzureOCRDocumentConverter</code> – The deserialized component.

## csv

### CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.
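Row mode (`conversion_mode="row"`, described under `__init__` below) turns each CSV row into its own Document instead of one Document per file. Conceptually, the row splitting works like this stdlib sketch; the `rows_to_contents` helper is hypothetical and only illustrates the behavior, it is not part of Haystack:

```python
import csv
import io


def rows_to_contents(csv_text: str, content_column: str, delimiter: str = ",") -> list[str]:
    """Illustrative only: mimic row mode by taking one column's value
    per row as the would-be Document content."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [row[content_column] for row in reader]


sample = "title,body\nFirst,hello world\nSecond,goodbye world\n"
print(rows_to_contents(sample, content_column="body"))
# ['hello world', 'goodbye world']
```

In the real component, each extracted value becomes the `content` of a separate Document, with the remaining columns available as metadata.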
### Usage example

```python
from haystack.components.converters.csv import CSVToDocument
from datetime import datetime

converter = CSVToDocument()
results = converter.run(
    sources=["test/test_files/csv/sample_1.csv"], meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]

print(documents[0].content)
# >> 'col1,col2\nrow1,row1\nrow2,row2\n'
```

#### __init__

```python
__init__(
    encoding: str = "utf-8",
    store_full_path: bool = False,
    *,
    conversion_mode: Literal["file", "row"] = "file",
    delimiter: str = ",",
    quotechar: str = '"'
) -> None
```

Creates a CSVToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the CSV files to convert.
  If the encoding is specified in the metadata of a source ByteStream,
  it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
- **conversion_mode** (<code>Literal['file', 'row']</code>) – The conversion mode:
    - `"file"` (default): one Document per CSV file whose content is the raw CSV text.
    - `"row"`: convert each CSV row to its own Document (requires `content_column` in `run()`).
- **delimiter** (<code>str</code>) – CSV delimiter used when parsing in row mode (passed to `csv.DictReader`).
- **quotechar** (<code>str</code>) – CSV quote character used when parsing in row mode (passed to `csv.DictReader`).

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    content_column: str | None = None,
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, Any]
```

Converts CSV files to a Document (file mode) or to one Document per row (row mode).
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **content_column** (<code>str | None</code>) – **Required when** `conversion_mode="row"`.
  The column name whose values become `Document.content` for each row.
  The column must exist in the CSV header.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created documents

## docx

### DOCXMetadata

Describes the metadata of a DOCX file.
**Parameters:**

- **author** (<code>str</code>) – The author
- **category** (<code>str</code>) – The category
- **comments** (<code>str</code>) – The comments
- **content_status** (<code>str</code>) – The content status
- **created** (<code>str | None</code>) – The creation date (ISO formatted string)
- **identifier** (<code>str</code>) – The identifier
- **keywords** (<code>str</code>) – Available keywords
- **language** (<code>str</code>) – The language of the document
- **last_modified_by** (<code>str</code>) – User who last modified the document
- **last_printed** (<code>str | None</code>) – The last printed date (ISO formatted string)
- **modified** (<code>str | None</code>) – The last modification date (ISO formatted string)
- **revision** (<code>int</code>) – The revision number
- **subject** (<code>str</code>) – The subject
- **title** (<code>str</code>) – The title
- **version** (<code>str</code>) – The version

### DOCXTableFormat

Bases: <code>Enum</code>

Supported formats for storing DOCX tabular data in a Document.

#### from_str

```python
from_str(string: str) -> DOCXTableFormat
```

Convert a string to a DOCXTableFormat enum.

### DOCXLinkFormat

Bases: <code>Enum</code>

Supported formats for storing DOCX link information in a Document.

#### from_str

```python
from_str(string: str) -> DOCXLinkFormat
```

Convert a string to a DOCXLinkFormat enum.

### DOCXToDocument

Converts DOCX files to Documents.

Uses the `python-docx` library to convert the DOCX file to a document.
This component does not preserve page breaks in the original document.
Usage example:

```python
from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat
from datetime import datetime

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(
    sources=["test/test_files/docx/sample_docx.docx"], meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]

print(documents[0].content)
# >> 'This is a text from the DOCX file.'
```

#### __init__

```python
__init__(
    table_format: str | DOCXTableFormat = DOCXTableFormat.CSV,
    link_format: str | DOCXLinkFormat = DOCXLinkFormat.NONE,
    store_full_path: bool = False,
) -> None
```

Create a DOCXToDocument component.

**Parameters:**

- **table_format** (<code>str | DOCXTableFormat</code>) – The format for table output. Can be either DOCXTableFormat.MARKDOWN,
  DOCXTableFormat.CSV, "markdown", or "csv".
- **link_format** (<code>str | DOCXLinkFormat</code>) – The format for link output. Can be either:
    - `DOCXLinkFormat.MARKDOWN` or `"markdown"` to get `[text](address)`,
    - `DOCXLinkFormat.PLAIN` or `"plain"` to get `text (address)`,
    - `DOCXLinkFormat.NONE` or `"none"` to get text without links.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DOCXToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.
**Returns:**

- <code>DOCXToDocument</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts DOCX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created Documents

## file_to_file_content

### FileToFileContent

Converts files to FileContent objects to be included in ChatMessage objects.

### Usage example

<!-- test-ignore -->

```python
from haystack.components.converters import FileToFileContent

converter = FileToFileContent()

sources = ["document.pdf", "video.mp4"]

file_contents = converter.run(sources=sources)["file_contents"]
print(file_contents)

# [FileContent(base64_data='...',
#              mime_type='application/pdf',
#              filename='document.pdf',
#              extra={}),
# ...]
```

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    extra: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[FileContent]]
```

Converts files to FileContent objects.
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **extra** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional extra information to attach to the FileContent objects. Can be used to store provider-specific
  information.
  To avoid serialization issues, values should be JSON serializable.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the extra of all produced FileContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.

**Returns:**

- <code>dict\[str, list\[FileContent\]\]</code> – A dictionary with the following keys:
    - `file_contents`: A list of FileContent objects.

## html

### HTMLToDocument

Converts an HTML file to a Document.

Usage example:

```python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["test/test_files/html/paul_graham_superlinear.html"])
documents = results["documents"]

print(documents[0].content)
# >> 'This is a text from the HTML file.'
```

#### __init__

```python
__init__(
    extraction_kwargs: dict[str, Any] | None = None,
    store_full_path: bool = False,
) -> None
```

Create an HTMLToDocument component.

**Parameters:**

- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary containing keyword arguments to customize the extraction process. These
  are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see
  the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HTMLToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>HTMLToDocument</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    extraction_kwargs: dict[str, Any] | None = None,
) -> dict[str, Any]
```

Converts a list of HTML files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of HTML file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to customize the extraction process.
**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created Documents

## image/document_to_image

### DocumentToImageContent

Converts documents sourced from PDF and image files into ImageContents.

This component processes a list of documents and extracts visual content from supported file formats, converting
them into ImageContents that can be used for multimodal AI tasks. It handles both direct image files and PDF
documents by extracting specific pages as images.

Documents are expected to have metadata containing:

- The `file_path_meta_field` key with a valid file path that exists when combined with `root_path`
- A supported image format (MIME type must be one of the supported image types)
- For PDF files, a `page_number` key specifying which page to extract

### Usage example

<!-- test-ignore -->

```python
from haystack import Document
from haystack.components.converters.image.document_to_image import DocumentToImageContent

converter = DocumentToImageContent(
    file_path_meta_field="file_path",
    root_path="/data/files",
    detail="high",
    size=(800, 600)
)

documents = [
    Document(content="Optional description of image.jpg", meta={"file_path": "image.jpg"}),
    Document(content="Text content of page 1 of doc.pdf", meta={"file_path": "doc.pdf", "page_number": 1})
]

result = converter.run(documents)
image_contents = result["image_contents"]
# [ImageContent(
#    base64_image='/9j/4A...', mime_type='image/jpeg', detail='high', meta={'file_path': 'image.jpg'}
#  ),
#  ImageContent(
#    base64_image='/9j/4A...', mime_type='image/jpeg', detail='high',
#    meta={'page_number': 1, 'file_path': 'doc.pdf'}
#  )]
```

#### __init__

```python
__init__(
    *,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> None
```

Initialize the DocumentToImageContent component.

**Parameters:**

- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(documents: list[Document]) -> dict[str, list[ImageContent | None]]
```

Convert documents with image or PDF sources into ImageContent objects.

This method processes the input documents, extracting images from supported file formats and converting them
into ImageContent objects.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to process. Each document should have metadata containing at minimum
  a 'file_path_meta_field' key. PDF documents additionally require a 'page_number' key to specify which
  page to convert.
**Returns:**

- <code>dict\[str, list\[ImageContent | None\]\]</code> – Dictionary containing one key:
    - "image_contents": ImageContents created from the processed documents. These contain base64-encoded image
      data and metadata. The order corresponds to the order of the input documents.

**Raises:**

- <code>ValueError</code> – If any document is missing the required metadata keys, has an invalid file path, or has an unsupported
  MIME type. The error message will specify which document and what information is missing or incorrect.

## image/file_to_document

### ImageFileToDocument

Converts image file references into empty Document objects with associated metadata.

This component is useful in pipelines where image file paths need to be wrapped in `Document` objects to be
processed by downstream components such as the `SentenceTransformersImageDocumentEmbedder`.

It does **not** extract any content from the image files, instead it creates `Document` objects with `None` as
their content and attaches metadata such as file path and any user-provided values.

### Usage example

```python
from haystack.components.converters.image import ImageFileToDocument

converter = ImageFileToDocument()

sources = ["image.jpg", "another_image.png"]

result = converter.run(sources=sources)
documents = result["documents"]

print(documents)

# [Document(id=..., meta: {'file_path': 'image.jpg'}),
#  Document(id=..., meta: {'file_path': 'another_image.png'})]
```

#### __init__

```python
__init__(*, store_full_path: bool = False) -> None
```

Initialize the ImageFileToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
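The effect of `store_full_path` can be sketched in plain Python (illustrative only; the `to_empty_docs` helper is hypothetical and the real component produces `Document` objects, not dicts):

```python
import os


def to_empty_docs(sources: list[str], store_full_path: bool = False) -> list[dict]:
    # Illustrative only: no image content is read; each record gets
    # None content plus file-path metadata, like the produced Documents.
    docs = []
    for src in sources:
        path = src if store_full_path else os.path.basename(src)
        docs.append({"content": None, "meta": {"file_path": path}})
    return docs


print(to_empty_docs(["/data/files/image.jpg", "another_image.png"]))
# [{'content': None, 'meta': {'file_path': 'image.jpg'}},
#  {'content': None, 'meta': {'file_path': 'another_image.png'}}]
```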
#### run

```python
run(
    *,
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document]]
```

Convert image files into empty Document objects with metadata.

This method accepts image file references (as file paths or ByteStreams) and creates `Document` objects
without content. These documents are enriched with metadata derived from the input source and optional
user-provided metadata.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they are zipped together.
  For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary containing:
    - `documents`: A list of `Document` objects with empty content and associated metadata.

## image/file_to_image

### ImageFileToImageContent

Converts image files to ImageContent objects.

### Usage example

```python
from haystack.components.converters.image import ImageFileToImageContent

converter = ImageFileToImageContent()

sources = ["image.jpg", "another_image.png"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
#               mime_type='image/jpeg',
#               detail=None,
#               meta={'file_path': 'image.jpg'}),
# ...]
```

#### __init__

```python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> None
```

Create the ImageFileToImageContent component.

**Parameters:**

- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
  If not provided, the detail level will be the one set in the constructor.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
  If not provided, the size value will be the one set in the constructor.

**Returns:**

- <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys:
    - `image_contents`: A list of ImageContent objects.

## image/pdf_to_image

### PDFToImageContent

Converts PDF files to ImageContent objects.

### Usage example

```python
from haystack.components.converters.image import PDFToImageContent

converter = PDFToImageContent()

sources = ["file.pdf", "another_file.pdf"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
#               mime_type='application/pdf',
#               detail=None,
#               meta={'file_path': 'file.pdf', 'page_number': 1}),
# ...]
```

#### __init__

```python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
) -> None
```

Create the PDFToImageContent component.

**Parameters:**

- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages)
  will be skipped with a warning. For example, page_range=[1, 3] will convert only the first and third
  pages of the document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12']
  will convert pages 1, 2, 3, 5, 8, 10, 11, 12.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
  If not provided, the detail level will be the one set in the constructor.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
  If not provided, the size value will be the one set in the constructor.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages)
  will be skipped with a warning. For example, page_range=[1, 3] will convert only the first and third
  pages of the document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12']
  will convert pages 1, 2, 3, 5, 8, 10, 11, 12.
  If not provided, the page_range value will be the one set in the constructor.

**Returns:**

- <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys:
    - `image_contents`: A list of ImageContent objects.

## json

### JSONConverter

Converts one or more JSON files into a text document.
### Usage examples

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'
```

Optionally, you can also provide a `jq_schema` string to filter the JSON source files and `extra_meta_fields`
to extract from the filtered data:

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```

#### __init__

```python
__init__(
    jq_schema: str | None = None,
    content_key: str | None = None,
    extra_meta_fields: set[str] | Literal["*"] | None = None,
    store_full_path: bool = False,
) -> None
```

Creates a JSONConverter component.

An optional `jq_schema` can be provided to extract nested data in the JSON source files.
See the [official jq documentation](https://jqlang.github.io/jq/) for more info on the filter syntax.
If `jq_schema` is not set, the whole JSON source files will be used to extract content.

Optionally, you can provide a `content_key` to specify which key in the extracted object must
be set as the document's content.

If both `jq_schema` and `content_key` are set, the component will search for the `content_key` in
the JSON object extracted by `jq_schema`. If the extracted data is not a JSON object, it will be skipped.

If only `jq_schema` is set, the extracted data must be a scalar value. If it's a JSON object or array,
it will be skipped.

If only `content_key` is set, the source JSON file must be a JSON object, else it will be skipped.

`extra_meta_fields` can either be set to a set of strings or a literal `"*"` string.
If it's a set of strings, it must specify fields in the extracted objects that must be set in
the extracted documents. If a field is not found, the meta value will be `None`.
If set to `"*"`, all fields other than `content_key` found in the filtered JSON object will
be saved as metadata.

Initialization will fail if neither `jq_schema` nor `content_key` is set.

**Parameters:**

- **jq_schema** (<code>str | None</code>) – Optional jq filter string to extract content.
If not specified, the whole JSON object will be used to extract information.
- **content_key** (<code>str | None</code>) – Optional key to extract document content.
If `jq_schema` is specified, the `content_key` will be extracted from that object.
- **extra_meta_fields** (<code>set\[str\] | Literal['\*'] | None</code>) – An optional set of meta keys to extract from the content.
If `jq_schema` is specified, all keys will be extracted from that object.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> JSONConverter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>JSONConverter</code> – Deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts a list of JSON files to documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.
**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: A list of created documents.

## markdown

### MarkdownToDocument

Converts a Markdown file into a text Document.

Usage example:

```python
from haystack.components.converters import MarkdownToDocument
from datetime import datetime

converter = MarkdownToDocument()
results = converter.run(
    sources=["test/test_files/markdown/sample.md"], meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'
```

#### __init__

```python
__init__(
    table_to_single_line: bool = False,
    progress_bar: bool = True,
    store_full_path: bool = False,
) -> None
```

Create a MarkdownToDocument component.

**Parameters:**

- **table_to_single_line** (<code>bool</code>) – If True, converts table contents into a single line.
- **progress_bar** (<code>bool</code>) – If True, shows a progress bar when running.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts a list of Markdown files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: List of created Documents

## msg

### MSGToDocument

Converts Microsoft Outlook .msg files into Haystack Documents.

This component extracts email metadata (such as sender, recipients, CC, BCC, subject) and body content from .msg
files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg
file are extracted as ByteStream objects.

### Example Usage

```python
from haystack.components.converters.msg import MSGToDocument
from datetime import datetime

converter = MSGToDocument()
results = converter.run(sources=["test/test_files/msg/sample.msg"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
attachments = results["attachments"]
print(documents[0].content)
```

#### __init__

```python
__init__(store_full_path: bool = False) -> None
```

Creates an MSGToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document] | list[ByteStream]]
```

Converts MSG files to Documents.
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, list\[Document\] | list\[ByteStream\]\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents.
  - `attachments`: Created ByteStream objects from file attachments.

## multi_file_converter

### MultiFileConverter

A file converter that handles conversion of multiple file types.

The MultiFileConverter handles the following file types:

- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX

Usage example:

```python
from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test/test_files/txt/doc_1.txt", "test/test_files/pdf/sample_pdf_1.pdf"], meta={})
```

#### __init__

```python
__init__(encoding: str = 'utf-8', json_content_key: str = 'content') -> None
```

Initialize the MultiFileConverter.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding to use when reading files.
- **json_content_key** (<code>str</code>) – The key to use for the content field of a document when converting JSON files.
## openapi_functions

### OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must respect OpenAPI specification 3.0.0 or higher.
It can be specified in JSON or YAML format.
Each function must have:

- a unique operationId
- a description
- a requestBody and/or parameters
- a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the [official documentation](https://github.com/OAI/OpenAPI-Specification).
For more details on OpenAI function calling, see the [official documentation](https://platform.openai.com/docs/guides/function-calling).

Usage example:

```python
from haystack.components.converters import OpenAPIServiceToFunctions
from haystack.dataclasses.byte_stream import ByteStream

converter = OpenAPIServiceToFunctions()
spec = ByteStream.from_string(
    '{"openapi":"3.0.0","info":{"title":"API","version":"1.0.0"},"paths":{"/search":{"get":{"operationId":"search","summary":"Search","parameters":[{"name":"q","in":"query","required":true,"schema":{"type":"string"}}]}}}}'
)
result = converter.run(sources=[spec])
assert result["functions"]
```

#### __init__

```python
__init__() -> None
```

Create an OpenAPIServiceToFunctions component.

#### run

```python
run(sources: list[str | Path | ByteStream]) -> dict[str, Any]
```

Converts OpenAPI definitions to OpenAI function calling format.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).
**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `functions`: Function definitions in JSON object format
  - `openapi_specs`: OpenAPI specs in JSON/YAML object format with resolved references

**Raises:**

- <code>RuntimeError</code> – If the OpenAPI definitions cannot be downloaded or processed.
- <code>ValueError</code> – If the source type is not recognized or no functions are found in the OpenAPI definitions.

## output_adapter

### OutputAdaptationException

Bases: <code>Exception</code>

Exception raised when there is an error during output adaptation.

### OutputAdapter

Adapts output of a Component using Jinja templates.

Usage example:

```python
from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"
```

#### __init__

```python
__init__(
    template: str,
    output_type: TypeAlias,
    custom_filters: dict[str, Callable] | None = None,
    unsafe: bool = False,
) -> None
```

Create an OutputAdapter component.

**Parameters:**

- **template** (<code>str</code>) – A Jinja template that defines how to adapt the input data.
The variables in the template define the input of this instance, e.g. with this template:

  ```
  {{ documents[0].content }}
  ```

  the Component input will be `documents`.

- **output_type** (<code>TypeAlias</code>) – The type of output this instance will return.
- **custom_filters** (<code>dict\[str, Callable\] | None</code>) – A dictionary of custom Jinja filters used in the template.
- **unsafe** (<code>bool</code>) – Enable execution of arbitrary code in the Jinja template.
This should only be used if you trust the source of the template, as it can lead to remote code execution.

#### run

```python
run(**kwargs: Any) -> dict[str, Any]
```

Renders the Jinja template with the provided inputs.

**Parameters:**

- **kwargs** (<code>Any</code>) – Must contain all variables used in the `template` string.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `output`: Rendered Jinja template.

**Raises:**

- <code>OutputAdaptationException</code> – If template rendering fails.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OutputAdapter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>OutputAdapter</code> – The deserialized component.

## pdfminer

### PDFMinerToDocument

Converts PDF files to Documents.

Uses `pdfminer` compatible converters to convert PDF files to Documents.
See the [pdfminer documentation](https://pdfminersix.readthedocs.io/en/latest/) for more information.

Usage example:

```python
from haystack.components.converters.pdfminer import PDFMinerToDocument
from datetime import datetime

converter = PDFMinerToDocument()
results = converter.run(
    sources=["test/test_files/pdf/sample_pdf_1.pdf"], meta={"date_added": datetime.now().isoformat()}
)

print(results["documents"][0].content)
# >> 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    line_overlap: float = 0.5,
    char_margin: float = 2.0,
    line_margin: float = 0.5,
    word_margin: float = 0.1,
    boxes_flow: float | None = 0.5,
    detect_vertical: bool = True,
    all_texts: bool = False,
    store_full_path: bool = False,
) -> None
```

Create a PDFMinerToDocument component.

**Parameters:**

- **line_overlap** (<code>float</code>) – This parameter determines whether two characters are considered to be on
the same line based on the amount of overlap between them.
The overlap is calculated relative to the minimum height of both characters.
- **char_margin** (<code>float</code>) – Determines whether two characters are part of the same line based on the distance between them.
If the distance is less than the margin specified, the characters are considered to be on the same line.
The margin is calculated relative to the width of the character.
- **word_margin** (<code>float</code>) – Determines whether two characters on the same line are part of the same word
based on the distance between them. If the distance is greater than the margin specified,
an intermediate space will be added between them to make the text more readable.
The margin is calculated relative to the width of the character.
- **line_margin** (<code>float</code>) – This parameter determines whether two lines are part of the same paragraph based on
the distance between them. If the distance is less than the margin specified,
the lines are considered to be part of the same paragraph.
The margin is calculated relative to the height of a line.
- **boxes_flow** (<code>float | None</code>) – This parameter determines the importance of horizontal and vertical position when
determining the order of text boxes. A value between -1.0 and +1.0 can be set,
with -1.0 indicating that only horizontal position matters and +1.0 indicating
that only vertical position matters. Setting the value to 'None' will disable advanced
layout analysis, and text boxes will be ordered based on the position of their bottom left corner.
- **detect_vertical** (<code>bool</code>) – This parameter determines whether vertical text should be considered during layout analysis.
- **all_texts** (<code>bool</code>) – If layout analysis should be performed on text in figures.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### detect_undecoded_cid_characters

```python
detect_undecoded_cid_characters(text: str) -> dict[str, Any]
```

Look for CID character sequences, i.e. characters that haven't been properly decoded from their CID format.

This is useful to detect if the text extractor is not able to extract the text correctly, e.g. if the PDF uses
non-standard fonts.

A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like
searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor
needs.
If that map is not available, the text extractor cannot decode the CID characters and will return them
as is.

See the [pdfminer FAQ](https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output) for more details.

**Parameters:**

- **text** (<code>str</code>) – The text to check for undecoded CID characters.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary containing detection results.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts PDF files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of PDF file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents

## pptx

### PPTXToDocument

Converts PPTX files to Documents.
Usage example:

```python
from haystack.components.converters.pptx import PPTXToDocument
from datetime import datetime

converter = PPTXToDocument()
results = converter.run(
    sources=["test/test_files/pptx/sample_pptx.pptx"], meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]

print(documents[0].content)
# >> 'This is the text from the PPTX file.'
```

#### __init__

```python
__init__(
    store_full_path: bool = False,
    link_format: Literal["markdown", "plain", "none"] = "none",
) -> None
```

Create a PPTXToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.
- **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options:
  - `"markdown"`: `[text](url)`
  - `"plain"`: `text (url)`
  - `"none"`: Only the text is extracted, link addresses are ignored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts PPTX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents

## pypdf

### PyPDFExtractionMode

Bases: <code>Enum</code>

The mode to use for extracting text from a PDF.

#### from_str

```python
from_str(string: str) -> PyPDFExtractionMode
```

Convert a string to a PyPDFExtractionMode enum.

### PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses the PyPDF library.
You can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.pypdf import PyPDFToDocument
from datetime import datetime

converter = PyPDFToDocument()
results = converter.run(
    sources=["test/test_files/pdf/sample_pdf_1.pdf"], meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]

print(documents[0].content)
# >> 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    *,
    extraction_mode: str | PyPDFExtractionMode = PyPDFExtractionMode.PLAIN,
    plain_mode_orientations: tuple = (0, 90, 180, 270),
    plain_mode_space_width: float = 200.0,
    layout_mode_space_vertically: bool = True,
    layout_mode_scale_weight: float = 1.25,
    layout_mode_strip_rotated: bool = True,
    layout_mode_font_height_weight: float = 1.0,
    store_full_path: bool = False
) -> None
```

Create a PyPDFToDocument component.
**Parameters:**

- **extraction_mode** (<code>str | PyPDFExtractionMode</code>) – The mode to use for extracting text from a PDF.
Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
- **plain_mode_orientations** (<code>tuple</code>) – Tuple of orientations to look for when extracting text from a PDF in plain mode.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- **plain_mode_space_width** (<code>float</code>) – Forces default space width if not extracted from font.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- **layout_mode_space_vertically** (<code>bool</code>) – Whether to include blank lines inferred from y distance + font height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_scale_weight** (<code>float</code>) – Multiplier for string length when calculating weighted average character width.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_strip_rotated** (<code>bool</code>) – Layout mode does not support rotated text. Set to `False` to include rotated text anyway.
If rotated text is discovered, layout will be degraded and a warning will be logged.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_font_height_weight** (<code>float</code>) – Multiplier for font height when calculating blank line height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.
#### from_dict

```python
from_dict(data: dict[str, Any]) -> PyPDFToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>PyPDFToDocument</code> – Deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts PDF files to documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they are zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: A list of converted documents.

## tika

### XHTMLParser

Bases: <code>HTMLParser</code>

Custom parser to extract pages from Tika XHTML content.

#### handle_starttag

```python
handle_starttag(tag: str, attrs: list[tuple[str, str | None]]) -> None
```

Identify the start of a page div.

#### handle_endtag

```python
handle_endtag(tag: str) -> None
```

Identify the end of a page div.
#### handle_data

```python
handle_data(data: str) -> None
```

Populate the page content.

### TikaDocumentConverter

Converts files of different types to Documents.

This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
requires a running Tika server.
For more options on running Tika,
see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

Usage example:

<!-- test-ignore -->

```python
from haystack.components.converters.tika import TikaDocumentConverter
from datetime import datetime

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]

print(documents[0].content)
# >> 'This is a text from the docx file.'
```

#### __init__

```python
__init__(
    tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False
) -> None
```

Create a TikaDocumentConverter component.

**Parameters:**

- **tika_url** (<code>str</code>) – Tika server URL.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, its length must match the number of sources, because the two lists are zipped together.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents

## txt

### TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["test/test_files/txt/doc_1.txt"])
documents = results["documents"]

print(documents[0].content)
# >> 'This is the content from the txt file.'
```

#### __init__

```python
__init__(encoding: str = 'utf-8', store_full_path: bool = False) -> None
```

Creates a TextFileToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the text files to convert.
  If the encoding is specified in the metadata of a source ByteStream,
  it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
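The encoding precedence described above can be sketched in plain Python. This is an illustrative sketch under the stated assumption that a per-source `encoding` entry in a ByteStream's metadata takes priority over the component-level default; `decode_source` is a hypothetical helper for illustration, not part of the Haystack API.

```python
def decode_source(data: bytes, meta: dict, default_encoding: str = "utf-8") -> str:
    """Decode raw bytes, letting per-source metadata override the default encoding."""
    # A ByteStream-level "encoding" key wins over the component default.
    encoding = meta.get("encoding", default_encoding)
    return data.decode(encoding)


# Latin-1 bytes would be mis-decoded with the UTF-8 default, so the
# per-source metadata specifies the right codec.
text = decode_source("café".encode("latin-1"), meta={"encoding": "latin-1"})
print(text)  # café
```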
#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts text files to documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of text file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they're zipped together.
  For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: A list of converted documents.

## xlsx

### XLSXToDocument

Converts XLSX (Excel) files into Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is
created for each sheet. The content of each Document is the table, which can be saved in CSV or Markdown format.
### Usage example

```python
from haystack.components.converters.xlsx import XLSXToDocument
from datetime import datetime

converter = XLSXToDocument()
results = converter.run(
    sources=["test/test_files/xlsx/basic_tables_two_sheets.xlsx"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]

print(documents[0].content)
# >> ",A,B\n1,col_a,col_b\n2,1.5,test\n"
```

#### __init__

```python
__init__(
    table_format: Literal["csv", "markdown"] = "csv",
    sheet_name: str | int | list[str | int] | None = None,
    read_excel_kwargs: dict[str, Any] | None = None,
    table_format_kwargs: dict[str, Any] | None = None,
    *,
    link_format: Literal["markdown", "plain", "none"] = "none",
    store_full_path: bool = False
) -> None
```

Creates an XLSXToDocument component.

**Parameters:**

- **table_format** (<code>Literal['csv', 'markdown']</code>) – The format to convert the Excel file to.
- **sheet_name** (<code>str | int | list\[str | int\] | None</code>) – The name of the sheet to read. If None, all sheets are read.
- **read_excel_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional arguments to pass to `pandas.read_excel`.
  See https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel
- **table_format_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to pass to the table format function.
    - If `table_format` is "csv", these arguments are passed to `pandas.DataFrame.to_csv`.
      See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
    - If `table_format` is "markdown", these arguments are passed to `pandas.DataFrame.to_markdown`.
      See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
- **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options:
    - `"markdown"`: `[text](url)`
    - `"plain"`: `text (url)`
    - `"none"`: Only the text is extracted; link addresses are ignored.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts an XLSX file to a Document.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, because the two lists are zipped together.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: Created documents
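The `meta` semantics shared by all of these converters' `run` methods (a single dictionary is broadcast to every source, while a list is zipped one-to-one with `sources`) can be sketched as follows. This is an illustrative sketch of the documented behavior, not a Haystack function; `normalize_meta` is a hypothetical helper name.

```python
def normalize_meta(sources: list, meta) -> list[dict]:
    """Expand the meta argument into one dict per source."""
    if meta is None:
        # No metadata: every source gets an empty dict.
        return [{} for _ in sources]
    if isinstance(meta, dict):
        # A single dict is broadcast: each source gets its own copy.
        return [dict(meta) for _ in sources]
    # A list must line up with the sources, because the two are zipped.
    if len(meta) != len(sources):
        raise ValueError("Length of the meta list must match the number of sources.")
    return meta


print(normalize_meta(["a.xlsx", "b.xlsx"], {"team": "docs"}))
# [{'team': 'docs'}, {'team': 'docs'}]
```

For ByteStream sources, the converters additionally merge each stream's own `meta` into the corresponding output document.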