   1  ---
   2  title: Converters
   3  id: converters-api
   4  description: Various converters to transform data from one format to another.
   5  slug: "/converters-api"
   6  ---
   7  
   8  <a id="azure"></a>
   9  
  10  # Module azure
  11  
  12  <a id="azure.AzureOCRDocumentConverter"></a>
  13  
  14  ## AzureOCRDocumentConverter
  15  
  16  Converts files to documents using Azure's Document Intelligence service.
  17  
  18  Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.
  19  
  20  To use this component, you need an active Azure account
  21  and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see
  22  [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).
  23  
  24  ### Usage example
  25  
  26  ```python
  27  from haystack.components.converters import AzureOCRDocumentConverter
  28  from haystack.utils import Secret
      from datetime import datetime
  29  
  30  converter = AzureOCRDocumentConverter(endpoint="<url>", api_key=Secret.from_token("<your-api-key>"))
  31  results = converter.run(sources=["path/to/doc_with_images.pdf"], meta={"date_added": datetime.now().isoformat()})
  32  documents = results["documents"]
  33  print(documents[0].content)
  34  # 'This is a text from the PDF file.'
  35  ```
  36  
  37  <a id="azure.AzureOCRDocumentConverter.__init__"></a>
  38  
  39  #### AzureOCRDocumentConverter.\_\_init\_\_
  40  
  41  ```python
  42  def __init__(endpoint: str,
  43               api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
  44               model_id: str = "prebuilt-read",
  45               preceding_context_len: int = 3,
  46               following_context_len: int = 3,
  47               merge_multiple_column_headers: bool = True,
  48               page_layout: Literal["natural", "single_column"] = "natural",
  49               threshold_y: Optional[float] = 0.05,
  50               store_full_path: bool = False)
  51  ```
  52  
  53  Creates an AzureOCRDocumentConverter component.
  54  
  55  **Arguments**:
  56  
  57  - `endpoint`: The endpoint of your Azure resource.
  58  - `api_key`: The API key of your Azure resource.
  59  - `model_id`: The ID of the model you want to use. For a list of available models, see the [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
  61  - `preceding_context_len`: Number of lines before a table to include as preceding context
  62  (this will be added to the metadata).
  63  - `following_context_len`: Number of lines after a table to include as subsequent context
  64  (this will be added to the metadata).
  65  - `merge_multiple_column_headers`: If `True`, merges multiple column header rows into a single row.
  66  - `page_layout`: The type of reading order to follow. Possible options:
  67    - `natural`: Uses the natural reading order determined by Azure.
  68    - `single_column`: Groups all lines with the same height on the page based on a threshold
  69    determined by `threshold_y`.
  70  - `threshold_y`: Only relevant if `page_layout` is set to `single_column`.
  71  The threshold, in inches, to determine if two recognized PDF elements are grouped into a
  72  single line. This is crucial for section headers or numbers which may be spatially separated
  73  from the remaining text on the horizontal axis.
  74  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
  75  If False, only the file name is stored.
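To make the `single_column` grouping concrete, here is a minimal sketch. It is not the converter's actual implementation. It assumes each recognized element is a hypothetical `(text, x, y)` tuple with coordinates in inches, and it places an element on the current line when its vertical position is within `threshold_y` of the first element of that line. The helper name `group_into_lines` is invented for illustration.

```python
def group_into_lines(elements, threshold_y=0.05):
    """Group (text, x, y) tuples into lines by comparing y-coordinates."""
    lines = []
    # Walk the elements from top to bottom.
    for text, x, y in sorted(elements, key=lambda e: e[2]):
        if lines and abs(y - lines[-1]["y"]) <= threshold_y:
            # Close enough to the first element of the current line: same line.
            lines[-1]["items"].append((x, text))
        else:
            # Too far below the current line: start a new one.
            lines.append({"y": y, "items": [(x, text)]})
    # Within each line, order the pieces from left to right.
    return [" ".join(t for _, t in sorted(line["items"])) for line in lines]


elements = [
    ("Introduction", 1.50, 2.00),
    ("1.", 0.50, 2.02),  # section number, horizontally separated from its title
    ("This section explains the setup.", 0.50, 2.40),
]
lines = group_into_lines(elements)
print(lines)
# ['1. Introduction', 'This section explains the setup.']
```

With a smaller threshold, such as `threshold_y=0.01`, the section number and its title would end up on separate lines in this sketch.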
  76  
  77  <a id="azure.AzureOCRDocumentConverter.run"></a>
  78  
  79  #### AzureOCRDocumentConverter.run
  80  
  81  ```python
  82  @component.output_types(documents=list[Document],
  83                          raw_azure_response=list[dict])
  84  def run(sources: list[Union[str, Path, ByteStream]],
  85          meta: Optional[list[dict[str, Any]]] = None)
  86  ```
  87  
  88  Convert a list of files to Documents using Azure's Document Intelligence service.
  89  
  90  **Arguments**:
  91  
  92  - `sources`: List of file paths or ByteStream objects.
  93  - `meta`: Optional metadata to attach to the Documents.
  94  This value can be either a list of dictionaries or a single dictionary.
  95  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  96  If it's a list, the length of the list must match the number of sources, because the two lists will be
  97  zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
  98  
  99  **Returns**:
 100  
 101  A dictionary with the following keys:
 102  - `documents`: List of created Documents
 103  - `raw_azure_response`: List of raw Azure responses used to create the Documents
 104  
 105  <a id="azure.AzureOCRDocumentConverter.to_dict"></a>
 106  
 107  #### AzureOCRDocumentConverter.to\_dict
 108  
 109  ```python
 110  def to_dict() -> dict[str, Any]
 111  ```
 112  
 113  Serializes the component to a dictionary.
 114  
 115  **Returns**:
 116  
 117  Dictionary with serialized data.
 118  
 119  <a id="azure.AzureOCRDocumentConverter.from_dict"></a>
 120  
 121  #### AzureOCRDocumentConverter.from\_dict
 122  
 123  ```python
 124  @classmethod
 125  def from_dict(cls, data: dict[str, Any]) -> "AzureOCRDocumentConverter"
 126  ```
 127  
 128  Deserializes the component from a dictionary.
 129  
 130  **Arguments**:
 131  
 132  - `data`: The dictionary to deserialize from.
 133  
 134  **Returns**:
 135  
 136  The deserialized component.
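The `to_dict` and `from_dict` methods are intended to round-trip a component's configuration. The sketch below illustrates that pattern with a hypothetical `ExampleConverter` class and an assumed dictionary layout (`type` plus `init_parameters`). It ignores details such as how secrets like `api_key` are stored, so treat it as an illustration rather than the converter's real serialization code.

```python
from typing import Any


class ExampleConverter:
    """Hypothetical stand-in for a converter with two init parameters."""

    def __init__(self, model_id: str = "prebuilt-read", store_full_path: bool = False):
        self.model_id = model_id
        self.store_full_path = store_full_path

    def to_dict(self) -> dict[str, Any]:
        # Assumed layout: the class path plus the arguments needed to rebuild it.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "model_id": self.model_id,
                "store_full_path": self.store_full_path,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ExampleConverter":
        # Rebuild the component from the stored init parameters.
        return cls(**data["init_parameters"])


original = ExampleConverter(model_id="prebuilt-layout", store_full_path=True)
restored = ExampleConverter.from_dict(original.to_dict())
print(restored.model_id, restored.store_full_path)
# prebuilt-layout True
```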
 137  
 138  <a id="csv"></a>
 139  
 140  # Module csv
 141  
 142  <a id="csv.CSVToDocument"></a>
 143  
 144  ## CSVToDocument
 145  
 146  Converts CSV files to Documents.
 147  
 148  By default, it uses UTF-8 encoding when converting files, but
 149  you can also set a custom encoding.
 150  It can attach metadata to the resulting documents.
 151  
 152  ### Usage example
 153  
 154  ```python
 155  from haystack.components.converters.csv import CSVToDocument
      from datetime import datetime
 156  converter = CSVToDocument()
 157  results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
 158  documents = results["documents"]
 159  print(documents[0].content)
 160  # 'col1,col2\nrow1,row1\nrow2,row2\n'
 164  ```
 165  
 166  <a id="csv.CSVToDocument.__init__"></a>
 167  
 168  #### CSVToDocument.\_\_init\_\_
 169  
 170  ```python
 171  def __init__(encoding: str = "utf-8", store_full_path: bool = False)
 172  ```
 173  
 174  Creates a CSVToDocument component.
 175  
 176  **Arguments**:
 177  
 178  - `encoding`: The encoding of the CSV files to convert.
 179  If the encoding is specified in the metadata of a source ByteStream,
 180  it overrides this value.
 181  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
 182  If False, only the file name is stored.
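The effect of `store_full_path` can be pictured with a few lines of standard-library Python. This is a sketch under assumptions: the helper `source_meta` and the `file_path` metadata key are chosen for illustration and are not taken from this reference.

```python
import os


def source_meta(file_path: str, store_full_path: bool = False) -> dict:
    """Return the path-related metadata for one source file."""
    # Keep the whole path only when asked to; otherwise keep the file name.
    stored = file_path if store_full_path else os.path.basename(file_path)
    return {"file_path": stored}


print(source_meta("data/reports/sample.csv"))
# {'file_path': 'sample.csv'}
print(source_meta("data/reports/sample.csv", store_full_path=True))
# {'file_path': 'data/reports/sample.csv'}
```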
 183  
 184  <a id="csv.CSVToDocument.run"></a>
 185  
 186  #### CSVToDocument.run
 187  
 188  ```python
 189  @component.output_types(documents=list[Document])
 190  def run(sources: list[Union[str, Path, ByteStream]],
 191          meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
 192  ```
 193  
 194  Converts a CSV file to a Document.
 195  
 196  **Arguments**:
 197  
 198  - `sources`: List of file paths or ByteStream objects.
 199  - `meta`: Optional metadata to attach to the documents.
 200  This value can be either a list of dictionaries or a single dictionary.
 201  If it's a single dictionary, its content is added to the metadata of all produced documents.
 202  If it's a list, the length of the list must match the number of sources, because the two lists will
 203  be zipped.
 204  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.
 205  
 206  **Returns**:
 207  
 208  A dictionary with the following keys:
 209  - `documents`: Created documents
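The two accepted shapes of `meta` can be sketched as follows. The helper `normalize_meta` is hypothetical and only mirrors the rules described above: a single dictionary is applied to every source, while a list must have one entry per source so that it can be zipped with `sources`.

```python
from typing import Any, Optional, Union


def normalize_meta(
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]], sources_count: int
) -> list[dict[str, Any]]:
    """Expand `meta` into one dictionary per source."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the metadata list must match the number of sources.")
    return list(meta)


sources = ["a.csv", "b.csv"]
shared = normalize_meta({"lang": "en"}, len(sources))
per_source = normalize_meta([{"id": 1}, {"id": 2}], len(sources))
paired = list(zip(sources, per_source))
print(shared)
# [{'lang': 'en'}, {'lang': 'en'}]
print(paired)
# [('a.csv', {'id': 1}), ('b.csv', {'id': 2})]
```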
 210  
 211  <a id="docx"></a>
 212  
 213  # Module docx
 214  
 215  <a id="docx.DOCXMetadata"></a>
 216  
 217  ## DOCXMetadata
 218  
 219  Describes the metadata of a DOCX file.
 220  
 221  **Arguments**:
 222  
 223  - `author`: The author
 224  - `category`: The category
 225  - `comments`: The comments
 226  - `content_status`: The content status
 227  - `created`: The creation date (ISO formatted string)
 228  - `identifier`: The identifier
 229  - `keywords`: Available keywords
 230  - `language`: The language of the document
 231  - `last_modified_by`: User who last modified the document
 232  - `last_printed`: The last printed date (ISO formatted string)
 233  - `modified`: The last modification date (ISO formatted string)
 234  - `revision`: The revision number
 235  - `subject`: The subject
 236  - `title`: The title
 237  - `version`: The version
 238  
 239  <a id="docx.DOCXTableFormat"></a>
 240  
 241  ## DOCXTableFormat
 242  
 243  Supported formats for storing DOCX tabular data in a Document.
 244  
 245  <a id="docx.DOCXTableFormat.from_str"></a>
 246  
 247  #### DOCXTableFormat.from\_str
 248  
 249  ```python
 250  @staticmethod
 251  def from_str(string: str) -> "DOCXTableFormat"
 252  ```
 253  
 254  Convert a string to a DOCXTableFormat enum.
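A string-to-enum conversion of this kind usually looks like the sketch below. The `TableFormat` class is a hypothetical stand-in, and raising `ValueError` for unknown strings is an assumption of this example. Only the values `"markdown"` and `"csv"` come from this reference.

```python
from enum import Enum


class TableFormat(Enum):
    """Hypothetical stand-in for an enum with a from_str helper."""

    MARKDOWN = "markdown"
    CSV = "csv"

    @staticmethod
    def from_str(string: str) -> "TableFormat":
        # Map each enum value to its member, then look the string up.
        formats = {fmt.value: fmt for fmt in TableFormat}
        table_format = formats.get(string)
        if table_format is None:
            raise ValueError(
                f"Unknown table format '{string}'. Supported formats: {list(formats)}"
            )
        return table_format


print(TableFormat.from_str("csv"))
# TableFormat.CSV
```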
 255  
 256  <a id="docx.DOCXLinkFormat"></a>
 257  
 258  ## DOCXLinkFormat
 259  
 260  Supported formats for storing DOCX link information in a Document.
 261  
 262  <a id="docx.DOCXLinkFormat.from_str"></a>
 263  
 264  #### DOCXLinkFormat.from\_str
 265  
 266  ```python
 267  @staticmethod
 268  def from_str(string: str) -> "DOCXLinkFormat"
 269  ```
 270  
 271  Convert a string to a DOCXLinkFormat enum.
 272  
 273  <a id="docx.DOCXToDocument"></a>
 274  
 275  ## DOCXToDocument
 276  
 277  Converts DOCX files to Documents.
 278  
 279  Uses the `python-docx` library to convert the DOCX file to a document.
 280  This component does not preserve page breaks in the original document.
 281  
 282  Usage example:
 283  ```python
 284  from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat
      from datetime import datetime
 285  
 286  converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
 287  results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
 288  documents = results["documents"]
 289  print(documents[0].content)
 290  # 'This is a text from the DOCX file.'
 291  ```
 292  
 293  <a id="docx.DOCXToDocument.__init__"></a>
 294  
 295  #### DOCXToDocument.\_\_init\_\_
 296  
 297  ```python
 298  def __init__(table_format: Union[str, DOCXTableFormat] = DOCXTableFormat.CSV,
 299               link_format: Union[str, DOCXLinkFormat] = DOCXLinkFormat.NONE,
 300               store_full_path: bool = False)
 301  ```
 302  
 303  Create a DOCXToDocument component.
 304  
 305  **Arguments**:
 306  
 307  - `table_format`: The format for table output. Can be either DOCXTableFormat.MARKDOWN,
 308  DOCXTableFormat.CSV, "markdown", or "csv".
 309  - `link_format`: The format for link output. Can be either:
 310  DOCXLinkFormat.MARKDOWN or "markdown" to get `[text](address)`,
 311  DOCXLinkFormat.PLAIN or "plain" to get text (address),
 312  DOCXLinkFormat.NONE or "none" to get text without links.
 313  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
 314  If False, only the file name is stored.
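The three `link_format` options produce the following text shapes. The function `format_link` is a hypothetical helper written for this illustration. It is not part of the converter.

```python
def format_link(text: str, address: str, link_format: str = "none") -> str:
    """Render a hyperlink in one of the three documented styles."""
    if link_format == "markdown":
        return f"[{text}]({address})"
    if link_format == "plain":
        return f"{text} ({address})"
    if link_format == "none":
        # Drop the address and keep only the visible text.
        return text
    raise ValueError(f"Unknown link format: {link_format}")


for style in ("markdown", "plain", "none"):
    print(format_link("Haystack", "https://haystack.deepset.ai", style))
# [Haystack](https://haystack.deepset.ai)
# Haystack (https://haystack.deepset.ai)
# Haystack
```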
 315  
 316  <a id="docx.DOCXToDocument.to_dict"></a>
 317  
 318  #### DOCXToDocument.to\_dict
 319  
 320  ```python
 321  def to_dict() -> dict[str, Any]
 322  ```
 323  
 324  Serializes the component to a dictionary.
 325  
 326  **Returns**:
 327  
 328  Dictionary with serialized data.
 329  
 330  <a id="docx.DOCXToDocument.from_dict"></a>
 331  
 332  #### DOCXToDocument.from\_dict
 333  
 334  ```python
 335  @classmethod
 336  def from_dict(cls, data: dict[str, Any]) -> "DOCXToDocument"
 337  ```
 338  
 339  Deserializes the component from a dictionary.
 340  
 341  **Arguments**:
 342  
 343  - `data`: The dictionary to deserialize from.
 344  
 345  **Returns**:
 346  
 347  The deserialized component.
 348  
 349  <a id="docx.DOCXToDocument.run"></a>
 350  
 351  #### DOCXToDocument.run
 352  
 353  ```python
 354  @component.output_types(documents=list[Document])
 355  def run(sources: list[Union[str, Path, ByteStream]],
 356          meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
 357  ```
 358  
 359  Converts DOCX files to Documents.
 360  
 361  **Arguments**:
 362  
 363  - `sources`: List of file paths or ByteStream objects.
 364  - `meta`: Optional metadata to attach to the Documents.
 365  This value can be either a list of dictionaries or a single dictionary.
 366  If it's a single dictionary, its content is added to the metadata of all produced Documents.
 367  If it's a list, the length of the list must match the number of sources, because the two lists will
 368  be zipped.
 369  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
 370  
 371  **Returns**:
 372  
 373  A dictionary with the following keys:
 374  - `documents`: Created Documents
 375  
 376  <a id="html"></a>
 377  
 378  # Module html
 379  
 380  <a id="html.HTMLToDocument"></a>
 381  
 382  ## HTMLToDocument
 383  
 384  Converts an HTML file to a Document.
 385  
 386  Usage example:
 387  ```python
 388  from haystack.components.converters import HTMLToDocument
 389  
 390  converter = HTMLToDocument()
 391  results = converter.run(sources=["path/to/sample.html"])
 392  documents = results["documents"]
 393  print(documents[0].content)
 394  # 'This is a text from the HTML file.'
 395  ```
 396  
 397  <a id="html.HTMLToDocument.__init__"></a>
 398  
 399  #### HTMLToDocument.\_\_init\_\_
 400  
 401  ```python
 402  def __init__(extraction_kwargs: Optional[dict[str, Any]] = None,
 403               store_full_path: bool = False)
 404  ```
 405  
 406  Create an HTMLToDocument component.
 407  
 408  **Arguments**:
 409  
 410  - `extraction_kwargs`: A dictionary containing keyword arguments to customize the extraction process. These
 411  are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see
 412  the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
 413  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
 414  If False, only the file name is stored.
 415  
 416  <a id="html.HTMLToDocument.to_dict"></a>
 417  
 418  #### HTMLToDocument.to\_dict
 419  
 420  ```python
 421  def to_dict() -> dict[str, Any]
 422  ```
 423  
 424  Serializes the component to a dictionary.
 425  
 426  **Returns**:
 427  
 428  Dictionary with serialized data.
 429  
 430  <a id="html.HTMLToDocument.from_dict"></a>
 431  
 432  #### HTMLToDocument.from\_dict
 433  
 434  ```python
 435  @classmethod
 436  def from_dict(cls, data: dict[str, Any]) -> "HTMLToDocument"
 437  ```
 438  
 439  Deserializes the component from a dictionary.
 440  
 441  **Arguments**:
 442  
 443  - `data`: The dictionary to deserialize from.
 444  
 445  **Returns**:
 446  
 447  The deserialized component.
 448  
 449  <a id="html.HTMLToDocument.run"></a>
 450  
 451  #### HTMLToDocument.run
 452  
 453  ```python
 454  @component.output_types(documents=list[Document])
 455  def run(sources: list[Union[str, Path, ByteStream]],
 456          meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None,
 457          extraction_kwargs: Optional[dict[str, Any]] = None)
 458  ```
 459  
 460  Converts a list of HTML files to Documents.
 461  
 462  **Arguments**:
 463  
 464  - `sources`: List of HTML file paths or ByteStream objects.
 465  - `meta`: Optional metadata to attach to the Documents.
 466  This value can be either a list of dictionaries or a single dictionary.
 467  If it's a single dictionary, its content is added to the metadata of all produced Documents.
 468  If it's a list, the length of the list must match the number of sources, because the two lists will
 469  be zipped.
 470  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
 471  - `extraction_kwargs`: Additional keyword arguments to customize the extraction process.
 472  
 473  **Returns**:
 474  
 475  A dictionary with the following keys:
 476  - `documents`: Created Documents
 477  
 478  <a id="json"></a>
 479  
 480  # Module json
 481  
 482  <a id="json.JSONConverter"></a>
 483  
 484  ## JSONConverter
 485  
 486  Converts one or more JSON files into a text document.
 487  
 488  ### Usage examples
 489  
 490  ```python
 491  import json
 492  
 493  from haystack.components.converters import JSONConverter
 494  from haystack.dataclasses import ByteStream
 495  
 496  source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))
 497  
 498  converter = JSONConverter(content_key="text")
 499  results = converter.run(sources=[source])
 500  documents = results["documents"]
 501  print(documents[0].content)
 502  # 'This is the content of my document'
 503  ```
 504  
 505  Optionally, you can also provide a `jq_schema` string to filter the JSON source files and `extra_meta_fields`
 506  to extract from the filtered data:
 507  
 508  ```python
 509  import json
 510  
 511  from haystack.components.converters import JSONConverter
 512  from haystack.dataclasses import ByteStream
 513  
 514  data = {
 515      "laureates": [
 516          {
 517              "firstname": "Enrico",
 518              "surname": "Fermi",
 519              "motivation": "for his demonstrations of the existence of new radioactive elements produced "
 520              "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
 521              " slow neutrons",
 522          },
 523          {
 524              "firstname": "Rita",
 525              "surname": "Levi-Montalcini",
 526              "motivation": "for their discoveries of growth factors",
 527          },
 528      ],
 529  }
 530  source = ByteStream.from_string(json.dumps(data))
 531  converter = JSONConverter(
 532      jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
 533  )
 534  
 535  results = converter.run(sources=[source])
 536  documents = results["documents"]
 537  print(documents[0].content)
 538  # 'for his demonstrations of the existence of new radioactive elements produced by
 539  # neutron irradiation, and for his related discovery of nuclear reactions brought
 540  # about by slow neutrons'
 541  
 542  print(documents[0].meta)
 543  # {'firstname': 'Enrico', 'surname': 'Fermi'}
 544  
 545  print(documents[1].content)
 546  # 'for their discoveries of growth factors'
 547  
 548  print(documents[1].meta)
 549  # {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
 550  ```
 551  
 552  <a id="json.JSONConverter.__init__"></a>
 553  
 554  #### JSONConverter.\_\_init\_\_
 555  
 556  ```python
 557  def __init__(jq_schema: Optional[str] = None,
 558               content_key: Optional[str] = None,
 559               extra_meta_fields: Optional[Union[set[str], Literal["*"]]] = None,
 560               store_full_path: bool = False)
 561  ```
 562  
 563  Creates a JSONConverter component.
 564  
 565  An optional `jq_schema` can be provided to extract nested data in the JSON source files.
 566  See the [official jq documentation](https://jqlang.github.io/jq/) for more info on the filters syntax.
 567  If `jq_schema` is not set, whole JSON source files will be used to extract content.
 568  
 569  Optionally, you can provide a `content_key` to specify which key in the extracted object must
 570  be set as the document's content.
 571  
 572  If both `jq_schema` and `content_key` are set, the component will search for the `content_key` in
 573  the JSON object extracted by `jq_schema`. If the extracted data is not a JSON object, it will be skipped.
 574  
 575  If only `jq_schema` is set, the extracted data must be a scalar value. If it's a JSON object or array,
 576  it will be skipped.
 577  
 578  If only `content_key` is set, the source JSON file must be a JSON object, else it will be skipped.
 579  
 580  `extra_meta_fields` can either be set to a set of strings or a literal `"*"` string.
 581  If it's a set of strings, it must specify fields in the extracted objects that must be set in
 582  the extracted documents. If a field is not found, the meta value will be `None`.
 583  If set to `"*"`, all fields that are not `content_key` found in the filtered JSON object will
 584  be saved as metadata.
 585  
 586  Initialization will fail if neither `jq_schema` nor `content_key` is set.
 587  
 588  **Arguments**:
 589  
 590  - `jq_schema`: Optional jq filter string to extract content.
 591  If not specified, whole JSON object will be used to extract information.
 592  - `content_key`: Optional key to extract document content.
 593  If `jq_schema` is specified, the `content_key` will be extracted from that object.
 594  - `extra_meta_fields`: An optional set of meta keys to extract from the content.
 595  If `jq_schema` is specified, all keys will be extracted from that object.
 596  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
 597  If False, only the file name is stored.
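The interplay of `content_key` and `extra_meta_fields` can be sketched in plain Python. This is an illustration under assumptions, not the converter's code. The helper `extract_documents` is hypothetical, the jq filter `.laureates[]` is replaced by a direct dictionary lookup, and skipping objects that lack the content key is an assumption of this sketch.

```python
def extract_documents(objects, content_key, extra_meta_fields=None):
    """Return (content, meta) pairs for each JSON object in `objects`."""
    documents = []
    for obj in objects:
        # Values that are not JSON objects are skipped, as are objects
        # without the content key (the latter is an assumption of this sketch).
        if not isinstance(obj, dict) or content_key not in obj:
            continue
        if extra_meta_fields == "*":
            # Keep every field except the one used as content.
            meta = {k: v for k, v in obj.items() if k != content_key}
        elif extra_meta_fields:
            # Missing fields become None.
            meta = {field: obj.get(field) for field in extra_meta_fields}
        else:
            meta = {}
        documents.append((obj[content_key], meta))
    return documents


data = {
    "laureates": [
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
        "not an object",
    ]
}
objects = data["laureates"]  # stand-in for the jq filter ".laureates[]"

selected = extract_documents(objects, "motivation", {"firstname", "birthplace"})
everything = extract_documents(objects, "motivation", "*")
print(everything)
# [('for their discoveries of growth factors', {'firstname': 'Rita', 'surname': 'Levi-Montalcini'})]
```

In `selected`, the `birthplace` field is absent from the source object, so its meta value is `None`.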
 598  
 599  <a id="json.JSONConverter.to_dict"></a>
 600  
 601  #### JSONConverter.to\_dict
 602  
 603  ```python
 604  def to_dict() -> dict[str, Any]
 605  ```
 606  
 607  Serializes the component to a dictionary.
 608  
 609  **Returns**:
 610  
 611  Dictionary with serialized data.
 612  
 613  <a id="json.JSONConverter.from_dict"></a>
 614  
 615  #### JSONConverter.from\_dict
 616  
 617  ```python
 618  @classmethod
 619  def from_dict(cls, data: dict[str, Any]) -> "JSONConverter"
 620  ```
 621  
 622  Deserializes the component from a dictionary.
 623  
 624  **Arguments**:
 625  
 626  - `data`: Dictionary to deserialize from.
 627  
 628  **Returns**:
 629  
 630  Deserialized component.
 631  
 632  <a id="json.JSONConverter.run"></a>
 633  
 634  #### JSONConverter.run
 635  
 636  ```python
 637  @component.output_types(documents=list[Document])
 638  def run(sources: list[Union[str, Path, ByteStream]],
 639          meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
 640  ```
 641  
 642  Converts a list of JSON files to documents.
 643  
 644  **Arguments**:
 645  
 646  - `sources`: A list of file paths or ByteStream objects.
 647  - `meta`: Optional metadata to attach to the documents.
 648  This value can be either a list of dictionaries or a single dictionary.
 649  If it's a single dictionary, its content is added to the metadata of all produced documents.
 650  If it's a list, the length of the list must match the number of sources.
 651  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.
 652  
 653  **Returns**:
 654  
 655  A dictionary with the following keys:
 656  - `documents`: A list of created documents.
 657  
 658  <a id="markdown"></a>
 659  
 660  # Module markdown
 661  
 662  <a id="markdown.MarkdownToDocument"></a>
 663  
 664  ## MarkdownToDocument
 665  
 666  Converts a Markdown file into a text Document.
 667  
 668  Usage example:
 669  ```python
 670  from haystack.components.converters import MarkdownToDocument
 671  from datetime import datetime
 672  
 673  converter = MarkdownToDocument()
 674  results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
 675  documents = results["documents"]
 676  print(documents[0].content)
 677  # 'This is a text from the markdown file.'
 678  ```
 679  
 680  <a id="markdown.MarkdownToDocument.__init__"></a>
 681  
 682  #### MarkdownToDocument.\_\_init\_\_
 683  
 684  ```python
 685  def __init__(table_to_single_line: bool = False,
 686               progress_bar: bool = True,
 687               store_full_path: bool = False)
 688  ```
 689  
 690  Create a MarkdownToDocument component.
 691  
 692  **Arguments**:
 693  
 694  - `table_to_single_line`: If True, converts table contents into a single line.
 695  - `progress_bar`: If True, shows a progress bar when running.
 696  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
 697  If False, only the file name is stored.
 698  
 699  <a id="markdown.MarkdownToDocument.run"></a>
 700  
 701  #### MarkdownToDocument.run
 702  
 703  ```python
 704  @component.output_types(documents=list[Document])
 705  def run(sources: list[Union[str, Path, ByteStream]],
 706          meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
 707  ```
 708  
 709  Converts a list of Markdown files to Documents.
 710  
 711  **Arguments**:
 712  
 713  - `sources`: List of file paths or ByteStream objects.
 714  - `meta`: Optional metadata to attach to the Documents.
 715  This value can be either a list of dictionaries or a single dictionary.
 716  If it's a single dictionary, its content is added to the metadata of all produced Documents.
 717  If it's a list, the length of the list must match the number of sources, because the two lists will
 718  be zipped.
 719  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
 720  
 721  **Returns**:
 722  
 723  A dictionary with the following keys:
 724  - `documents`: List of created Documents
 725  
 726  <a id="msg"></a>
 727  
 728  # Module msg
 729  
 730  <a id="msg.MSGToDocument"></a>
 731  
 732  ## MSGToDocument
 733  
 734  Converts Microsoft Outlook .msg files into Haystack Documents.
 735  
 736  This component extracts email metadata (such as sender, recipients, CC, BCC, subject) and body content from .msg
 737  files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg
 738  file are extracted as ByteStream objects.
 739  
 740  ### Example Usage
 741  
 742  ```python
 743  from haystack.components.converters.msg import MSGToDocument
 744  from datetime import datetime
 745  
 746  converter = MSGToDocument()
 747  results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()})
 748  documents = results["documents"]
 749  attachments = results["attachments"]
 750  print(documents[0].content)
 751  ```
 752  
 753  <a id="msg.MSGToDocument.__init__"></a>
 754  
 755  #### MSGToDocument.\_\_init\_\_
 756  
 757  ```python
 758  def __init__(store_full_path: bool = False) -> None
 759  ```
 760  
 761  Creates an MSGToDocument component.
 762  
 763  **Arguments**:
 764  
 765  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
 766  If False, only the file name is stored.
 767  
 768  <a id="msg.MSGToDocument.run"></a>
 769  
 770  #### MSGToDocument.run
 771  
 772  ```python
 773  @component.output_types(documents=list[Document], attachments=list[ByteStream])
 774  def run(
 775      sources: list[Union[str, Path, ByteStream]],
 776      meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
 777  ) -> dict[str, Union[list[Document], list[ByteStream]]]
 778  ```
 779  
 780  Converts MSG files to Documents.
 781  
 782  **Arguments**:
 783  
 784  - `sources`: List of file paths or ByteStream objects.
 785  - `meta`: Optional metadata to attach to the Documents.
 786  This value can be either a list of dictionaries or a single dictionary.
 787  If it's a single dictionary, its content is added to the metadata of all produced Documents.
 788  If it's a list, the length of the list must match the number of sources, because the two lists will
 789  be zipped.
 790  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
 791  
 792  **Returns**:
 793  
 794  A dictionary with the following keys:
 795  - `documents`: Created Documents.
 796  - `attachments`: Created ByteStream objects from file attachments.
 797  
 798  <a id="multi_file_converter"></a>
 799  
 800  # Module multi\_file\_converter
 801  
 802  <a id="multi_file_converter.MultiFileConverter"></a>
 803  
 804  ## MultiFileConverter
 805  
 806  A file converter that handles conversion of multiple file types.
 807  
 808  The MultiFileConverter handles the following file types:
 809  - CSV
 810  - DOCX
 811  - HTML
 812  - JSON
 813  - MD
 814  - TEXT
 815  - PDF (no OCR)
 816  - PPTX
 817  - XLSX
 818  
 819  Usage example:
 820  ```python
 821  from haystack.super_components.converters import MultiFileConverter
 822  
 823  converter = MultiFileConverter()
 824  converter.run(sources=["test.txt", "test.pdf"], meta={})
 825  ```
 826  
 827  <a id="multi_file_converter.MultiFileConverter.__init__"></a>
 828  
 829  #### MultiFileConverter.\_\_init\_\_
 830  
 831  ```python
 832  def __init__(encoding: str = "utf-8",
 833               json_content_key: str = "content") -> None
 834  ```
 835  
 836  Initialize the MultiFileConverter.
 837  
 838  **Arguments**:
 839  
 840  - `encoding`: The encoding to use when reading files.
 841  - `json_content_key`: The key in a JSON file whose value is used as the document's content when converting JSON files.
 842  
 843  <a id="openapi_functions"></a>
 844  
 845  # Module openapi\_functions
 846  
 847  <a id="openapi_functions.OpenAPIServiceToFunctions"></a>
 848  
 849  ## OpenAPIServiceToFunctions
 850  
 851  Converts OpenAPI service definitions to a format suitable for OpenAI function calling.
 852  
 853  The definition must respect OpenAPI specification 3.0.0 or higher.
 854  It can be specified in JSON or YAML format.
 855  Each function must have:
 856  - a unique `operationId`
 857  - a `description`
 858  - a `requestBody` and/or `parameters`
 859  - a schema for the `requestBody` and/or `parameters`
      
 860  For more details on OpenAPI specification see the [official documentation](https://github.com/OAI/OpenAPI-Specification).
 861  For more details on OpenAI function calling see the [official documentation](https://platform.openai.com/docs/guides/function-calling).
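The per-function requirements above can be checked with a short sketch like the following. The helper `missing_requirements` is hypothetical and only inspects a single operation object given as a plain dictionary. It does not verify that `operationId` values are unique across the whole definition, and it is not how the component itself validates input.

```python
def missing_requirements(operation: dict) -> list:
    """List which of the required pieces an OpenAPI operation object lacks."""
    missing = []
    if not operation.get("operationId"):
        missing.append("operationId")
    if not operation.get("description"):
        missing.append("description")

    parameters = operation.get("parameters", [])
    body_content = operation.get("requestBody", {}).get("content", {})
    if not parameters and not body_content:
        missing.append("requestBody and/or parameters")

    # Every parameter and every request body media type needs a schema.
    schemas_ok = all("schema" in p for p in parameters) and all(
        "schema" in media for media in body_content.values()
    )
    if not schemas_ok:
        missing.append("schema")
    return missing


complete = {
    "operationId": "search",
    "description": "Search the web for a query",
    "parameters": [{"name": "q", "in": "query", "schema": {"type": "string"}}],
}
incomplete = {"operationId": "ping", "parameters": [{"name": "q", "in": "query"}]}

print(missing_requirements(complete))
# []
print(missing_requirements(incomplete))
# ['description', 'schema']
```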
 862  
 863  Usage example:
 864  ```python
 865  from haystack.components.converters import OpenAPIServiceToFunctions
 866  
 867  converter = OpenAPIServiceToFunctions()
 868  result = converter.run(sources=["path/to/openapi_definition.yaml"])
 869  assert result["functions"]
 870  ```
 871  
 872  <a id="openapi_functions.OpenAPIServiceToFunctions.__init__"></a>
 873  
 874  #### OpenAPIServiceToFunctions.\_\_init\_\_
 875  
 876  ```python
 877  def __init__()
 878  ```
 879  
 880  Create an OpenAPIServiceToFunctions component.
 881  
 882  <a id="openapi_functions.OpenAPIServiceToFunctions.run"></a>
 883  
 884  #### OpenAPIServiceToFunctions.run
 885  
 886  ```python
 887  @component.output_types(functions=list[dict[str, Any]],
 888                          openapi_specs=list[dict[str, Any]])
 889  def run(sources: list[Union[str, Path, ByteStream]]) -> dict[str, Any]
 890  ```
 891  
 892  Converts OpenAPI definitions into OpenAI function calling format.
 893  
 894  **Arguments**:
 895  
 896  - `sources`: File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).
 897  
 898  **Raises**:
 899  
 900  - `RuntimeError`: If the OpenAPI definitions cannot be downloaded or processed.
 901  - `ValueError`: If the source type is not recognized or no functions are found in the OpenAPI definitions.
 902  
 903  **Returns**:
 904  
 905  A dictionary with the following keys:
 906  - `functions`: Function definitions in JSON object format
 907  - `openapi_specs`: OpenAPI specs in JSON/YAML object format with resolved references
 908  
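To make the shape of one `functions` entry more concrete, here is a simplified sketch of how a single OpenAPI operation could map to an OpenAI-style function definition (`name`, `description`, and a JSON Schema under `parameters`). `operation_to_function` is a hypothetical helper written for illustration. It only handles `parameters`, ignores `requestBody` and `$ref` resolution, and may differ from what the component actually emits.

```python
from typing import Any


def operation_to_function(operation: dict[str, Any]) -> dict[str, Any]:
    """Map one OpenAPI operation to an OpenAI-style function definition (simplified)."""
    properties: dict[str, Any] = {}
    required: list[str] = []
    for param in operation.get("parameters", []):
        # Copy the parameter schema and carry over its description, if any.
        schema = dict(param.get("schema", {}))
        if "description" in param:
            schema["description"] = param["description"]
        properties[param["name"]] = schema
        if param.get("required"):
            required.append(param["name"])
    return {
        "name": operation["operationId"],
        "description": operation.get("description", ""),
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


# Illustrative operation, not taken from a real service.
operation = {
    "operationId": "search",
    "description": "Search the web.",
    "parameters": [
        {
            "name": "q",
            "in": "query",
            "required": True,
            "description": "Query string",
            "schema": {"type": "string"},
        }
    ],
}

function = operation_to_function(operation)
print(function["name"], function["parameters"]["required"])
# search ['q']
```
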
 909  <a id="output_adapter"></a>
 910  
 911  # Module output\_adapter
 912  
 913  <a id="output_adapter.OutputAdaptationException"></a>
 914  
 915  ## OutputAdaptationException
 916  
 917  Exception raised when there is an error during output adaptation.
 918  
 919  <a id="output_adapter.OutputAdapter"></a>
 920  
 921  ## OutputAdapter
 922  
 923  Adapts output of a Component using Jinja templates.
 924  
 925  Usage example:
 926  ```python
 927  from haystack import Document
 928  from haystack.components.converters import OutputAdapter
 929  
 930  adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
 932  result = adapter.run(documents=documents)
 933  
 934  assert result["output"] == "Test content"
 935  ```
 936  
 937  <a id="output_adapter.OutputAdapter.__init__"></a>
 938  
 939  #### OutputAdapter.\_\_init\_\_
 940  
 941  ```python
 942  def __init__(template: str,
 943               output_type: TypeAlias,
 944               custom_filters: Optional[dict[str, Callable]] = None,
 945               unsafe: bool = False)
 946  ```
 947  
 948  Create an OutputAdapter component.
 949  
 950  **Arguments**:
 951  
 952  - `template`: A Jinja template that defines how to adapt the input data.
 953  The variables in the template define the input of this instance.
For example, with this template:
 956  ```
 957  {{ documents[0].content }}
 958  ```
 959  The Component input will be `documents`.
 960  - `output_type`: The type of output this instance will return.
 961  - `custom_filters`: A dictionary of custom Jinja filters used in the template.
 962  - `unsafe`: Enable execution of arbitrary code in the Jinja template.
This should only be used if you trust the source of the template, as it can lead to remote code execution.
 964  
 965  <a id="output_adapter.OutputAdapter.run"></a>
 966  
 967  #### OutputAdapter.run
 968  
 969  ```python
 970  def run(**kwargs)
 971  ```
 972  
 973  Renders the Jinja template with the provided inputs.
 974  
 975  **Arguments**:
 976  
 977  - `kwargs`: Must contain all variables used in the `template` string.
 978  
 979  **Raises**:
 980  
 981  - `OutputAdaptationException`: If template rendering fails.
 982  
 983  **Returns**:
 984  
 985  A dictionary with the following keys:
 986  - `output`: Rendered Jinja template.
 987  
 988  <a id="output_adapter.OutputAdapter.to_dict"></a>
 989  
 990  #### OutputAdapter.to\_dict
 991  
 992  ```python
 993  def to_dict() -> dict[str, Any]
 994  ```
 995  
 996  Serializes the component to a dictionary.
 997  
 998  **Returns**:
 999  
1000  Dictionary with serialized data.
1001  
1002  <a id="output_adapter.OutputAdapter.from_dict"></a>
1003  
1004  #### OutputAdapter.from\_dict
1005  
1006  ```python
1007  @classmethod
1008  def from_dict(cls, data: dict[str, Any]) -> "OutputAdapter"
1009  ```
1010  
1011  Deserializes the component from a dictionary.
1012  
1013  **Arguments**:
1014  
1015  - `data`: The dictionary to deserialize from.
1016  
1017  **Returns**:
1018  
1019  The deserialized component.
1020  
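The `to_dict` / `from_dict` pair is meant to round-trip a component through a plain dictionary. The sketch below shows the idea with a hypothetical stand-in class, `PrefixAdapter`. The `type` and `init_parameters` keys are assumptions of this sketch about the dictionary layout, not a statement of what `OutputAdapter.to_dict` returns.

```python
from typing import Any


class PrefixAdapter:
    """Hypothetical stand-in for a serializable component."""

    def __init__(self, prefix: str = "", strip: bool = True) -> None:
        self.prefix = prefix
        self.strip = strip

    def to_dict(self) -> dict[str, Any]:
        # Store only what is needed to call __init__ again.
        # The key names are assumptions made for this sketch.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"prefix": self.prefix, "strip": self.strip},
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "PrefixAdapter":
        # Rebuild the instance from the stored constructor arguments.
        return cls(**data["init_parameters"])


original = PrefixAdapter(prefix=">> ", strip=False)
restored = PrefixAdapter.from_dict(original.to_dict())
print(restored.prefix == original.prefix, restored.strip == original.strip)
# True True
```
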
1021  <a id="pdfminer"></a>
1022  
1023  # Module pdfminer
1024  
1025  <a id="pdfminer.CID_PATTERN"></a>
1026  
1027  #### CID\_PATTERN
1028  
Regular expression pattern used to detect undecoded CID characters.
1030  
1031  <a id="pdfminer.PDFMinerToDocument"></a>
1032  
1033  ## PDFMinerToDocument
1034  
1035  Converts PDF files to Documents.
1036  
Uses `pdfminer`-compatible converters to convert PDF files to Documents. See the [pdfminer.six documentation](https://pdfminersix.readthedocs.io/en/latest/).
1038  
1039  Usage example:
1040  ```python
from datetime import datetime

from haystack.components.converters.pdfminer import PDFMinerToDocument
1042  
1043  converter = PDFMinerToDocument()
1044  results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
1045  documents = results["documents"]
1046  print(documents[0].content)
1047  # 'This is a text from the PDF file.'
1048  ```
1049  
1050  <a id="pdfminer.PDFMinerToDocument.__init__"></a>
1051  
1052  #### PDFMinerToDocument.\_\_init\_\_
1053  
1054  ```python
1055  def __init__(line_overlap: float = 0.5,
1056               char_margin: float = 2.0,
1057               line_margin: float = 0.5,
1058               word_margin: float = 0.1,
1059               boxes_flow: Optional[float] = 0.5,
1060               detect_vertical: bool = True,
1061               all_texts: bool = False,
1062               store_full_path: bool = False) -> None
1063  ```
1064  
1065  Create a PDFMinerToDocument component.
1066  
1067  **Arguments**:
1068  
1069  - `line_overlap`: This parameter determines whether two characters are considered to be on
1070  the same line based on the amount of overlap between them.
1071  The overlap is calculated relative to the minimum height of both characters.
1072  - `char_margin`: Determines whether two characters are part of the same line based on the distance between them.
1073  If the distance is less than the margin specified, the characters are considered to be on the same line.
1074  The margin is calculated relative to the width of the character.
1075  - `word_margin`: Determines whether two characters on the same line are part of the same word
1076  based on the distance between them. If the distance is greater than the margin specified,
1077  an intermediate space will be added between them to make the text more readable.
1078  The margin is calculated relative to the width of the character.
1079  - `line_margin`: This parameter determines whether two lines are part of the same paragraph based on
1080  the distance between them. If the distance is less than the margin specified,
1081  the lines are considered to be part of the same paragraph.
1082  The margin is calculated relative to the height of a line.
1083  - `boxes_flow`: This parameter determines the importance of horizontal and vertical position when
1084  determining the order of text boxes. A value between -1.0 and +1.0 can be set,
1085  with -1.0 indicating that only horizontal position matters and +1.0 indicating
1086  that only vertical position matters. Setting the value to 'None' will disable advanced
1087  layout analysis, and text boxes will be ordered based on the position of their bottom left corner.
1088  - `detect_vertical`: This parameter determines whether vertical text should be considered during layout analysis.
1089  - `all_texts`: If layout analysis should be performed on text in figures.
1090  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
1091  If False, only the file name is stored.
1092  
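The role of `word_margin` can be illustrated with a toy example. The sketch below is a deliberately simplified model, not pdfminer's layout algorithm: `Char` and `join_line` are hypothetical names, characters are reduced to their horizontal extent, and the margin is taken relative to the width of the character that follows the gap.

```python
from typing import NamedTuple


class Char(NamedTuple):
    """A character with its horizontal extent (hypothetical, simplified)."""

    text: str
    x0: float  # left edge
    x1: float  # right edge


def join_line(chars: list[Char], word_margin: float = 0.1) -> str:
    """Join the characters of one line, adding a space wherever the gap is wide enough."""
    pieces: list[str] = []
    previous = None
    for char in chars:
        if previous is not None:
            gap = char.x0 - previous.x1
            width = char.x1 - char.x0
            # A gap larger than word_margin times the character width starts a new word.
            if gap > word_margin * width:
                pieces.append(" ")
        pieces.append(char.text)
        previous = char
    return "".join(pieces)


line = [Char("H", 0, 10), Char("i", 10, 14), Char("y", 20, 30), Char("o", 30, 40)]
print(join_line(line, word_margin=0.1))  # Hi yo
print(join_line(line, word_margin=1.0))  # Hiyo
```
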
1093  <a id="pdfminer.PDFMinerToDocument.detect_undecoded_cid_characters"></a>
1094  
1095  #### PDFMinerToDocument.detect\_undecoded\_cid\_characters
1096  
1097  ```python
1098  def detect_undecoded_cid_characters(text: str) -> dict[str, Any]
1099  ```
1100  
Looks for undecoded CID character sequences, that is, characters that haven't been properly decoded from their CID format.
1102  
1103  This is useful to detect if the text extractor is not able to extract the text correctly, e.g. if the PDF uses
1104  non-standard fonts.
1105  
1106  A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like
1107  searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor
1108  needs. If that map is not available the text extractor cannot decode the CID characters and will return them
1109  as is.
1110  
See the [pdfminer.six FAQ](https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output).

**Arguments**:

- `text`: The text to check for undecoded CID characters.

**Returns**:

A dictionary containing detection results.
1116  
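A rough sketch of such a check follows. It assumes that undecoded characters appear in the extracted text as `(cid:<number>)`, as described in the pdfminer.six FAQ. The function name `cid_report` and the keys of the returned dictionary are assumptions of this sketch and may not match what `detect_undecoded_cid_characters` returns.

```python
import re
from typing import Any

# Assumed form of an undecoded character in the extracted text: "(cid:<number>)".
CID_RE = re.compile(r"\(cid:\d+\)")


def cid_report(text: str) -> dict[str, Any]:
    """Count how much of the text consists of undecoded CID sequences (hypothetical helper)."""
    matches = CID_RE.findall(text)
    cid_chars = sum(len(match) for match in matches)
    total_chars = len(text)
    percentage = round(cid_chars / total_chars * 100, 2) if total_chars else 0.0
    # The key names below are assumptions made for this sketch.
    return {"total_chars": total_chars, "cid_chars": cid_chars, "percentage": percentage}


report = cid_report("Hello (cid:12)(cid:7) world")
print(report)
# {'total_chars': 27, 'cid_chars': 15, 'percentage': 55.56}
```
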
1117  
1118  <a id="pdfminer.PDFMinerToDocument.run"></a>
1119  
1120  #### PDFMinerToDocument.run
1121  
1122  ```python
1123  @component.output_types(documents=list[Document])
1124  def run(sources: list[Union[str, Path, ByteStream]],
1125          meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
1126  ```
1127  
1128  Converts PDF files to Documents.
1129  
1130  **Arguments**:
1131  
1132  - `sources`: List of PDF file paths or ByteStream objects.
1133  - `meta`: Optional metadata to attach to the Documents.
1134  This value can be either a list of dictionaries or a single dictionary.
1135  If it's a single dictionary, its content is added to the metadata of all produced Documents.
1136  If it's a list, the length of the list must match the number of sources, because the two lists will
1137  be zipped.
1138  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
1139  
1140  **Returns**:
1141  
1142  A dictionary with the following keys:
1143  - `documents`: Created Documents
1144  
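The rules for `meta` described above (one dictionary shared by all sources, or one dictionary per source, zipped in order) can be sketched as follows. `expand_meta` is a hypothetical helper used only to illustrate the behavior. It is not the function the converters call internally.

```python
from typing import Any, Optional, Union

Meta = Optional[Union[dict[str, Any], list[dict[str, Any]]]]


def expand_meta(meta: Meta, sources_count: int) -> list[dict[str, Any]]:
    """Return one metadata dictionary per source (hypothetical helper)."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the meta list must match the number of sources.")
    return [dict(item) for item in meta]


sources = ["a.pdf", "b.pdf"]
for source, source_meta in zip(sources, expand_meta({"lang": "en"}, len(sources))):
    print(source, source_meta)
# a.pdf {'lang': 'en'}
# b.pdf {'lang': 'en'}
```
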
1145  <a id="pptx"></a>
1146  
1147  # Module pptx
1148  
1149  <a id="pptx.PPTXToDocument"></a>
1150  
1151  ## PPTXToDocument
1152  
1153  Converts PPTX files to Documents.
1154  
1155  Usage example:
1156  ```python
from datetime import datetime

from haystack.components.converters.pptx import PPTXToDocument
1158  
1159  converter = PPTXToDocument()
1160  results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
1161  documents = results["documents"]
1162  print(documents[0].content)
1163  # 'This is the text from the PPTX file.'
1164  ```
1165  
1166  <a id="pptx.PPTXToDocument.__init__"></a>
1167  
1168  #### PPTXToDocument.\_\_init\_\_
1169  
1170  ```python
1171  def __init__(store_full_path: bool = False)
1172  ```
1173  
Create a PPTXToDocument component.
1175  
1176  **Arguments**:
1177  
1178  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
1179  If False, only the file name is stored.
1180  
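The effect of `store_full_path` can be pictured with `pathlib`. This is a sketch under assumptions: `file_path_meta` is a hypothetical helper, and `file_path` as the metadata key is assumed here for illustration.

```python
from pathlib import Path


def file_path_meta(source: str, store_full_path: bool = False) -> dict[str, str]:
    """Build the path-related metadata for one source (hypothetical helper)."""
    path = Path(source)
    # "file_path" as the metadata key is an assumption of this sketch.
    return {"file_path": str(path) if store_full_path else path.name}


print(file_path_meta("reports/2024/sample.pptx"))
# {'file_path': 'sample.pptx'}
```
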
1181  <a id="pptx.PPTXToDocument.run"></a>
1182  
1183  #### PPTXToDocument.run
1184  
1185  ```python
1186  @component.output_types(documents=list[Document])
1187  def run(sources: list[Union[str, Path, ByteStream]],
1188          meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
1189  ```
1190  
1191  Converts PPTX files to Documents.
1192  
1193  **Arguments**:
1194  
1195  - `sources`: List of file paths or ByteStream objects.
1196  - `meta`: Optional metadata to attach to the Documents.
1197  This value can be either a list of dictionaries or a single dictionary.
1198  If it's a single dictionary, its content is added to the metadata of all produced Documents.
1199  If it's a list, the length of the list must match the number of sources, because the two lists will
1200  be zipped.
1201  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
1202  
1203  **Returns**:
1204  
1205  A dictionary with the following keys:
1206  - `documents`: Created Documents
1207  
1208  <a id="pypdf"></a>
1209  
1210  # Module pypdf
1211  
1212  <a id="pypdf.PyPDFExtractionMode"></a>
1213  
1214  ## PyPDFExtractionMode
1215  
1216  The mode to use for extracting text from a PDF.
1217  
1218  <a id="pypdf.PyPDFExtractionMode.__str__"></a>
1219  
1220  #### PyPDFExtractionMode.\_\_str\_\_
1221  
1222  ```python
1223  def __str__() -> str
1224  ```
1225  
1226  Convert a PyPDFExtractionMode enum to a string.
1227  
1228  <a id="pypdf.PyPDFExtractionMode.from_str"></a>
1229  
1230  #### PyPDFExtractionMode.from\_str
1231  
1232  ```python
1233  @staticmethod
1234  def from_str(string: str) -> "PyPDFExtractionMode"
1235  ```
1236  
1237  Convert a string to a PyPDFExtractionMode enum.
1238  
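The `__str__` / `from_str` pair lets the extraction mode travel as a plain string, for example in a serialized pipeline. A minimal sketch of the pattern, using a hypothetical `ExtractionMode` enum whose string values (`"plain"`, `"layout"`) are assumptions of this example:

```python
from enum import Enum


class ExtractionMode(Enum):
    """Hypothetical enum mirroring the PLAIN and LAYOUT modes."""

    PLAIN = "plain"
    LAYOUT = "layout"

    def __str__(self) -> str:
        # Convert the enum member to its string value.
        return self.value

    @staticmethod
    def from_str(string: str) -> "ExtractionMode":
        # Convert a string back to the matching enum member.
        for mode in ExtractionMode:
            if mode.value == string:
                return mode
        valid = ", ".join(mode.value for mode in ExtractionMode)
        raise ValueError(f"Unknown extraction mode {string!r}. Supported modes: {valid}")


mode = ExtractionMode.from_str("layout")
print(mode is ExtractionMode.LAYOUT, str(mode))
# True layout
```
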
1239  <a id="pypdf.PyPDFToDocument"></a>
1240  
1241  ## PyPDFToDocument
1242  
1243  Converts PDF files to documents your pipeline can query.
1244  
1245  This component uses the PyPDF library.
1246  You can attach metadata to the resulting documents.
1247  
1248  ### Usage example
1249  
1250  ```python
from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument
1252  
1253  converter = PyPDFToDocument()
1254  results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
1255  documents = results["documents"]
1256  print(documents[0].content)
1257  # 'This is a text from the PDF file.'
1258  ```
1259  
1260  <a id="pypdf.PyPDFToDocument.__init__"></a>
1261  
1262  #### PyPDFToDocument.\_\_init\_\_
1263  
1264  ```python
1265  def __init__(*,
1266               extraction_mode: Union[
1267                   str, PyPDFExtractionMode] = PyPDFExtractionMode.PLAIN,
1268               plain_mode_orientations: tuple = (0, 90, 180, 270),
1269               plain_mode_space_width: float = 200.0,
1270               layout_mode_space_vertically: bool = True,
1271               layout_mode_scale_weight: float = 1.25,
1272               layout_mode_strip_rotated: bool = True,
1273               layout_mode_font_height_weight: float = 1.0,
1274               store_full_path: bool = False)
1275  ```
1276  
Create a PyPDFToDocument component.
1278  
1279  **Arguments**:
1280  
1281  - `extraction_mode`: The mode to use for extracting text from a PDF.
1282  Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
1283  - `plain_mode_orientations`: Tuple of orientations to look for when extracting text from a PDF in plain mode.
1284  Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
1285  - `plain_mode_space_width`: Forces default space width if not extracted from font.
1286  Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
1287  - `layout_mode_space_vertically`: Whether to include blank lines inferred from y distance + font height.
1288  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
1289  - `layout_mode_scale_weight`: Multiplier for string length when calculating weighted average character width.
1290  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
1291  - `layout_mode_strip_rotated`: Layout mode does not support rotated text. Set to `False` to include rotated text anyway.
1292  If rotated text is discovered, layout will be degraded and a warning will be logged.
1293  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
1294  - `layout_mode_font_height_weight`: Multiplier for font height when calculating blank line height.
1295  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
1296  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
1297  If False, only the file name is stored.
1298  
1299  <a id="pypdf.PyPDFToDocument.to_dict"></a>
1300  
1301  #### PyPDFToDocument.to\_dict
1302  
1303  ```python
1304  def to_dict()
1305  ```
1306  
1307  Serializes the component to a dictionary.
1308  
1309  **Returns**:
1310  
1311  Dictionary with serialized data.
1312  
1313  <a id="pypdf.PyPDFToDocument.from_dict"></a>
1314  
1315  #### PyPDFToDocument.from\_dict
1316  
1317  ```python
1318  @classmethod
1319  def from_dict(cls, data)
1320  ```
1321  
1322  Deserializes the component from a dictionary.
1323  
1324  **Arguments**:
1325  
1326  - `data`: Dictionary with serialized data.
1327  
1328  **Returns**:
1329  
1330  Deserialized component.
1331  
1332  <a id="pypdf.PyPDFToDocument.run"></a>
1333  
1334  #### PyPDFToDocument.run
1335  
1336  ```python
1337  @component.output_types(documents=list[Document])
1338  def run(sources: list[Union[str, Path, ByteStream]],
1339          meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
1340  ```
1341  
1342  Converts PDF files to documents.
1343  
1344  **Arguments**:
1345  
1346  - `sources`: List of file paths or ByteStream objects to convert.
1347  - `meta`: Optional metadata to attach to the documents.
1348  This value can be a list of dictionaries or a single dictionary.
1349  If it's a single dictionary, its content is added to the metadata of all produced documents.
1350  If it's a list, its length must match the number of sources, as they are zipped together.
1351  For ByteStream objects, their `meta` is added to the output documents.
1352  
1353  **Returns**:
1354  
1355  A dictionary with the following keys:
1356  - `documents`: A list of converted documents.
1357  
1358  <a id="tika"></a>
1359  
1360  # Module tika
1361  
1362  <a id="tika.XHTMLParser"></a>
1363  
1364  ## XHTMLParser
1365  
1366  Custom parser to extract pages from Tika XHTML content.
1367  
1368  <a id="tika.XHTMLParser.handle_starttag"></a>
1369  
1370  #### XHTMLParser.handle\_starttag
1371  
1372  ```python
1373  def handle_starttag(tag: str, attrs: list[tuple])
1374  ```
1375  
1376  Identify the start of a page div.
1377  
1378  <a id="tika.XHTMLParser.handle_endtag"></a>
1379  
1380  #### XHTMLParser.handle\_endtag
1381  
1382  ```python
1383  def handle_endtag(tag: str)
1384  ```
1385  
1386  Identify the end of a page div.
1387  
1388  <a id="tika.XHTMLParser.handle_data"></a>
1389  
1390  #### XHTMLParser.handle\_data
1391  
1392  ```python
1393  def handle_data(data: str)
1394  ```
1395  
1396  Populate the page content.
1397  
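The three handlers above follow the standard `html.parser.HTMLParser` callback pattern. The sketch below shows how they can cooperate to split XHTML into pages. `PageExtractor` is a hypothetical class, and it assumes that each page is wrapped in a `<div class="page">` element that contains no nested `<div>` elements.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the text of every <div class="page"> element (simplified sketch)."""

    def __init__(self) -> None:
        super().__init__()
        self.in_page = False
        self.buffer: list[str] = []
        self.pages: list[str] = []

    def handle_starttag(self, tag, attrs):
        # A page starts at <div class="page">.
        if tag == "div" and dict(attrs).get("class") == "page":
            self.in_page = True
            self.buffer = []

    def handle_endtag(self, tag):
        # The first closing </div> ends the page (nested divs are not handled).
        if tag == "div" and self.in_page:
            self.pages.append("".join(self.buffer).strip())
            self.in_page = False

    def handle_data(self, data):
        # Only text inside a page is kept.
        if self.in_page:
            self.buffer.append(data)


xhtml = (
    "<html><body>"
    '<div class="page"><p>First page</p></div>'
    '<div class="page"><p>Second page</p></div>'
    "</body></html>"
)
parser = PageExtractor()
parser.feed(xhtml)
parser.close()
print(parser.pages)
# ['First page', 'Second page']
```
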
1398  <a id="tika.TikaDocumentConverter"></a>
1399  
1400  ## TikaDocumentConverter
1401  
1402  Converts files of different types to Documents.
1403  
1404  This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
1405  requires a running Tika server.
1406  For more options on running Tika,
1407  see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).
1408  
1409  Usage example:
1410  ```python
from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter
1412  
1413  converter = TikaDocumentConverter()
1414  results = converter.run(
1415      sources=["sample.docx", "my_document.rtf", "archive.zip"],
1416      meta={"date_added": datetime.now().isoformat()}
1417  )
1418  documents = results["documents"]
1419  print(documents[0].content)
1420  # 'This is a text from the docx file.'
1421  ```
1422  
1423  <a id="tika.TikaDocumentConverter.__init__"></a>
1424  
1425  #### TikaDocumentConverter.\_\_init\_\_
1426  
1427  ```python
1428  def __init__(tika_url: str = "http://localhost:9998/tika",
1429               store_full_path: bool = False)
1430  ```
1431  
1432  Create a TikaDocumentConverter component.
1433  
1434  **Arguments**:
1435  
1436  - `tika_url`: Tika server URL.
1437  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
1438  If False, only the file name is stored.
1439  
1440  <a id="tika.TikaDocumentConverter.run"></a>
1441  
1442  #### TikaDocumentConverter.run
1443  
1444  ```python
1445  @component.output_types(documents=list[Document])
1446  def run(sources: list[Union[str, Path, ByteStream]],
1447          meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
1448  ```
1449  
1450  Converts files to Documents.
1451  
1452  **Arguments**:
1453  
- `sources`: List of file paths or ByteStream objects.
1455  - `meta`: Optional metadata to attach to the Documents.
1456  This value can be either a list of dictionaries or a single dictionary.
1457  If it's a single dictionary, its content is added to the metadata of all produced Documents.
1458  If it's a list, the length of the list must match the number of sources, because the two lists will
1459  be zipped.
1460  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
1461  
1462  **Returns**:
1463  
1464  A dictionary with the following keys:
1465  - `documents`: Created Documents
1466  
1467  <a id="txt"></a>
1468  
1469  # Module txt
1470  
1471  <a id="txt.TextFileToDocument"></a>
1472  
1473  ## TextFileToDocument
1474  
1475  Converts text files to documents your pipeline can query.
1476  
By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
1479  It can attach metadata to the resulting documents.
1480  
1481  ### Usage example
1482  
1483  ```python
1484  from haystack.components.converters.txt import TextFileToDocument
1485  
1486  converter = TextFileToDocument()
1487  results = converter.run(sources=["sample.txt"])
1488  documents = results["documents"]
1489  print(documents[0].content)
1490  # 'This is the content from the txt file.'
1491  ```
1492  
1493  <a id="txt.TextFileToDocument.__init__"></a>
1494  
1495  #### TextFileToDocument.\_\_init\_\_
1496  
1497  ```python
1498  def __init__(encoding: str = "utf-8", store_full_path: bool = False)
1499  ```
1500  
1501  Creates a TextFileToDocument component.
1502  
1503  **Arguments**:
1504  
1505  - `encoding`: The encoding of the text files to convert.
1506  If the encoding is specified in the metadata of a source ByteStream,
1507  it overrides this value.
1508  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
1509  If False, only the file name is stored.
1510  
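The precedence described for `encoding` (an encoding found in the source's metadata wins over the component default) can be sketched in a few lines. `decode_source` is a hypothetical helper, and the `encoding` metadata key is taken from the description above.

```python
from typing import Any


def decode_source(data: bytes, meta: dict[str, Any], default_encoding: str = "utf-8") -> str:
    """Decode bytes, preferring an encoding declared in the source metadata (hypothetical helper)."""
    encoding = meta.get("encoding", default_encoding)
    return data.decode(encoding)


latin1_bytes = "café".encode("latin-1")
print(decode_source(latin1_bytes, {"encoding": "latin-1"}))
# café
```
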
1511  <a id="txt.TextFileToDocument.run"></a>
1512  
1513  #### TextFileToDocument.run
1514  
1515  ```python
1516  @component.output_types(documents=list[Document])
1517  def run(sources: list[Union[str, Path, ByteStream]],
1518          meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
1519  ```
1520  
1521  Converts text files to documents.
1522  
1523  **Arguments**:
1524  
1525  - `sources`: List of text file paths or ByteStream objects to convert.
1526  - `meta`: Optional metadata to attach to the documents.
1527  This value can be a list of dictionaries or a single dictionary.
1528  If it's a single dictionary, its content is added to the metadata of all produced documents.
1529  If it's a list, its length must match the number of sources as they're zipped together.
1530  For ByteStream objects, their `meta` is added to the output documents.
1531  
1532  **Returns**:
1533  
1534  A dictionary with the following keys:
1535  - `documents`: A list of converted documents.
1536  
1537  <a id="xlsx"></a>
1538  
1539  # Module xlsx
1540  
1541  <a id="xlsx.XLSXToDocument"></a>
1542  
1543  ## XLSXToDocument
1544  
1545  Converts XLSX (Excel) files into Documents.
1546  
Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is
created for each sheet. The content of the Document is the table, which can be saved in CSV or Markdown format.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.xlsx import XLSXToDocument

converter = XLSXToDocument()
results = converter.run(sources=["sample.xlsx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# ",A,B\n1,col_a,col_b\n2,1.5,test\n"
```
1564  
1565  <a id="xlsx.XLSXToDocument.__init__"></a>
1566  
1567  #### XLSXToDocument.\_\_init\_\_
1568  
1569  ```python
1570  def __init__(table_format: Literal["csv", "markdown"] = "csv",
1571               sheet_name: Union[str, int, list[Union[str, int]], None] = None,
1572               read_excel_kwargs: Optional[dict[str, Any]] = None,
1573               table_format_kwargs: Optional[dict[str, Any]] = None,
1574               *,
1575               store_full_path: bool = False)
1576  ```
1577  
Creates an XLSXToDocument component.
1579  
1580  **Arguments**:
1581  
1582  - `table_format`: The format to convert the Excel file to.
1583  - `sheet_name`: The name of the sheet to read. If None, all sheets are read.
1584  - `read_excel_kwargs`: Additional arguments to pass to `pandas.read_excel`.
1585  See https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel
- `table_format_kwargs`: Additional keyword arguments to pass to the table format function.
  - If `table_format` is "csv", these arguments are passed to `pandas.DataFrame.to_csv`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
  - If `table_format` is "markdown", these arguments are passed to `pandas.DataFrame.to_markdown`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
1591  - `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
1592  If False, only the file name is stored.
1593  
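To illustrate the difference between the two `table_format` values, the sketch below renders the same rows as CSV and as a Markdown table using only the standard library. `render_table` is a hypothetical helper. The component itself delegates to pandas, so its real output (for example the index column or column alignment) will differ from this simplified version.

```python
import csv
import io


def render_table(rows: list[list[str]], table_format: str = "csv") -> str:
    """Render rows (header first) as CSV or Markdown text (hypothetical helper)."""
    header, *body = rows
    if table_format == "csv":
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        return buffer.getvalue()
    if table_format == "markdown":
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines) + "\n"
    raise ValueError(f"Unsupported table format: {table_format!r}")


rows = [["col_a", "col_b"], ["1.5", "test"]]
print(render_table(rows, "csv"))
print(render_table(rows, "markdown"))
```
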
1594  <a id="xlsx.XLSXToDocument.run"></a>
1595  
1596  #### XLSXToDocument.run
1597  
1598  ```python
1599  @component.output_types(documents=list[Document])
1600  def run(
1601      sources: list[Union[str, Path, ByteStream]],
1602      meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
1603  ) -> dict[str, list[Document]]
1604  ```
1605  
Converts XLSX files to Documents.
1607  
1608  **Arguments**:
1609  
1610  - `sources`: List of file paths or ByteStream objects.
1611  - `meta`: Optional metadata to attach to the documents.
1612  This value can be either a list of dictionaries or a single dictionary.
1613  If it's a single dictionary, its content is added to the metadata of all produced documents.
1614  If it's a list, the length of the list must match the number of sources, because the two lists will
1615  be zipped.
1616  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.
1617  
1618  **Returns**:
1619  
1620  A dictionary with the following keys:
1621  - `documents`: Created documents