---
title: "Converters"
id: converters-api
description: "Various converters to transform data from one format to another."
slug: "/converters-api"
---

<a id="azure"></a>

## Module azure

<a id="azure.AzureOCRDocumentConverter"></a>

### AzureOCRDocumentConverter

Converts files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see the
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

```python
import os
from datetime import datetime

from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(
    endpoint=os.environ["CORE_AZURE_CS_ENDPOINT"],
    api_key=Secret.from_env_var("CORE_AZURE_CS_API_KEY"),
)
results = converter.run(
    sources=["test/test_files/pdf/react_paper.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="azure.AzureOCRDocumentConverter.__init__"></a>

#### AzureOCRDocumentConverter.\_\_init\_\_

```python
def __init__(endpoint: str,
             api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
             model_id: str = "prebuilt-read",
             preceding_context_len: int = 3,
             following_context_len: int = 3,
             merge_multiple_column_headers: bool = True,
             page_layout: Literal["natural", "single_column"] = "natural",
             threshold_y: float | None = 0.05,
             store_full_path: bool = False)
```

Creates an AzureOCRDocumentConverter component.
**Arguments**:

- `endpoint`: The endpoint of your Azure resource.
- `api_key`: The API key of your Azure resource.
- `model_id`: The ID of the model you want to use. For a list of available models, see the
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
- `preceding_context_len`: Number of lines before a table to include as preceding context
(this will be added to the metadata).
- `following_context_len`: Number of lines after a table to include as subsequent context
(this will be added to the metadata).
- `merge_multiple_column_headers`: If `True`, merges multiple column header rows into a single row.
- `page_layout`: The type of reading order to follow. Possible options:
  - `natural`: Uses the natural reading order determined by Azure.
  - `single_column`: Groups all lines with the same height on the page based on a threshold
  determined by `threshold_y`.
- `threshold_y`: Only relevant if `page_layout` is set to `single_column`.
The threshold, in inches, that determines whether two recognized PDF elements are grouped into a
single line. This is crucial for section headers or numbers, which may be spatially separated
from the rest of the text on the horizontal axis.
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.

<a id="azure.AzureOCRDocumentConverter.run"></a>

#### AzureOCRDocumentConverter.run

```python
@component.output_types(documents=list[Document],
                        raw_azure_response=list[dict])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will be
zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: List of created Documents
- `raw_azure_response`: List of raw Azure responses used to create the Documents

<a id="azure.AzureOCRDocumentConverter.to_dict"></a>

#### AzureOCRDocumentConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="azure.AzureOCRDocumentConverter.from_dict"></a>

#### AzureOCRDocumentConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AzureOCRDocumentConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="csv"></a>

## Module csv

<a id="csv.CSVToDocument"></a>

### CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.
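The `meta` handling shared by these converters' `run` methods (a single dict copied into every document, or a per-source list zipped with `sources`) can be sketched with plain dictionaries. `normalize_meta` below is a hypothetical helper for illustration, not a Haystack API:

```python
# Sketch of the shared meta rule: one dict for all sources, or one dict per source.
# `normalize_meta` is a hypothetical helper, not part of Haystack.

def normalize_meta(meta, n_sources):
    """Expand a single dict into one copy per source; validate list length."""
    if meta is None:
        return [{} for _ in range(n_sources)]
    if isinstance(meta, dict):
        # A single dictionary is applied to every produced document.
        return [dict(meta) for _ in range(n_sources)]
    if len(meta) != n_sources:
        # A list is zipped with `sources`, so the lengths must match.
        raise ValueError("meta list length must match the number of sources")
    return [dict(m) for m in meta]

sources = ["a.csv", "b.csv"]
print(normalize_meta({"batch": 1}, len(sources)))
# [{'batch': 1}, {'batch': 1}]
print(normalize_meta([{"page": 1}, {"page": 2}], len(sources)))
# [{'page': 1}, {'page': 2}]
```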
### Usage example

```python
from datetime import datetime

from haystack.components.converters.csv import CSVToDocument

converter = CSVToDocument()
results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'col1,col2\nrow1,row1\nrow2,row2\n'
```

<a id="csv.CSVToDocument.__init__"></a>

#### CSVToDocument.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8",
             store_full_path: bool = False,
             *,
             conversion_mode: Literal["file", "row"] = "file",
             delimiter: str = ",",
             quotechar: str = '"')
```

Creates a CSVToDocument component.

**Arguments**:

- `encoding`: The encoding of the CSV files to convert.
If the encoding is specified in the metadata of a source ByteStream,
it overrides this value.
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.
- `conversion_mode`: The conversion mode to use:
  - "file" (default): one Document per CSV file whose content is the raw CSV text.
  - "row": convert each CSV row to its own Document (requires `content_column` in `run()`).
- `delimiter`: CSV delimiter used when parsing in row mode (passed to `csv.DictReader`).
- `quotechar`: CSV quote character used when parsing in row mode (passed to `csv.DictReader`).

<a id="csv.CSVToDocument.run"></a>

#### CSVToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        *,
        content_column: str | None = None,
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts CSV files to a Document (file mode) or to one Document per row (row mode).

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `content_column`: **Required when** `conversion_mode="row"`.
The column name whose values become `Document.content` for each row.
The column must exist in the CSV header.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created documents

<a id="docx"></a>

## Module docx

<a id="docx.DOCXMetadata"></a>

### DOCXMetadata

Describes the metadata of a DOCX file.

**Arguments**:

- `author`: The author
- `category`: The category
- `comments`: The comments
- `content_status`: The content status
- `created`: The creation date (ISO-formatted string)
- `identifier`: The identifier
- `keywords`: Available keywords
- `language`: The language of the document
- `last_modified_by`: User who last modified the document
- `last_printed`: The last printed date (ISO-formatted string)
- `modified`: The last modification date (ISO-formatted string)
- `revision`: The revision number
- `subject`: The subject
- `title`: The title
- `version`: The version

<a id="docx.DOCXTableFormat"></a>

### DOCXTableFormat

Supported formats for storing DOCX tabular data in a Document.

<a id="docx.DOCXTableFormat.from_str"></a>

#### DOCXTableFormat.from\_str

```python
@staticmethod
def from_str(string: str) -> "DOCXTableFormat"
```

Convert a string to a DOCXTableFormat enum.
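String-to-enum helpers like `from_str` typically follow a standard lookup-by-value pattern. The sketch below is a generic illustration using a hypothetical `TableFormat` enum, not Haystack's actual implementation:

```python
from enum import Enum

class TableFormat(Enum):  # hypothetical stand-in for DOCXTableFormat
    MARKDOWN = "markdown"
    CSV = "csv"

    @staticmethod
    def from_str(string: str) -> "TableFormat":
        # Look up the enum member by its value, case-insensitively.
        try:
            return TableFormat(string.lower())
        except ValueError:
            allowed = [f.value for f in TableFormat]
            raise ValueError(f"Unknown table format '{string}', expected one of {allowed}")

print(TableFormat.from_str("CSV"))
# TableFormat.CSV
```

Accepting either the enum member or its string form (as `DOCXToDocument.__init__` does) then reduces to calling `from_str` only when a plain string is passed.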
<a id="docx.DOCXLinkFormat"></a>

### DOCXLinkFormat

Supported formats for storing DOCX link information in a Document.

<a id="docx.DOCXLinkFormat.from_str"></a>

#### DOCXLinkFormat.from\_str

```python
@staticmethod
def from_str(string: str) -> "DOCXLinkFormat"
```

Convert a string to a DOCXLinkFormat enum.

<a id="docx.DOCXToDocument"></a>

### DOCXToDocument

Converts DOCX files to Documents.

Uses the `python-docx` library to convert the DOCX file to a document.
This component does not preserve page breaks in the original document.

Usage example:
```python
from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'
```

<a id="docx.DOCXToDocument.__init__"></a>

#### DOCXToDocument.\_\_init\_\_

```python
def __init__(table_format: str | DOCXTableFormat = DOCXTableFormat.CSV,
             link_format: str | DOCXLinkFormat = DOCXLinkFormat.NONE,
             store_full_path: bool = False)
```

Create a DOCXToDocument component.

**Arguments**:

- `table_format`: The format for table output. Can be either DOCXTableFormat.MARKDOWN,
DOCXTableFormat.CSV, "markdown", or "csv".
- `link_format`: The format for link output. Can be either:
DOCXLinkFormat.MARKDOWN or "markdown" to get `[text](address)`,
DOCXLinkFormat.PLAIN or "plain" to get `text (address)`,
DOCXLinkFormat.NONE or "none" to get the text without links.
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.
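The three `link_format` options map a link's text and address as listed above. `render_link` here is a hypothetical helper written only to illustrate the three output styles:

```python
def render_link(text: str, address: str, link_format: str) -> str:
    """Illustrates the three DOCXLinkFormat output styles (hypothetical helper)."""
    if link_format == "markdown":
        return f"[{text}]({address})"      # [text](address)
    if link_format == "plain":
        return f"{text} ({address})"       # text (address)
    if link_format == "none":
        return text                        # text only, link dropped
    raise ValueError(f"Unknown link format: {link_format}")

print(render_link("Haystack", "https://haystack.deepset.ai", "markdown"))
# [Haystack](https://haystack.deepset.ai)
```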
<a id="docx.DOCXToDocument.to_dict"></a>

#### DOCXToDocument.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="docx.DOCXToDocument.from_dict"></a>

#### DOCXToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DOCXToDocument"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="docx.DOCXToDocument.run"></a>

#### DOCXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts DOCX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="html"></a>

## Module html

<a id="html.HTMLToDocument"></a>

### HTMLToDocument

Converts an HTML file to a Document.
Usage example:
```python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'
```

<a id="html.HTMLToDocument.__init__"></a>

#### HTMLToDocument.\_\_init\_\_

```python
def __init__(extraction_kwargs: dict[str, Any] | None = None,
             store_full_path: bool = False)
```

Create an HTMLToDocument component.

**Arguments**:

- `extraction_kwargs`: A dictionary containing keyword arguments to customize the extraction process. These
are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see
the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.

<a id="html.HTMLToDocument.to_dict"></a>

#### HTMLToDocument.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="html.HTMLToDocument.from_dict"></a>

#### HTMLToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HTMLToDocument"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.
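The converters in this reference share the same serialization contract: `to_dict` records the component's init parameters and `from_dict` rebuilds an equivalent instance from them. A minimal sketch of that round-trip, using a hypothetical component class rather than Haystack internals:

```python
from typing import Any

class MiniConverter:  # hypothetical stand-in for a Haystack component
    def __init__(self, store_full_path: bool = False):
        self.store_full_path = store_full_path

    def to_dict(self) -> dict[str, Any]:
        # Record the import path and the init parameters needed to rebuild the object.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"store_full_path": self.store_full_path},
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "MiniConverter":
        # Rebuild the component by re-invoking __init__ with the stored parameters.
        return cls(**data["init_parameters"])

restored = MiniConverter.from_dict(MiniConverter(store_full_path=True).to_dict())
print(restored.store_full_path)
# True
```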
<a id="html.HTMLToDocument.run"></a>

#### HTMLToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None,
        extraction_kwargs: dict[str, Any] | None = None)
```

Converts a list of HTML files to Documents.

**Arguments**:

- `sources`: List of HTML file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
- `extraction_kwargs`: Additional keyword arguments to customize the extraction process.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="json"></a>

## Module json

<a id="json.JSONConverter"></a>

### JSONConverter

Converts one or more JSON files into a text document.
### Usage examples

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'
```

Optionally, you can also provide a `jq_schema` string to filter the JSON source files and `extra_meta_fields`
to extract from the filtered data:

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname':
# 'Rita', 'surname': 'Levi-Montalcini'}
```

<a id="json.JSONConverter.__init__"></a>

#### JSONConverter.\_\_init\_\_

```python
def __init__(jq_schema: str | None = None,
             content_key: str | None = None,
             extra_meta_fields: set[str] | Literal["*"] | None = None,
             store_full_path: bool = False)
```

Creates a JSONConverter component.

An optional `jq_schema` can be provided to extract nested data in the JSON source files.
See the [official jq documentation](https://jqlang.github.io/jq/) for more info on the filter syntax.
If `jq_schema` is not set, whole JSON source files will be used to extract content.

Optionally, you can provide a `content_key` to specify which key in the extracted object must
be set as the document's content.

If both `jq_schema` and `content_key` are set, the component will search for the `content_key` in
the JSON object extracted by `jq_schema`. If the extracted data is not a JSON object, it will be skipped.

If only `jq_schema` is set, the extracted data must be a scalar value. If it's a JSON object or array,
it will be skipped.

If only `content_key` is set, the source JSON file must be a JSON object, else it will be skipped.

`extra_meta_fields` can either be set to a set of strings or a literal `"*"` string.
If it's a set of strings, it must specify fields in the extracted objects that must be set in
the extracted documents. If a field is not found, the meta value will be `None`.
If set to `"*"`, all fields other than `content_key` found in the filtered JSON object will
be saved as metadata.

Initialization will fail if neither `jq_schema` nor `content_key` is set.

**Arguments**:

- `jq_schema`: Optional jq filter string to extract content.
If not specified, the whole JSON object will be used to extract information.
- `content_key`: Optional key to extract document content.
If `jq_schema` is specified, the `content_key` will be extracted from that object.
- `extra_meta_fields`: An optional set of meta keys to extract from the content.
If `jq_schema` is specified, all keys will be extracted from that object.
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.

<a id="json.JSONConverter.to_dict"></a>

#### JSONConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="json.JSONConverter.from_dict"></a>

#### JSONConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "JSONConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="json.JSONConverter.run"></a>

#### JSONConverter.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts a list of JSON files to documents.

**Arguments**:

- `sources`: A list of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of created documents.
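What the `jq_schema=".laureates[]"` / `content_key` / `extra_meta_fields` combination produces can be approximated with the standard `json` module. This is a simplified stand-in for the jq-based extraction, not the converter's actual code:

```python
import json

raw = json.dumps({
    "laureates": [
        {"firstname": "Enrico", "surname": "Fermi", "motivation": "slow neutrons"},
        {"firstname": "Rita", "surname": "Levi-Montalcini",
         "motivation": "for their discoveries of growth factors"},
    ]
})

# Roughly what jq_schema=".laureates[]" yields: one object per array element.
objects = json.loads(raw)["laureates"]

# content_key="motivation" selects the content; extra_meta_fields={"firstname",
# "surname"} copies those keys into meta (None if a key is missing).
docs = [
    {"content": obj["motivation"],
     "meta": {k: obj.get(k) for k in ("firstname", "surname")}}
    for obj in objects
]
print(docs[1]["content"])
# for their discoveries of growth factors
```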
<a id="markdown"></a>

## Module markdown

<a id="markdown.MarkdownToDocument"></a>

### MarkdownToDocument

Converts a Markdown file into a text Document.

Usage example:
```python
from datetime import datetime

from haystack.components.converters import MarkdownToDocument

converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'
```

<a id="markdown.MarkdownToDocument.__init__"></a>

#### MarkdownToDocument.\_\_init\_\_

```python
def __init__(table_to_single_line: bool = False,
             progress_bar: bool = True,
             store_full_path: bool = False)
```

Create a MarkdownToDocument component.

**Arguments**:

- `table_to_single_line`: If `True`, converts table contents into a single line.
- `progress_bar`: If `True`, shows a progress bar when running.
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.

<a id="markdown.MarkdownToDocument.run"></a>

#### MarkdownToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts a list of Markdown files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: List of created Documents

<a id="msg"></a>

## Module msg

<a id="msg.MSGToDocument"></a>

### MSGToDocument

Converts Microsoft Outlook .msg files into Haystack Documents.

This component extracts email metadata (such as sender, recipients, CC, BCC, subject) and body content from .msg
files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg
file are extracted as ByteStream objects.

### Example Usage

```python
from datetime import datetime

from haystack.components.converters.msg import MSGToDocument

converter = MSGToDocument()
results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
attachments = results["attachments"]
print(documents[0].content)
```

<a id="msg.MSGToDocument.__init__"></a>

#### MSGToDocument.\_\_init\_\_

```python
def __init__(store_full_path: bool = False) -> None
```

Creates an MSGToDocument component.

**Arguments**:

- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.

<a id="msg.MSGToDocument.run"></a>

#### MSGToDocument.run

```python
@component.output_types(documents=list[Document], attachments=list[ByteStream])
def run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document] | list[ByteStream]]
```

Converts MSG files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents.
- `attachments`: Created ByteStream objects from file attachments.

<a id="multi_file_converter"></a>

## Module multi\_file\_converter

<a id="multi_file_converter.MultiFileConverter"></a>

### MultiFileConverter

A file converter that handles conversion of multiple file types.

The MultiFileConverter handles the following file types:
- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX

Usage example:
```python
from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
```

<a id="multi_file_converter.MultiFileConverter.__init__"></a>

#### MultiFileConverter.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8",
             json_content_key: str = "content") -> None
```

Initialize the MultiFileConverter.

**Arguments**:

- `encoding`: The encoding to use when reading files.
- `json_content_key`: The key to use for the content field of a document when converting JSON files.

<a id="openapi_functions"></a>

## Module openapi\_functions

<a id="openapi_functions.OpenAPIServiceToFunctions"></a>

### OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.
The definition must respect OpenAPI specification 3.0.0 or higher.
It can be specified in JSON or YAML format.
Each function must have:
- a unique operationId
- a description
- a requestBody and/or parameters
- a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the [official documentation](https://github.com/OAI/OpenAPI-Specification).
For more details on OpenAI function calling, see the [official documentation](https://platform.openai.com/docs/guides/function-calling).

Usage example:
```python
from haystack.components.converters import OpenAPIServiceToFunctions

converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]
```

<a id="openapi_functions.OpenAPIServiceToFunctions.__init__"></a>

#### OpenAPIServiceToFunctions.\_\_init\_\_

```python
def __init__()
```

Create an OpenAPIServiceToFunctions component.

<a id="openapi_functions.OpenAPIServiceToFunctions.run"></a>

#### OpenAPIServiceToFunctions.run

```python
@component.output_types(functions=list[dict[str, Any]],
                        openapi_specs=list[dict[str, Any]])
def run(sources: list[str | Path | ByteStream]) -> dict[str, Any]
```

Converts OpenAPI definitions into OpenAI function calling format.

**Arguments**:

- `sources`: File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

**Raises**:

- `RuntimeError`: If the OpenAPI definitions cannot be downloaded or processed.
- `ValueError`: If the source type is not recognized or no functions are found in the OpenAPI definitions.
**Returns**:

A dictionary with the following keys:
- `functions`: Function definitions in JSON object format
- `openapi_specs`: OpenAPI specs in JSON/YAML object format with resolved references

<a id="output_adapter"></a>

## Module output\_adapter

<a id="output_adapter.OutputAdaptationException"></a>

### OutputAdaptationException

Exception raised when there is an error during output adaptation.

<a id="output_adapter.OutputAdapter"></a>

### OutputAdapter

Adapts the output of a component using Jinja templates.

Usage example:
```python
from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"
```

<a id="output_adapter.OutputAdapter.__init__"></a>

#### OutputAdapter.\_\_init\_\_

```python
def __init__(template: str,
             output_type: TypeAlias,
             custom_filters: dict[str, Callable] | None = None,
             unsafe: bool = False) -> None
```

Create an OutputAdapter component.

**Arguments**:

- `template`: A Jinja template that defines how to adapt the input data.
The variables in the template define the input of this instance.
For example, with this template:
```
{{ documents[0].content }}
```
the component input will be `documents`.
- `output_type`: The type of output this instance will return.
- `custom_filters`: A dictionary of custom Jinja filters used in the template.
- `unsafe`: Enable execution of arbitrary code in the Jinja template.
This should only be used if you trust the source of the template, as it can lead to remote code execution.
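The core idea, rendering `{{ ... }}` expressions against the keyword arguments passed to `run`, can be sketched without Jinja. This toy `render` function evaluates each expression with `eval`, which also illustrates why templates from untrusted sources are dangerous (the concern behind the `unsafe` flag); it is not how OutputAdapter is actually implemented:

```python
import re

def render(template: str, **kwargs) -> str:
    """Toy stand-in for Jinja rendering: evaluate each {{ ... }} expression
    against the provided keyword arguments. Evaluating arbitrary expressions
    is exactly why an untrusted template is a code-execution risk."""
    def substitute(match):
        expr = match.group(1).strip()
        # Empty builtins limit (but do not eliminate) what the expression can reach.
        return str(eval(expr, {"__builtins__": {}}, kwargs))
    return re.sub(r"\{\{(.*?)\}\}", substitute, template)

documents = [{"content": "Test content"}]
print(render("{{ documents[0]['content'] }}", documents=documents))
# Test content
```

Real Jinja additionally supports attribute access (`documents[0].content`), filters, and sandboxing, which this sketch omits.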
<a id="output_adapter.OutputAdapter.run"></a>

#### OutputAdapter.run

```python
def run(**kwargs)
```

Renders the Jinja template with the provided inputs.

**Arguments**:

- `kwargs`: Must contain all variables used in the `template` string.

**Raises**:

- `OutputAdaptationException`: If template rendering fails.

**Returns**:

A dictionary with the following keys:
- `output`: Rendered Jinja template.

<a id="output_adapter.OutputAdapter.to_dict"></a>

#### OutputAdapter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="output_adapter.OutputAdapter.from_dict"></a>

#### OutputAdapter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "OutputAdapter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="pdfminer"></a>

## Module pdfminer

<a id="pdfminer.CID_PATTERN"></a>

#### CID\_PATTERN

Regex pattern to detect CID characters.

<a id="pdfminer.PDFMinerToDocument"></a>

### PDFMinerToDocument

Converts PDF files to Documents.

Uses `pdfminer`-compatible converters to convert PDF files to Documents.
https://pdfminersix.readthedocs.io/en/latest/

Usage example:
```python
from datetime import datetime

from haystack.components.converters.pdfminer import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="pdfminer.PDFMinerToDocument.__init__"></a>

#### PDFMinerToDocument.\_\_init\_\_

```python
def __init__(line_overlap: float = 0.5,
             char_margin: float = 2.0,
             line_margin: float = 0.5,
             word_margin: float = 0.1,
             boxes_flow: float | None = 0.5,
             detect_vertical: bool = True,
             all_texts: bool = False,
             store_full_path: bool = False) -> None
```

Create a PDFMinerToDocument component.

**Arguments**:

- `line_overlap`: Determines whether two characters are considered to be on
the same line based on the amount of overlap between them.
The overlap is calculated relative to the minimum height of both characters.
- `char_margin`: Determines whether two characters are part of the same line based on the distance between them.
If the distance is less than the margin specified, the characters are considered to be on the same line.
The margin is calculated relative to the width of the character.
- `word_margin`: Determines whether two characters on the same line are part of the same word
based on the distance between them. If the distance is greater than the margin specified,
an intermediate space will be added between them to make the text more readable.
The margin is calculated relative to the width of the character.
- `line_margin`: Determines whether two lines are part of the same paragraph based on
the distance between them.
If the distance is less than the margin specified,
the lines are considered to be part of the same paragraph.
The margin is calculated relative to the height of a line.
- `boxes_flow`: Determines the importance of horizontal and vertical position when
determining the order of text boxes. A value between -1.0 and +1.0 can be set,
with -1.0 indicating that only horizontal position matters and +1.0 indicating
that only vertical position matters. Setting the value to `None` disables advanced
layout analysis, and text boxes are ordered based on the position of their bottom-left corner.
- `detect_vertical`: Determines whether vertical text should be considered during layout analysis.
- `all_texts`: Whether layout analysis should be performed on text in figures.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pdfminer.PDFMinerToDocument.detect_undecoded_cid_characters"></a>

#### PDFMinerToDocument.detect\_undecoded\_cid\_characters

```python
def detect_undecoded_cid_characters(text: str) -> dict[str, Any]
```

Look for CID character sequences, that is, characters that haven't been properly decoded from their CID format.

This is useful to detect when the text extractor is not able to extract the text correctly, for example, if the
PDF uses non-standard fonts.

A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like
searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor
needs. If that map is not available, the text extractor cannot decode the CID characters and returns them
as is.
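Undecoded CID sequences show up in extracted text as literal markers such as `(cid:71)`. A rough detector can be sketched as follows; the regex and the returned keys are assumptions for illustration, not necessarily the module's actual `CID_PATTERN` or result shape:

```python
import re

# Assumed pattern for undecoded CID sequences such as "(cid:123)";
# the module's actual CID_PATTERN may differ.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def detect_cids(text: str) -> dict:
    # Count literal "(cid:N)" markers left behind when no ToUnicode
    # map was available to decode the character codes.
    matches = CID_PATTERN.findall(text)
    return {"detected": bool(matches), "cid_count": len(matches)}


print(detect_cids("Hello (cid:71)(cid:72) world"))
# {'detected': True, 'cid_count': 2}
```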
See: https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output

**Arguments**:

- `text`: The text to check for undecoded CID characters.

**Returns**:

A dictionary containing detection results.

<a id="pdfminer.PDFMinerToDocument.run"></a>

#### PDFMinerToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts PDF files to Documents.

**Arguments**:

- `sources`: List of PDF file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="pptx"></a>

## Module pptx

<a id="pptx.PPTXToDocument"></a>

### PPTXToDocument

Converts PPTX files to Documents.

Usage example:
```python
from datetime import datetime

from haystack.components.converters.pptx import PPTXToDocument

converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'
```

<a id="pptx.PPTXToDocument.__init__"></a>

#### PPTXToDocument.\_\_init\_\_

```python
def __init__(store_full_path: bool = False)
```

Create a PPTXToDocument component.

**Arguments**:

- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pptx.PPTXToDocument.run"></a>

#### PPTXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts PPTX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="pypdf"></a>

## Module pypdf

<a id="pypdf.PyPDFExtractionMode"></a>

### PyPDFExtractionMode

The mode to use for extracting text from a PDF.

<a id="pypdf.PyPDFExtractionMode.__str__"></a>

#### PyPDFExtractionMode.\_\_str\_\_

```python
def __str__() -> str
```

Convert a PyPDFExtractionMode enum to a string.
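The `__str__` and `from_str` pair behaves like a standard string-convertible enum. A rough sketch of the likely shape follows; the member names `PLAIN` and `LAYOUT` are taken from this page, but the values and error handling are assumptions, not the actual source:

```python
from enum import Enum


class PyPDFExtractionMode(Enum):
    # Member values assumed for illustration; only PLAIN and LAYOUT
    # are mentioned on this page.
    PLAIN = "plain"
    LAYOUT = "layout"

    def __str__(self) -> str:
        # Convert the enum member to its string value.
        return self.value

    @staticmethod
    def from_str(string: str) -> "PyPDFExtractionMode":
        # Convert a string back to the matching enum member.
        try:
            return PyPDFExtractionMode(string.lower())
        except ValueError as e:
            raise ValueError(f"Unknown extraction mode: {string}") from e


# Round trip: enum -> string -> enum
assert PyPDFExtractionMode.from_str(str(PyPDFExtractionMode.PLAIN)) is PyPDFExtractionMode.PLAIN
```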
<a id="pypdf.PyPDFExtractionMode.from_str"></a>

#### PyPDFExtractionMode.from\_str

```python
@staticmethod
def from_str(string: str) -> "PyPDFExtractionMode"
```

Convert a string to a PyPDFExtractionMode enum.

<a id="pypdf.PyPDFToDocument"></a>

### PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses the PyPDF library.
You can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="pypdf.PyPDFToDocument.__init__"></a>

#### PyPDFToDocument.\_\_init\_\_

```python
def __init__(*,
             extraction_mode: str
             | PyPDFExtractionMode = PyPDFExtractionMode.PLAIN,
             plain_mode_orientations: tuple = (0, 90, 180, 270),
             plain_mode_space_width: float = 200.0,
             layout_mode_space_vertically: bool = True,
             layout_mode_scale_weight: float = 1.25,
             layout_mode_strip_rotated: bool = True,
             layout_mode_font_height_weight: float = 1.0,
             store_full_path: bool = False)
```

Create a PyPDFToDocument component.

**Arguments**:

- `extraction_mode`: The mode to use for extracting text from a PDF.
Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
- `plain_mode_orientations`: Tuple of orientations to look for when extracting text from a PDF in plain mode.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- `plain_mode_space_width`: Forces default space width if not extracted from font.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- `layout_mode_space_vertically`: Whether to include blank lines inferred from y distance + font height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_scale_weight`: Multiplier for string length when calculating weighted average character width.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_strip_rotated`: Layout mode does not support rotated text. Set to `False` to include rotated text anyway.
If rotated text is discovered, layout will be degraded and a warning will be logged.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_font_height_weight`: Multiplier for font height when calculating blank line height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pypdf.PyPDFToDocument.to_dict"></a>

#### PyPDFToDocument.to\_dict

```python
def to_dict()
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="pypdf.PyPDFToDocument.from_dict"></a>

#### PyPDFToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data)
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

Deserialized component.

<a id="pypdf.PyPDFToDocument.run"></a>

#### PyPDFToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts PDF files to documents.
**Arguments**:

- `sources`: List of file paths or ByteStream objects to convert.
- `meta`: Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they are zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of converted documents.

<a id="tika"></a>

## Module tika

<a id="tika.XHTMLParser"></a>

### XHTMLParser

Custom parser to extract pages from Tika XHTML content.

<a id="tika.XHTMLParser.handle_starttag"></a>

#### XHTMLParser.handle\_starttag

```python
def handle_starttag(tag: str, attrs: list[tuple])
```

Identify the start of a page div.

<a id="tika.XHTMLParser.handle_endtag"></a>

#### XHTMLParser.handle\_endtag

```python
def handle_endtag(tag: str)
```

Identify the end of a page div.

<a id="tika.XHTMLParser.handle_data"></a>

#### XHTMLParser.handle\_data

```python
def handle_data(data: str)
```

Populate the page content.

<a id="tika.TikaDocumentConverter"></a>

### TikaDocumentConverter

Converts files of different types to Documents.

This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
requires a running Tika server.
For more options on running Tika,
see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).
Usage example:
```python
from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'
```

<a id="tika.TikaDocumentConverter.__init__"></a>

#### TikaDocumentConverter.\_\_init\_\_

```python
def __init__(tika_url: str = "http://localhost:9998/tika",
             store_full_path: bool = False)
```

Create a TikaDocumentConverter component.

**Arguments**:

- `tika_url`: Tika server URL.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="tika.TikaDocumentConverter.run"></a>

#### TikaDocumentConverter.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="txt"></a>

## Module txt

<a id="txt.TextFileToDocument"></a>

### TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'
```

<a id="txt.TextFileToDocument.__init__"></a>

#### TextFileToDocument.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8", store_full_path: bool = False)
```

Creates a TextFileToDocument component.

**Arguments**:

- `encoding`: The encoding of the text files to convert.
If the encoding is specified in the metadata of a source ByteStream,
it overrides this value.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="txt.TextFileToDocument.run"></a>

#### TextFileToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts text files to documents.

**Arguments**:

- `sources`: List of text file paths or ByteStream objects to convert.
- `meta`: Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources as they're zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of converted documents.

<a id="xlsx"></a>

## Module xlsx

<a id="xlsx.XLSXToDocument"></a>

### XLSXToDocument

Converts XLSX (Excel) files into Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is
created for each sheet. The content of the Document is the table, which can be saved in CSV or Markdown format.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.xlsx import XLSXToDocument

converter = XLSXToDocument()
results = converter.run(sources=["sample.xlsx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# ",A,B
# 1,col_a,col_b
# 2,1.5,test
# "
```

<a id="xlsx.XLSXToDocument.__init__"></a>

#### XLSXToDocument.\_\_init\_\_

```python
def __init__(table_format: Literal["csv", "markdown"] = "csv",
             sheet_name: str | int | list[str | int] | None = None,
             read_excel_kwargs: dict[str, Any] | None = None,
             table_format_kwargs: dict[str, Any] | None = None,
             *,
             store_full_path: bool = False)
```

Creates an XLSXToDocument component.

**Arguments**:

- `table_format`: The format to convert the Excel file to.
- `sheet_name`: The name of the sheet to read. If None, all sheets are read.
- `read_excel_kwargs`: Additional arguments to pass to `pandas.read_excel`.
See https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel
- `table_format_kwargs`: Additional keyword arguments to pass to the table format function.
  - If `table_format` is "csv", these arguments are passed to `pandas.DataFrame.to_csv`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
  - If `table_format` is "markdown", these arguments are passed to `pandas.DataFrame.to_markdown`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="xlsx.XLSXToDocument.run"></a>

#### XLSXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document]]
```

Converts XLSX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created documents
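The `meta` broadcasting and zipping rules that every `run` method above documents can be sketched in isolation. This is a plain-Python illustration of the described behavior under stated assumptions, not the library's actual helper:

```python
def normalize_metadata(meta, sources_count: int) -> list[dict]:
    # A single dict is broadcast: every source gets a copy of it.
    # A list must line up one-to-one with the sources so the two
    # can be zipped; a mismatch is an error.
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("Length of `meta` must match the number of sources.")
    return meta


print(normalize_metadata({"lang": "en"}, 2))
# [{'lang': 'en'}, {'lang': 'en'}]
```

A list input such as `[{"page": 1}, {"page": 2}]` with two sources passes through unchanged, giving each resulting Document its own metadata entry.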