converters_api.md
---
title: "Converters"
id: converters-api
description: "Various converters to transform data from one format to another."
slug: "/converters-api"
---

<a id="azure"></a>

## Module azure

<a id="azure.AzureOCRDocumentConverter"></a>

### AzureOCRDocumentConverter

Converts files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see the
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

```python
from datetime import datetime

from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(endpoint="<url>", api_key=Secret.from_token("<your-api-key>"))
results = converter.run(sources=["path/to/doc_with_images.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="azure.AzureOCRDocumentConverter.__init__"></a>

#### AzureOCRDocumentConverter.\_\_init\_\_

```python
def __init__(endpoint: str,
             api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
             model_id: str = "prebuilt-read",
             preceding_context_len: int = 3,
             following_context_len: int = 3,
             merge_multiple_column_headers: bool = True,
             page_layout: Literal["natural", "single_column"] = "natural",
             threshold_y: Optional[float] = 0.05,
             store_full_path: bool = False)
```

Creates an AzureOCRDocumentConverter component.

**Arguments**:

- `endpoint`: The endpoint of your Azure resource.
- `api_key`: The API key of your Azure resource.
- `model_id`: The ID of the model you want to use. For a list of available models, see the
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
- `preceding_context_len`: Number of lines before a table to include as preceding context
(this will be added to the metadata).
- `following_context_len`: Number of lines after a table to include as subsequent context
(this will be added to the metadata).
- `merge_multiple_column_headers`: If `True`, merges multiple column header rows into a single row.
- `page_layout`: The type of reading order to follow. Possible options:
  - `natural`: Uses the natural reading order determined by Azure.
  - `single_column`: Groups all lines with the same height on the page based on a threshold
  determined by `threshold_y`.
- `threshold_y`: Only relevant if `page_layout` is set to `single_column`.
The threshold, in inches, that determines whether two recognized PDF elements are grouped into a
single line. This is crucial for section headers or numbers, which may be spatially separated
from the remaining text on the horizontal axis.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="azure.AzureOCRDocumentConverter.run"></a>

#### AzureOCRDocumentConverter.run

```python
@component.output_types(documents=list[Document],
                        raw_azure_response=list[dict])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will be
zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: List of created Documents
- `raw_azure_response`: List of raw Azure responses used to create the Documents

<a id="azure.AzureOCRDocumentConverter.to_dict"></a>

#### AzureOCRDocumentConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="azure.AzureOCRDocumentConverter.from_dict"></a>

#### AzureOCRDocumentConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AzureOCRDocumentConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="csv"></a>

## Module csv

<a id="csv.CSVToDocument"></a>

### CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.csv import CSVToDocument

converter = CSVToDocument()
results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'col1,col2\nrow1,row1\nrow2,row2\n'
```

<a id="csv.CSVToDocument.__init__"></a>

#### CSVToDocument.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8",
             store_full_path: bool = False,
             *,
             conversion_mode: Literal["file", "row"] = "file",
             delimiter: str = ",",
             quotechar: str = '"')
```

Creates a CSVToDocument component.

**Arguments**:

- `encoding`: The encoding of the CSV files to convert.
If the encoding is specified in the metadata of a source ByteStream,
it overrides this value.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.
- `conversion_mode`: The conversion mode. Possible options:
  - `"file"` (default): One Document per CSV file, whose content is the raw CSV text.
  - `"row"`: One Document per CSV row (requires `content_column` in `run()`).
- `delimiter`: CSV delimiter used when parsing in row mode (passed to `csv.DictReader`).
- `quotechar`: CSV quote character used when parsing in row mode (passed to `csv.DictReader`).

<a id="csv.CSVToDocument.run"></a>

#### CSVToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        *,
        content_column: Optional[str] = None,
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts CSV files to a Document (file mode) or to one Document per row (row mode).

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `content_column`: **Required when** `conversion_mode="row"`.
The column name whose values become `Document.content` for each row.
The column must exist in the CSV header.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created documents

<a id="docx"></a>

## Module docx

<a id="docx.DOCXMetadata"></a>

### DOCXMetadata

Describes the metadata of a DOCX file.

**Arguments**:

- `author`: The author
- `category`: The category
- `comments`: The comments
- `content_status`: The content status
- `created`: The creation date (ISO formatted string)
- `identifier`: The identifier
- `keywords`: Available keywords
- `language`: The language of the document
- `last_modified_by`: User who last modified the document
- `last_printed`: The last printed date (ISO formatted string)
- `modified`: The last modification date (ISO formatted string)
- `revision`: The revision number
- `subject`: The subject
- `title`: The title
- `version`: The version

<a id="docx.DOCXTableFormat"></a>

### DOCXTableFormat

Supported formats for storing DOCX tabular data in a Document.

<a id="docx.DOCXTableFormat.from_str"></a>

#### DOCXTableFormat.from\_str

```python
@staticmethod
def from_str(string: str) -> "DOCXTableFormat"
```

Convert a string to a DOCXTableFormat enum.
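
The `from_str` helper follows a common Python enum idiom: match the input string against each member's value and raise a helpful error for unknown inputs. A minimal sketch of that pattern with a hypothetical `TableFormat` enum (not the actual Haystack class):

```python
from enum import Enum


class TableFormat(Enum):
    """Hypothetical enum illustrating the from_str pattern used by DOCXTableFormat."""

    MARKDOWN = "markdown"
    CSV = "csv"

    @staticmethod
    def from_str(string: str) -> "TableFormat":
        # Compare the lowercased input against each member's string value.
        for member in TableFormat:
            if member.value == string.lower():
                return member
        raise ValueError(f"Unknown table format '{string}'")


print(TableFormat.from_str("CSV"))  # TableFormat.CSV
```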

<a id="docx.DOCXLinkFormat"></a>

### DOCXLinkFormat

Supported formats for storing DOCX link information in a Document.

<a id="docx.DOCXLinkFormat.from_str"></a>

#### DOCXLinkFormat.from\_str

```python
@staticmethod
def from_str(string: str) -> "DOCXLinkFormat"
```

Convert a string to a DOCXLinkFormat enum.

<a id="docx.DOCXToDocument"></a>

### DOCXToDocument

Converts DOCX files to Documents.

Uses the `python-docx` library to convert the DOCX file to a document.
This component does not preserve page breaks in the original document.

Usage example:
```python
from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'
```

<a id="docx.DOCXToDocument.__init__"></a>

#### DOCXToDocument.\_\_init\_\_

```python
def __init__(table_format: Union[str, DOCXTableFormat] = DOCXTableFormat.CSV,
             link_format: Union[str, DOCXLinkFormat] = DOCXLinkFormat.NONE,
             store_full_path: bool = False)
```

Create a DOCXToDocument component.

**Arguments**:

- `table_format`: The format for table output. Can be either DOCXTableFormat.MARKDOWN,
DOCXTableFormat.CSV, "markdown", or "csv".
- `link_format`: The format for link output. Can be either:
DOCXLinkFormat.MARKDOWN or "markdown" to get `[text](address)`,
DOCXLinkFormat.PLAIN or "plain" to get `text (address)`,
DOCXLinkFormat.NONE or "none" to get text without links.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="docx.DOCXToDocument.to_dict"></a>

#### DOCXToDocument.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="docx.DOCXToDocument.from_dict"></a>

#### DOCXToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DOCXToDocument"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="docx.DOCXToDocument.run"></a>

#### DOCXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts DOCX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="html"></a>

## Module html

<a id="html.HTMLToDocument"></a>

### HTMLToDocument

Converts an HTML file to a Document.

Usage example:
```python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'
```

<a id="html.HTMLToDocument.__init__"></a>

#### HTMLToDocument.\_\_init\_\_

```python
def __init__(extraction_kwargs: Optional[dict[str, Any]] = None,
             store_full_path: bool = False)
```

Create an HTMLToDocument component.

**Arguments**:

- `extraction_kwargs`: A dictionary containing keyword arguments to customize the extraction process. These
are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see
the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="html.HTMLToDocument.to_dict"></a>

#### HTMLToDocument.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="html.HTMLToDocument.from_dict"></a>

#### HTMLToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HTMLToDocument"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.
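
The `to_dict`/`from_dict` pair above follows the same serialization convention used across these components: `to_dict` records the init parameters, and `from_dict` reconstructs an equivalent instance from them. A minimal sketch of that round-trip idea with a hypothetical plain Python class (not Haystack's actual implementation):

```python
from typing import Any, Optional


class ToyConverter:
    """Hypothetical component illustrating the to_dict/from_dict round trip."""

    def __init__(self, extraction_kwargs: Optional[dict[str, Any]] = None, store_full_path: bool = False):
        self.extraction_kwargs = extraction_kwargs or {}
        self.store_full_path = store_full_path

    def to_dict(self) -> dict[str, Any]:
        # Record the init parameters so an equivalent instance can be rebuilt later.
        return {
            "init_parameters": {
                "extraction_kwargs": self.extraction_kwargs,
                "store_full_path": self.store_full_path,
            }
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ToyConverter":
        return cls(**data["init_parameters"])


original = ToyConverter(extraction_kwargs={"favor_precision": True})
restored = ToyConverter.from_dict(original.to_dict())
print(restored.extraction_kwargs)  # {'favor_precision': True}
```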

<a id="html.HTMLToDocument.run"></a>

#### HTMLToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None,
        extraction_kwargs: Optional[dict[str, Any]] = None)
```

Converts a list of HTML files to Documents.

**Arguments**:

- `sources`: List of HTML file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
- `extraction_kwargs`: Additional keyword arguments to customize the extraction process.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="json"></a>

## Module json

<a id="json.JSONConverter"></a>

### JSONConverter

Converts one or more JSON files into a text document.

### Usage examples

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'
```

Optionally, you can also provide a `jq_schema` string to filter the JSON source files and `extra_meta_fields`
to extract from the filtered data:

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```

<a id="json.JSONConverter.__init__"></a>

#### JSONConverter.\_\_init\_\_

```python
def __init__(jq_schema: Optional[str] = None,
             content_key: Optional[str] = None,
             extra_meta_fields: Optional[Union[set[str], Literal["*"]]] = None,
             store_full_path: bool = False)
```

Creates a JSONConverter component.

An optional `jq_schema` can be provided to extract nested data in the JSON source files.
See the [official jq documentation](https://jqlang.github.io/jq/) for more info on the filters syntax.
If `jq_schema` is not set, whole JSON source files will be used to extract content.

Optionally, you can provide a `content_key` to specify which key in the extracted object must
be set as the document's content.

If both `jq_schema` and `content_key` are set, the component will search for the `content_key` in
the JSON object extracted by `jq_schema`. If the extracted data is not a JSON object, it will be skipped.

If only `jq_schema` is set, the extracted data must be a scalar value. If it's a JSON object or array,
it will be skipped.

If only `content_key` is set, the source JSON file must be a JSON object, else it will be skipped.

`extra_meta_fields` can either be set to a set of strings or the literal string `"*"`.
If it's a set of strings, it must specify fields in the extracted objects that must be set in
the extracted documents. If a field is not found, the meta value will be `None`.
If set to `"*"`, all fields other than `content_key` found in the filtered JSON object will
be saved as metadata.

Initialization will fail if neither `jq_schema` nor `content_key` is set.

**Arguments**:

- `jq_schema`: Optional jq filter string to extract content.
If not specified, the whole JSON object will be used to extract information.
- `content_key`: Optional key to extract document content.
If `jq_schema` is specified, the `content_key` will be extracted from that object.
- `extra_meta_fields`: An optional set of meta keys to extract from the content.
If `jq_schema` is specified, all keys will be extracted from that object.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="json.JSONConverter.to_dict"></a>

#### JSONConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="json.JSONConverter.from_dict"></a>

#### JSONConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "JSONConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="json.JSONConverter.run"></a>

#### JSONConverter.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts a list of JSON files to documents.

**Arguments**:

- `sources`: A list of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of created documents.

<a id="markdown"></a>

## Module markdown

<a id="markdown.MarkdownToDocument"></a>

### MarkdownToDocument

Converts a Markdown file into a text Document.

Usage example:
```python
from datetime import datetime

from haystack.components.converters import MarkdownToDocument

converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'
```

<a id="markdown.MarkdownToDocument.__init__"></a>

#### MarkdownToDocument.\_\_init\_\_

```python
def __init__(table_to_single_line: bool = False,
             progress_bar: bool = True,
             store_full_path: bool = False)
```

Create a MarkdownToDocument component.

**Arguments**:

- `table_to_single_line`: If True, converts table contents into a single line.
- `progress_bar`: If True, shows a progress bar when running.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="markdown.MarkdownToDocument.run"></a>

#### MarkdownToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts a list of Markdown files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: List of created Documents

<a id="msg"></a>

## Module msg

<a id="msg.MSGToDocument"></a>

### MSGToDocument

Converts Microsoft Outlook .msg files into Haystack Documents.

This component extracts email metadata (such as sender, recipients, CC, BCC, and subject) and body content from .msg
files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg
file are extracted as ByteStream objects.

### Example Usage

```python
from datetime import datetime

from haystack.components.converters.msg import MSGToDocument

converter = MSGToDocument()
results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
attachments = results["attachments"]
print(documents[0].content)
```

<a id="msg.MSGToDocument.__init__"></a>

#### MSGToDocument.\_\_init\_\_

```python
def __init__(store_full_path: bool = False) -> None
```

Creates a MSGToDocument component.

**Arguments**:

- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.
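
The `meta` argument accepted by the `run` methods throughout this document follows one rule: a single dict is shared by every produced Document, while a list must line up one-to-one with `sources`. A simplified sketch of that normalization (a hypothetical helper, not Haystack's actual code):

```python
from typing import Any, Optional, Union


def normalize_meta(
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]], num_sources: int
) -> list[dict[str, Any]]:
    """Simplified sketch of the dict-or-list meta normalization, not Haystack's real helper."""
    if meta is None:
        return [{} for _ in range(num_sources)]
    if isinstance(meta, dict):
        # A single dict is copied onto every produced document.
        return [dict(meta) for _ in range(num_sources)]
    if len(meta) != num_sources:
        raise ValueError("The length of the metadata list must match the number of sources.")
    return meta


sources = ["a.msg", "b.msg"]
print(normalize_meta({"team": "sales"}, len(sources)))
# [{'team': 'sales'}, {'team': 'sales'}]
```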

<a id="msg.MSGToDocument.run"></a>

#### MSGToDocument.run

```python
@component.output_types(documents=list[Document], attachments=list[ByteStream])
def run(
    sources: list[Union[str, Path, ByteStream]],
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
) -> dict[str, Union[list[Document], list[ByteStream]]]
```

Converts MSG files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents.
- `attachments`: Created ByteStream objects from file attachments.

<a id="multi_file_converter"></a>

## Module multi\_file\_converter

<a id="multi_file_converter.MultiFileConverter"></a>

### MultiFileConverter

A file converter that handles conversion of multiple file types.

The MultiFileConverter handles the following file types:
- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX

Usage example:
```python
from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
```

<a id="multi_file_converter.MultiFileConverter.__init__"></a>

#### MultiFileConverter.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8",
             json_content_key: str = "content") -> None
```

Initialize the MultiFileConverter.

**Arguments**:

- `encoding`: The encoding to use when reading files.
- `json_content_key`: The key to use for the content field of a document when converting JSON files.

<a id="openapi_functions"></a>

## Module openapi\_functions

<a id="openapi_functions.OpenAPIServiceToFunctions"></a>

### OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must respect OpenAPI specification 3.0.0 or higher.
It can be specified in JSON or YAML format.
Each function must have:
- a unique operationId
- a description
- a requestBody and/or parameters
- a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the [official documentation](https://github.com/OAI/OpenAPI-Specification).
For more details on OpenAI function calling, see the [official documentation](https://platform.openai.com/docs/guides/function-calling).
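
Conceptually, each OpenAPI operation becomes one function definition: `operationId` supplies the function name, `description` carries over, and the parameter schemas become the function's JSON-schema parameters. A simplified sketch of that mapping (a hypothetical helper covering query parameters only, not the component's actual code):

```python
from typing import Any


def operation_to_function(operation: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical sketch: map one OpenAPI operation to an OpenAI-style function definition."""
    # Collect each parameter's JSON schema, keyed by parameter name.
    properties = {p["name"]: p.get("schema", {}) for p in operation.get("parameters", [])}
    return {
        "name": operation["operationId"],
        "description": operation["description"],
        "parameters": {"type": "object", "properties": properties},
    }


op = {
    "operationId": "getWeather",
    "description": "Get the weather for a city.",
    "parameters": [{"name": "city", "in": "query", "schema": {"type": "string"}}],
}
print(operation_to_function(op)["name"])  # getWeather
```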

Usage example:
```python
from haystack.components.converters import OpenAPIServiceToFunctions

converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]
```

<a id="openapi_functions.OpenAPIServiceToFunctions.__init__"></a>

#### OpenAPIServiceToFunctions.\_\_init\_\_

```python
def __init__()
```

Create an OpenAPIServiceToFunctions component.

<a id="openapi_functions.OpenAPIServiceToFunctions.run"></a>

#### OpenAPIServiceToFunctions.run

```python
@component.output_types(functions=list[dict[str, Any]],
                        openapi_specs=list[dict[str, Any]])
def run(sources: list[Union[str, Path, ByteStream]]) -> dict[str, Any]
```

Converts OpenAPI definitions into OpenAI function calling format.

**Arguments**:

- `sources`: File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

**Raises**:

- `RuntimeError`: If the OpenAPI definitions cannot be downloaded or processed.
- `ValueError`: If the source type is not recognized or no functions are found in the OpenAPI definitions.

**Returns**:

A dictionary with the following keys:
- `functions`: Function definitions in JSON object format
- `openapi_specs`: OpenAPI specs in JSON/YAML object format with resolved references

<a id="output_adapter"></a>

## Module output\_adapter

<a id="output_adapter.OutputAdaptationException"></a>

### OutputAdaptationException

Exception raised when there is an error during output adaptation.

<a id="output_adapter.OutputAdapter"></a>

### OutputAdapter

Adapts output of a Component using Jinja templates.

Usage example:
```python
from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"
```

<a id="output_adapter.OutputAdapter.__init__"></a>

#### OutputAdapter.\_\_init\_\_

```python
def __init__(template: str,
             output_type: TypeAlias,
             custom_filters: Optional[dict[str, Callable]] = None,
             unsafe: bool = False)
```

Create an OutputAdapter component.

**Arguments**:

- `template`: A Jinja template that defines how to adapt the input data.
The variables in the template define the input of this instance.
For example, with this template:
```
{{ documents[0].content }}
```
the component input will be `documents`.
- `output_type`: The type of output this instance will return.
- `custom_filters`: A dictionary of custom Jinja filters used in the template.
- `unsafe`: Enable execution of arbitrary code in the Jinja template.
Only use this if you trust the source of the template, as it can lead to remote code execution.

<a id="output_adapter.OutputAdapter.run"></a>

#### OutputAdapter.run

```python
def run(**kwargs)
```

Renders the Jinja template with the provided inputs.

**Arguments**:

- `kwargs`: Must contain all variables used in the `template` string.

**Raises**:

- `OutputAdaptationException`: If template rendering fails.

**Returns**:

A dictionary with the following keys:
- `output`: Rendered Jinja template.
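
Because the template is plain Jinja, a `custom_filters` entry behaves like any Jinja filter registration. A minimal sketch using `jinja2` directly to show the mechanism (assumes Jinja2 is installed; `shout` is a made-up filter name):

```python
from jinja2 import Environment

env = Environment()
# Registering a filter here mirrors passing {"shout": ...} as custom_filters.
env.filters["shout"] = lambda value: value.upper()

template = env.from_string("{{ documents[0] | shout }}")
print(template.render(documents=["test content"]))  # TEST CONTENT
```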
<a id="output_adapter.OutputAdapter.to_dict"></a>

#### OutputAdapter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="output_adapter.OutputAdapter.from_dict"></a>

#### OutputAdapter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "OutputAdapter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="pdfminer"></a>

## Module pdfminer

<a id="pdfminer.CID_PATTERN"></a>

#### CID\_PATTERN

Regex pattern to detect CID characters.

<a id="pdfminer.PDFMinerToDocument"></a>

### PDFMinerToDocument

Converts PDF files to Documents.

Uses `pdfminer`-compatible converters to convert PDF files to Documents.
See the [pdfminer documentation](https://pdfminersix.readthedocs.io/en/latest/).

Usage example:
```python
from datetime import datetime

from haystack.components.converters.pdfminer import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="pdfminer.PDFMinerToDocument.__init__"></a>

#### PDFMinerToDocument.\_\_init\_\_

```python
def __init__(line_overlap: float = 0.5,
             char_margin: float = 2.0,
             line_margin: float = 0.5,
             word_margin: float = 0.1,
             boxes_flow: Optional[float] = 0.5,
             detect_vertical: bool = True,
             all_texts: bool = False,
             store_full_path: bool = False) -> None
```

Create a PDFMinerToDocument component.
**Arguments**:

- `line_overlap`: Determines whether two characters are considered to be on the same line
based on the amount of overlap between them.
The overlap is calculated relative to the minimum height of both characters.
- `char_margin`: Determines whether two characters are part of the same line based on the distance between them.
If the distance is less than the specified margin, the characters are considered to be on the same line.
The margin is calculated relative to the width of the character.
- `word_margin`: Determines whether two characters on the same line are part of the same word
based on the distance between them. If the distance is greater than the specified margin,
an intermediate space is added between them to make the text more readable.
The margin is calculated relative to the width of the character.
- `line_margin`: Determines whether two lines are part of the same paragraph based on
the distance between them. If the distance is less than the specified margin,
the lines are considered to be part of the same paragraph.
The margin is calculated relative to the height of a line.
- `boxes_flow`: Determines the importance of horizontal and vertical position when
ordering text boxes. Accepts a value between -1.0 and +1.0,
where -1.0 means only horizontal position matters and +1.0 means
only vertical position matters. Setting the value to `None` disables advanced
layout analysis, and text boxes are ordered by the position of their bottom-left corner.
- `detect_vertical`: Whether vertical text should be considered during layout analysis.
- `all_texts`: Whether layout analysis should be performed on text in figures.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pdfminer.PDFMinerToDocument.detect_undecoded_cid_characters"></a>

#### PDFMinerToDocument.detect\_undecoded\_cid\_characters

```python
def detect_undecoded_cid_characters(text: str) -> dict[str, Any]
```

Looks for CID character sequences, that is, characters that haven't been properly decoded from their CID format.

This is useful to detect whether the text extractor failed to extract the text correctly, for example if the PDF
uses non-standard fonts.

A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like
searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor
needs. If that map is not available, the text extractor cannot decode the CID characters and returns them
as is.

See: https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output

**Arguments**:

- `text`: The text to check for undecoded CID characters.

**Returns**:

A dictionary containing the detection results.

<a id="pdfminer.PDFMinerToDocument.run"></a>

#### PDFMinerToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts PDF files to Documents.

**Arguments**:

- `sources`: List of PDF file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="pptx"></a>

## Module pptx

<a id="pptx.PPTXToDocument"></a>

### PPTXToDocument

Converts PPTX files to Documents.

Usage example:
```python
from datetime import datetime

from haystack.components.converters.pptx import PPTXToDocument

converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'
```

<a id="pptx.PPTXToDocument.__init__"></a>

#### PPTXToDocument.\_\_init\_\_

```python
def __init__(store_full_path: bool = False)
```

Create a PPTXToDocument component.

**Arguments**:

- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pptx.PPTXToDocument.run"></a>

#### PPTXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts PPTX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="pypdf"></a>

## Module pypdf

<a id="pypdf.PyPDFExtractionMode"></a>

### PyPDFExtractionMode

The mode to use for extracting text from a PDF.

<a id="pypdf.PyPDFExtractionMode.__str__"></a>

#### PyPDFExtractionMode.\_\_str\_\_

```python
def __str__() -> str
```

Convert a PyPDFExtractionMode enum to a string.

<a id="pypdf.PyPDFExtractionMode.from_str"></a>

#### PyPDFExtractionMode.from\_str

```python
@staticmethod
def from_str(string: str) -> "PyPDFExtractionMode"
```

Convert a string to a PyPDFExtractionMode enum.

<a id="pypdf.PyPDFToDocument"></a>

### PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses the PyPDF library.
You can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="pypdf.PyPDFToDocument.__init__"></a>

#### PyPDFToDocument.\_\_init\_\_

```python
def __init__(*,
             extraction_mode: Union[
                 str, PyPDFExtractionMode] = PyPDFExtractionMode.PLAIN,
             plain_mode_orientations: tuple = (0, 90, 180, 270),
             plain_mode_space_width: float = 200.0,
             layout_mode_space_vertically: bool = True,
             layout_mode_scale_weight: float = 1.25,
             layout_mode_strip_rotated: bool = True,
             layout_mode_font_height_weight: float = 1.0,
             store_full_path: bool = False)
```

Create a PyPDFToDocument component.

**Arguments**:

- `extraction_mode`: The mode to use for extracting text from a PDF.
Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
- `plain_mode_orientations`: Tuple of orientations to look for when extracting text from a PDF in plain mode.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- `plain_mode_space_width`: Forces a default space width if it cannot be extracted from the font.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- `layout_mode_space_vertically`: Whether to include blank lines inferred from y distance plus font height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_scale_weight`: Multiplier for string length when calculating the weighted average character width.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_strip_rotated`: Layout mode does not support rotated text. Set to `False` to include rotated text anyway.
If rotated text is discovered, layout will be degraded and a warning will be logged.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_font_height_weight`: Multiplier for font height when calculating blank line height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pypdf.PyPDFToDocument.to_dict"></a>

#### PyPDFToDocument.to\_dict

```python
def to_dict()
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="pypdf.PyPDFToDocument.from_dict"></a>

#### PyPDFToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data)
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

Deserialized component.

<a id="pypdf.PyPDFToDocument.run"></a>

#### PyPDFToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts PDF files to documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects to convert.
- `meta`: Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they are zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of converted documents.

<a id="tika"></a>

## Module tika

<a id="tika.XHTMLParser"></a>

### XHTMLParser

Custom parser to extract pages from Tika XHTML content.
<a id="tika.XHTMLParser.handle_starttag"></a>

#### XHTMLParser.handle\_starttag

```python
def handle_starttag(tag: str, attrs: list[tuple])
```

Identify the start of a page div.

<a id="tika.XHTMLParser.handle_endtag"></a>

#### XHTMLParser.handle\_endtag

```python
def handle_endtag(tag: str)
```

Identify the end of a page div.

<a id="tika.XHTMLParser.handle_data"></a>

#### XHTMLParser.handle\_data

```python
def handle_data(data: str)
```

Populate the page content.

<a id="tika.TikaDocumentConverter"></a>

### TikaDocumentConverter

Converts files of different types to Documents.

This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
requires a running Tika server.
For more options on running Tika,
see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

Usage example:
```python
from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'
```

<a id="tika.TikaDocumentConverter.__init__"></a>

#### TikaDocumentConverter.\_\_init\_\_

```python
def __init__(tika_url: str = "http://localhost:9998/tika",
             store_full_path: bool = False)
```

Create a TikaDocumentConverter component.

**Arguments**:

- `tika_url`: Tika server URL.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.
<a id="tika.TikaDocumentConverter.run"></a>

#### TikaDocumentConverter.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="txt"></a>

## Module txt

<a id="txt.TextFileToDocument"></a>

### TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'
```

<a id="txt.TextFileToDocument.__init__"></a>

#### TextFileToDocument.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8", store_full_path: bool = False)
```

Creates a TextFileToDocument component.
**Arguments**:

- `encoding`: The encoding of the text files to convert.
If the encoding is specified in the metadata of a source ByteStream,
it overrides this value.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="txt.TextFileToDocument.run"></a>

#### TextFileToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts text files to documents.

**Arguments**:

- `sources`: List of text file paths or ByteStream objects to convert.
- `meta`: Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they're zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of converted documents.

<a id="xlsx"></a>

## Module xlsx

<a id="xlsx.XLSXToDocument"></a>

### XLSXToDocument

Converts XLSX (Excel) files into Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is
created for each sheet. The content of the Document is the table, which can be saved in CSV or Markdown format.
### Usage example

```python
from datetime import datetime

from haystack.components.converters.xlsx import XLSXToDocument

converter = XLSXToDocument()
results = converter.run(sources=["sample.xlsx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# ',A,B\n1,col_a,col_b\n2,1.5,test\n'
```

<a id="xlsx.XLSXToDocument.__init__"></a>

#### XLSXToDocument.\_\_init\_\_

```python
def __init__(table_format: Literal["csv", "markdown"] = "csv",
             sheet_name: Union[str, int, list[Union[str, int]], None] = None,
             read_excel_kwargs: Optional[dict[str, Any]] = None,
             table_format_kwargs: Optional[dict[str, Any]] = None,
             *,
             store_full_path: bool = False)
```

Creates an XLSXToDocument component.

**Arguments**:

- `table_format`: The format to convert the Excel file to.
- `sheet_name`: The name of the sheet to read. If None, all sheets are read.
- `read_excel_kwargs`: Additional arguments to pass to `pandas.read_excel`.
See https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel
- `table_format_kwargs`: Additional keyword arguments to pass to the table format function.
  - If `table_format` is "csv", these arguments are passed to `pandas.DataFrame.to_csv`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
  - If `table_format` is "markdown", these arguments are passed to `pandas.DataFrame.to_markdown`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.
<a id="xlsx.XLSXToDocument.run"></a>

#### XLSXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(
    sources: list[Union[str, Path, ByteStream]],
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
) -> dict[str, list[Document]]
```

Converts XLSX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created documents