---
title: "Converters"
id: converters-api
description: "Various converters to transform data from one format to another."
slug: "/converters-api"
---

<a id="azure"></a>

## Module azure

<a id="azure.AzureOCRDocumentConverter"></a>

### AzureOCRDocumentConverter

Converts files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see the
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

```python
import os
from datetime import datetime

from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(
    endpoint=os.environ["CORE_AZURE_CS_ENDPOINT"],
    api_key=Secret.from_env_var("CORE_AZURE_CS_API_KEY"),
)
results = converter.run(
    sources=["test/test_files/pdf/react_paper.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="azure.AzureOCRDocumentConverter.__init__"></a>

#### AzureOCRDocumentConverter.\_\_init\_\_

```python
def __init__(endpoint: str,
             api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
             model_id: str = "prebuilt-read",
             preceding_context_len: int = 3,
             following_context_len: int = 3,
             merge_multiple_column_headers: bool = True,
             page_layout: Literal["natural", "single_column"] = "natural",
             threshold_y: float | None = 0.05,
             store_full_path: bool = False)
```

Creates an AzureOCRDocumentConverter component.
**Arguments**:

- `endpoint`: The endpoint of your Azure resource.
- `api_key`: The API key of your Azure resource.
- `model_id`: The ID of the model you want to use. For a list of available models, see the
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
- `preceding_context_len`: Number of lines before a table to include as preceding context
(this will be added to the metadata).
- `following_context_len`: Number of lines after a table to include as subsequent context
(this will be added to the metadata).
- `merge_multiple_column_headers`: If `True`, merges multiple column header rows into a single row.
- `page_layout`: The type of reading order to follow. Possible options:
  - `natural`: Uses the natural reading order determined by Azure.
  - `single_column`: Groups all lines with the same height on the page based on a threshold
  determined by `threshold_y`.
- `threshold_y`: Only relevant if `page_layout` is set to `single_column`.
The threshold, in inches, that determines whether two recognized PDF elements are grouped into a
single line. This is crucial for section headers or numbers, which may be spatially separated
from the rest of the text on the horizontal axis.
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.

<a id="azure.AzureOCRDocumentConverter.run"></a>

#### AzureOCRDocumentConverter.run

```python
@component.output_types(documents=list[Document],
                        raw_azure_response=list[dict])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will be
zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: List of created Documents
- `raw_azure_response`: List of raw Azure responses used to create the Documents

<a id="azure.AzureOCRDocumentConverter.to_dict"></a>

#### AzureOCRDocumentConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="azure.AzureOCRDocumentConverter.from_dict"></a>

#### AzureOCRDocumentConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AzureOCRDocumentConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="csv"></a>

## Module csv

<a id="csv.CSVToDocument"></a>

### CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.
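The `meta` handling shared by these converters' `run` methods (a single dict copied into every document, or a per-source list zipped with `sources`) can be sketched with plain dictionaries. `normalize_meta` below is a hypothetical helper for illustration, not a Haystack API:

```python
# Sketch of the shared meta rule: one dict for all sources, or one dict per source.
# `normalize_meta` is a hypothetical helper, not part of Haystack.

def normalize_meta(meta, n_sources):
    """Expand a single dict into one copy per source; validate list length."""
    if meta is None:
        return [{} for _ in range(n_sources)]
    if isinstance(meta, dict):
        # A single dictionary is applied to every produced document.
        return [dict(meta) for _ in range(n_sources)]
    if len(meta) != n_sources:
        # A list is zipped with `sources`, so the lengths must match.
        raise ValueError("meta list length must match the number of sources")
    return [dict(m) for m in meta]

sources = ["a.csv", "b.csv"]
print(normalize_meta({"batch": 1}, len(sources)))
# [{'batch': 1}, {'batch': 1}]
print(normalize_meta([{"page": 1}, {"page": 2}], len(sources)))
# [{'page': 1}, {'page': 2}]
```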
### Usage example

```python
from datetime import datetime

from haystack.components.converters.csv import CSVToDocument

converter = CSVToDocument()
results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'col1,col2\nrow1,row1\nrow2,row2\n'
```

<a id="csv.CSVToDocument.__init__"></a>

#### CSVToDocument.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8",
             store_full_path: bool = False,
             *,
             conversion_mode: Literal["file", "row"] = "file",
             delimiter: str = ",",
             quotechar: str = '"')
```

Creates a CSVToDocument component.

**Arguments**:

- `encoding`: The encoding of the CSV files to convert.
If the encoding is specified in the metadata of a source ByteStream,
it overrides this value.
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.
- `conversion_mode`: The conversion mode to use:
  - "file" (default): one Document per CSV file whose content is the raw CSV text.
  - "row": convert each CSV row to its own Document (requires `content_column` in `run()`).
- `delimiter`: CSV delimiter used when parsing in row mode (passed to `csv.DictReader`).
- `quotechar`: CSV quote character used when parsing in row mode (passed to `csv.DictReader`).

<a id="csv.CSVToDocument.run"></a>

#### CSVToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        *,
        content_column: str | None = None,
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts CSV files to a Document (file mode) or to one Document per row (row mode).

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `content_column`: **Required when** `conversion_mode="row"`.
The column name whose values become `Document.content` for each row.
The column must exist in the CSV header.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created documents

<a id="docx"></a>

## Module docx

<a id="docx.DOCXMetadata"></a>

### DOCXMetadata

Describes the metadata of a DOCX file.

**Arguments**:

- `author`: The author
- `category`: The category
- `comments`: The comments
- `content_status`: The content status
- `created`: The creation date (ISO-formatted string)
- `identifier`: The identifier
- `keywords`: Available keywords
- `language`: The language of the document
- `last_modified_by`: User who last modified the document
- `last_printed`: The last printed date (ISO-formatted string)
- `modified`: The last modification date (ISO-formatted string)
- `revision`: The revision number
- `subject`: The subject
- `title`: The title
- `version`: The version

<a id="docx.DOCXTableFormat"></a>

### DOCXTableFormat

Supported formats for storing DOCX tabular data in a Document.

<a id="docx.DOCXTableFormat.from_str"></a>

#### DOCXTableFormat.from\_str

```python
@staticmethod
def from_str(string: str) -> "DOCXTableFormat"
```

Convert a string to a DOCXTableFormat enum.
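String-to-enum helpers like `from_str` typically follow a standard lookup-by-value pattern. The sketch below is a generic illustration using a hypothetical `TableFormat` enum, not Haystack's actual implementation:

```python
from enum import Enum

class TableFormat(Enum):  # hypothetical stand-in for DOCXTableFormat
    MARKDOWN = "markdown"
    CSV = "csv"

    @staticmethod
    def from_str(string: str) -> "TableFormat":
        # Look up the enum member by its value, case-insensitively.
        try:
            return TableFormat(string.lower())
        except ValueError:
            allowed = [f.value for f in TableFormat]
            raise ValueError(f"Unknown table format '{string}', expected one of {allowed}")

print(TableFormat.from_str("CSV"))
# TableFormat.CSV
```

Accepting either the enum member or its string form (as `DOCXToDocument.__init__` does) then reduces to calling `from_str` only when a plain string is passed.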
<a id="docx.DOCXLinkFormat"></a>

### DOCXLinkFormat

Supported formats for storing DOCX link information in a Document.

<a id="docx.DOCXLinkFormat.from_str"></a>

#### DOCXLinkFormat.from\_str

```python
@staticmethod
def from_str(string: str) -> "DOCXLinkFormat"
```

Convert a string to a DOCXLinkFormat enum.

<a id="docx.DOCXToDocument"></a>

### DOCXToDocument

Converts DOCX files to Documents.

Uses the `python-docx` library to convert the DOCX file to a document.
This component does not preserve page breaks in the original document.

Usage example:
```python
from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'
```

<a id="docx.DOCXToDocument.__init__"></a>

#### DOCXToDocument.\_\_init\_\_

```python
def __init__(table_format: str | DOCXTableFormat = DOCXTableFormat.CSV,
             link_format: str | DOCXLinkFormat = DOCXLinkFormat.NONE,
             store_full_path: bool = False)
```

Create a DOCXToDocument component.

**Arguments**:

- `table_format`: The format for table output. Can be either DOCXTableFormat.MARKDOWN,
DOCXTableFormat.CSV, "markdown", or "csv".
- `link_format`: The format for link output. Can be either:
DOCXLinkFormat.MARKDOWN or "markdown" to get `[text](address)`,
DOCXLinkFormat.PLAIN or "plain" to get `text (address)`,
DOCXLinkFormat.NONE or "none" to get the text without links.
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.
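The three `link_format` options map a link's text and address as listed above. `render_link` here is a hypothetical helper written only to illustrate the three output styles:

```python
def render_link(text: str, address: str, link_format: str) -> str:
    """Illustrates the three DOCXLinkFormat output styles (hypothetical helper)."""
    if link_format == "markdown":
        return f"[{text}]({address})"      # [text](address)
    if link_format == "plain":
        return f"{text} ({address})"       # text (address)
    if link_format == "none":
        return text                        # text only, link dropped
    raise ValueError(f"Unknown link format: {link_format}")

print(render_link("Haystack", "https://haystack.deepset.ai", "markdown"))
# [Haystack](https://haystack.deepset.ai)
```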
<a id="docx.DOCXToDocument.to_dict"></a>

#### DOCXToDocument.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="docx.DOCXToDocument.from_dict"></a>

#### DOCXToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DOCXToDocument"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="docx.DOCXToDocument.run"></a>

#### DOCXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts DOCX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="html"></a>

## Module html

<a id="html.HTMLToDocument"></a>

### HTMLToDocument

Converts an HTML file to a Document.
Usage example:
```python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'
```

<a id="html.HTMLToDocument.__init__"></a>

#### HTMLToDocument.\_\_init\_\_

```python
def __init__(extraction_kwargs: dict[str, Any] | None = None,
             store_full_path: bool = False)
```

Create an HTMLToDocument component.

**Arguments**:

- `extraction_kwargs`: A dictionary containing keyword arguments to customize the extraction process. These
are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see
the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.

<a id="html.HTMLToDocument.to_dict"></a>

#### HTMLToDocument.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="html.HTMLToDocument.from_dict"></a>

#### HTMLToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HTMLToDocument"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.
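The converters in this reference share the same serialization contract: `to_dict` records the component's init parameters and `from_dict` rebuilds an equivalent instance from them. A minimal sketch of that round-trip, using a hypothetical component class rather than Haystack internals:

```python
from typing import Any

class MiniConverter:  # hypothetical stand-in for a Haystack component
    def __init__(self, store_full_path: bool = False):
        self.store_full_path = store_full_path

    def to_dict(self) -> dict[str, Any]:
        # Record the import path and the init parameters needed to rebuild the object.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"store_full_path": self.store_full_path},
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "MiniConverter":
        # Rebuild the component by re-invoking __init__ with the stored parameters.
        return cls(**data["init_parameters"])

restored = MiniConverter.from_dict(MiniConverter(store_full_path=True).to_dict())
print(restored.store_full_path)
# True
```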
<a id="html.HTMLToDocument.run"></a>

#### HTMLToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None,
        extraction_kwargs: dict[str, Any] | None = None)
```

Converts a list of HTML files to Documents.

**Arguments**:

- `sources`: List of HTML file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
- `extraction_kwargs`: Additional keyword arguments to customize the extraction process.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="json"></a>

## Module json

<a id="json.JSONConverter"></a>

### JSONConverter

Converts one or more JSON files into a text document.
### Usage examples

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'
```

Optionally, you can also provide a `jq_schema` string to filter the JSON source files and `extra_meta_fields`
to extract from the filtered data:

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname':
# 'Rita', 'surname': 'Levi-Montalcini'}
```

<a id="json.JSONConverter.__init__"></a>

#### JSONConverter.\_\_init\_\_

```python
def __init__(jq_schema: str | None = None,
             content_key: str | None = None,
             extra_meta_fields: set[str] | Literal["*"] | None = None,
             store_full_path: bool = False)
```

Creates a JSONConverter component.

An optional `jq_schema` can be provided to extract nested data in the JSON source files.
See the [official jq documentation](https://jqlang.github.io/jq/) for more info on the filter syntax.
If `jq_schema` is not set, whole JSON source files will be used to extract content.

Optionally, you can provide a `content_key` to specify which key in the extracted object must
be set as the document's content.

If both `jq_schema` and `content_key` are set, the component will search for the `content_key` in
the JSON object extracted by `jq_schema`. If the extracted data is not a JSON object, it will be skipped.

If only `jq_schema` is set, the extracted data must be a scalar value. If it's a JSON object or array,
it will be skipped.

If only `content_key` is set, the source JSON file must be a JSON object, else it will be skipped.

`extra_meta_fields` can either be set to a set of strings or a literal `"*"` string.
If it's a set of strings, it must specify fields in the extracted objects that must be set in
the extracted documents. If a field is not found, the meta value will be `None`.
If set to `"*"`, all fields other than `content_key` found in the filtered JSON object will
be saved as metadata.

Initialization will fail if neither `jq_schema` nor `content_key` is set.

**Arguments**:

- `jq_schema`: Optional jq filter string to extract content.
If not specified, the whole JSON object will be used to extract information.
- `content_key`: Optional key to extract document content.
If `jq_schema` is specified, the `content_key` will be extracted from that object.
- `extra_meta_fields`: An optional set of meta keys to extract from the content.
If `jq_schema` is specified, all keys will be extracted from that object.
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.

<a id="json.JSONConverter.to_dict"></a>

#### JSONConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="json.JSONConverter.from_dict"></a>

#### JSONConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "JSONConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="json.JSONConverter.run"></a>

#### JSONConverter.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts a list of JSON files to documents.

**Arguments**:

- `sources`: A list of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of created documents.
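What the `jq_schema=".laureates[]"` / `content_key` / `extra_meta_fields` combination produces can be approximated with the standard `json` module. This is a simplified stand-in for the jq-based extraction, not the converter's actual code:

```python
import json

raw = json.dumps({
    "laureates": [
        {"firstname": "Enrico", "surname": "Fermi", "motivation": "slow neutrons"},
        {"firstname": "Rita", "surname": "Levi-Montalcini",
         "motivation": "for their discoveries of growth factors"},
    ]
})

# Roughly what jq_schema=".laureates[]" yields: one object per array element.
objects = json.loads(raw)["laureates"]

# content_key="motivation" selects the content; extra_meta_fields={"firstname",
# "surname"} copies those keys into meta (None if a key is missing).
docs = [
    {"content": obj["motivation"],
     "meta": {k: obj.get(k) for k in ("firstname", "surname")}}
    for obj in objects
]
print(docs[1]["content"])
# for their discoveries of growth factors
```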
<a id="markdown"></a>

## Module markdown

<a id="markdown.MarkdownToDocument"></a>

### MarkdownToDocument

Converts a Markdown file into a text Document.

Usage example:
```python
from datetime import datetime

from haystack.components.converters import MarkdownToDocument

converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'
```

<a id="markdown.MarkdownToDocument.__init__"></a>

#### MarkdownToDocument.\_\_init\_\_

```python
def __init__(table_to_single_line: bool = False,
             progress_bar: bool = True,
             store_full_path: bool = False)
```

Create a MarkdownToDocument component.

**Arguments**:

- `table_to_single_line`: If `True`, converts table contents into a single line.
- `progress_bar`: If `True`, shows a progress bar when running.
- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.

<a id="markdown.MarkdownToDocument.run"></a>

#### MarkdownToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts a list of Markdown files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: List of created Documents

<a id="msg"></a>

## Module msg

<a id="msg.MSGToDocument"></a>

### MSGToDocument

Converts Microsoft Outlook .msg files into Haystack Documents.

This component extracts email metadata (such as sender, recipients, CC, BCC, subject) and body content from .msg
files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg
file are extracted as ByteStream objects.

### Example Usage

```python
from datetime import datetime

from haystack.components.converters.msg import MSGToDocument

converter = MSGToDocument()
results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
attachments = results["attachments"]
print(documents[0].content)
```

<a id="msg.MSGToDocument.__init__"></a>

#### MSGToDocument.\_\_init\_\_

```python
def __init__(store_full_path: bool = False) -> None
```

Creates an MSGToDocument component.

**Arguments**:

- `store_full_path`: If `True`, the full path of the file is stored in the metadata of the document.
If `False`, only the file name is stored.

<a id="msg.MSGToDocument.run"></a>

#### MSGToDocument.run

```python
@component.output_types(documents=list[Document], attachments=list[ByteStream])
def run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document] | list[ByteStream]]
```

Converts MSG files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents.
- `attachments`: Created ByteStream objects from file attachments.

<a id="multi_file_converter"></a>

## Module multi\_file\_converter

<a id="multi_file_converter.MultiFileConverter"></a>

### MultiFileConverter

A file converter that handles conversion of multiple file types.

The MultiFileConverter handles the following file types:
- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX

Usage example:
```python
from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
```

<a id="multi_file_converter.MultiFileConverter.__init__"></a>

#### MultiFileConverter.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8",
             json_content_key: str = "content") -> None
```

Initialize the MultiFileConverter.

**Arguments**:

- `encoding`: The encoding to use when reading files.
- `json_content_key`: The key to use for the content field of a document when converting JSON files.

<a id="openapi_functions"></a>

## Module openapi\_functions

<a id="openapi_functions.OpenAPIServiceToFunctions"></a>

### OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.
The definition must respect OpenAPI specification 3.0.0 or higher.
It can be specified in JSON or YAML format.
Each function must have:
- a unique operationId
- a description
- a requestBody and/or parameters
- a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the [official documentation](https://github.com/OAI/OpenAPI-Specification).
For more details on OpenAI function calling, see the [official documentation](https://platform.openai.com/docs/guides/function-calling).

Usage example:
```python
from haystack.components.converters import OpenAPIServiceToFunctions

converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]
```

<a id="openapi_functions.OpenAPIServiceToFunctions.__init__"></a>

#### OpenAPIServiceToFunctions.\_\_init\_\_

```python
def __init__()
```

Create an OpenAPIServiceToFunctions component.

<a id="openapi_functions.OpenAPIServiceToFunctions.run"></a>

#### OpenAPIServiceToFunctions.run

```python
@component.output_types(functions=list[dict[str, Any]],
                        openapi_specs=list[dict[str, Any]])
def run(sources: list[str | Path | ByteStream]) -> dict[str, Any]
```

Converts OpenAPI definitions into OpenAI function calling format.

**Arguments**:

- `sources`: File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

**Raises**:

- `RuntimeError`: If the OpenAPI definitions cannot be downloaded or processed.
- `ValueError`: If the source type is not recognized or no functions are found in the OpenAPI definitions.
**Returns**:

A dictionary with the following keys:
- `functions`: Function definitions in JSON object format
- `openapi_specs`: OpenAPI specs in JSON/YAML object format with resolved references

<a id="output_adapter"></a>

## Module output\_adapter

<a id="output_adapter.OutputAdaptationException"></a>

### OutputAdaptationException

Exception raised when there is an error during output adaptation.

<a id="output_adapter.OutputAdapter"></a>

### OutputAdapter

Adapts the output of a component using Jinja templates.

Usage example:
```python
from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"
```

<a id="output_adapter.OutputAdapter.__init__"></a>

#### OutputAdapter.\_\_init\_\_

```python
def __init__(template: str,
             output_type: TypeAlias,
             custom_filters: dict[str, Callable] | None = None,
             unsafe: bool = False) -> None
```

Create an OutputAdapter component.

**Arguments**:

- `template`: A Jinja template that defines how to adapt the input data.
The variables in the template define the input of this instance.
For example, with this template:
```
{{ documents[0].content }}
```
the component input will be `documents`.
- `output_type`: The type of output this instance will return.
- `custom_filters`: A dictionary of custom Jinja filters used in the template.
- `unsafe`: Enable execution of arbitrary code in the Jinja template.
This should only be used if you trust the source of the template, as it can lead to remote code execution.
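The core idea, rendering `{{ ... }}` expressions against the keyword arguments passed to `run`, can be sketched without Jinja. This toy `render` function evaluates each expression with `eval`, which also illustrates why templates from untrusted sources are dangerous (the concern behind the `unsafe` flag); it is not how OutputAdapter is actually implemented:

```python
import re

def render(template: str, **kwargs) -> str:
    """Toy stand-in for Jinja rendering: evaluate each {{ ... }} expression
    against the provided keyword arguments. Evaluating arbitrary expressions
    is exactly why an untrusted template is a code-execution risk."""
    def substitute(match):
        expr = match.group(1).strip()
        # Empty builtins limit (but do not eliminate) what the expression can reach.
        return str(eval(expr, {"__builtins__": {}}, kwargs))
    return re.sub(r"\{\{(.*?)\}\}", substitute, template)

documents = [{"content": "Test content"}]
print(render("{{ documents[0]['content'] }}", documents=documents))
# Test content
```

Real Jinja additionally supports attribute access (`documents[0].content`), filters, and sandboxing, which this sketch omits.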
<a id="output_adapter.OutputAdapter.run"></a>

#### OutputAdapter.run

```python
def run(**kwargs)
```

Renders the Jinja template with the provided inputs.

**Arguments**:

- `kwargs`: Must contain all variables used in the `template` string.

**Raises**:

- `OutputAdaptationException`: If template rendering fails.

**Returns**:

A dictionary with the following keys:
- `output`: Rendered Jinja template.

<a id="output_adapter.OutputAdapter.to_dict"></a>

#### OutputAdapter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="output_adapter.OutputAdapter.from_dict"></a>

#### OutputAdapter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "OutputAdapter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="pdfminer"></a>

## Module pdfminer

<a id="pdfminer.CID_PATTERN"></a>

#### CID\_PATTERN

Regex pattern to detect CID characters.

<a id="pdfminer.PDFMinerToDocument"></a>

### PDFMinerToDocument

Converts PDF files to Documents.

Uses `pdfminer`-compatible converters to convert PDF files to Documents.
https://pdfminersix.readthedocs.io/en/latest/

Usage example:
```python
from datetime import datetime

from haystack.components.converters.pdfminer import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="pdfminer.PDFMinerToDocument.__init__"></a>

#### PDFMinerToDocument.\_\_init\_\_

```python
def __init__(line_overlap: float = 0.5,
             char_margin: float = 2.0,
             line_margin: float = 0.5,
             word_margin: float = 0.1,
             boxes_flow: float | None = 0.5,
             detect_vertical: bool = True,
             all_texts: bool = False,
             store_full_path: bool = False) -> None
```

Create a PDFMinerToDocument component.

**Arguments**:

- `line_overlap`: Determines whether two characters are considered to be on
the same line based on the amount of overlap between them.
The overlap is calculated relative to the minimum height of both characters.
- `char_margin`: Determines whether two characters are part of the same line based on the distance between them.
If the distance is less than the margin specified, the characters are considered to be on the same line.
The margin is calculated relative to the width of the character.
- `word_margin`: Determines whether two characters on the same line are part of the same word
based on the distance between them. If the distance is greater than the margin specified,
an intermediate space will be added between them to make the text more readable.
The margin is calculated relative to the width of the character.
- `line_margin`: Determines whether two lines are part of the same paragraph based on
the distance between them.
If the distance is less than the margin specified,
the lines are considered to be part of the same paragraph.
The margin is calculated relative to the height of a line.
- `boxes_flow`: Determines the importance of horizontal and vertical position when
determining the order of text boxes. A value between -1.0 and +1.0 can be set,
with -1.0 indicating that only horizontal position matters and +1.0 indicating
that only vertical position matters. Setting the value to `None` disables advanced
layout analysis, and text boxes are ordered based on the position of their bottom-left corner.
- `detect_vertical`: Determines whether vertical text should be considered during layout analysis.
- `all_texts`: Whether layout analysis should be performed on text in figures.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pdfminer.PDFMinerToDocument.detect_undecoded_cid_characters"></a>

#### PDFMinerToDocument.detect\_undecoded\_cid\_characters

```python
def detect_undecoded_cid_characters(text: str) -> dict[str, Any]
```

Look for CID character sequences, that is, characters that haven't been properly decoded from their CID format.

This is useful to detect when the text extractor is not able to extract the text correctly, for example, if the
PDF uses non-standard fonts.

A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like
searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor
needs. If that map is not available, the text extractor cannot decode the CID characters and returns them
as is.
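Undecoded CID sequences show up in extracted text as literal markers such as `(cid:71)`. A rough detector can be sketched as follows; the regex and the returned keys are assumptions for illustration, not necessarily the module's actual `CID_PATTERN` or result shape:

```python
import re

# Assumed pattern for undecoded CID sequences such as "(cid:123)";
# the module's actual CID_PATTERN may differ.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def detect_cids(text: str) -> dict:
    # Count literal "(cid:N)" markers left behind when no ToUnicode
    # map was available to decode the character codes.
    matches = CID_PATTERN.findall(text)
    return {"detected": bool(matches), "cid_count": len(matches)}


print(detect_cids("Hello (cid:71)(cid:72) world"))
# {'detected': True, 'cid_count': 2}
```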
See: https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output

**Arguments**:

- `text`: The text to check for undecoded CID characters.

**Returns**:

A dictionary containing detection results.

<a id="pdfminer.PDFMinerToDocument.run"></a>

#### PDFMinerToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts PDF files to Documents.

**Arguments**:

- `sources`: List of PDF file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="pptx"></a>

## Module pptx

<a id="pptx.PPTXToDocument"></a>

### PPTXToDocument

Converts PPTX files to Documents.

Usage example:
```python
from datetime import datetime

from haystack.components.converters.pptx import PPTXToDocument

converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'
```

<a id="pptx.PPTXToDocument.__init__"></a>

#### PPTXToDocument.\_\_init\_\_

```python
def __init__(store_full_path: bool = False)
```

Create a PPTXToDocument component.

**Arguments**:

- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pptx.PPTXToDocument.run"></a>

#### PPTXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts PPTX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="pypdf"></a>

## Module pypdf

<a id="pypdf.PyPDFExtractionMode"></a>

### PyPDFExtractionMode

The mode to use for extracting text from a PDF.

<a id="pypdf.PyPDFExtractionMode.__str__"></a>

#### PyPDFExtractionMode.\_\_str\_\_

```python
def __str__() -> str
```

Convert a PyPDFExtractionMode enum to a string.
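The `__str__` and `from_str` pair behaves like a standard string-convertible enum. A rough sketch of the likely shape follows; the member names `PLAIN` and `LAYOUT` are taken from this page, but the values and error handling are assumptions, not the actual source:

```python
from enum import Enum


class PyPDFExtractionMode(Enum):
    # Member values assumed for illustration; only PLAIN and LAYOUT
    # are mentioned on this page.
    PLAIN = "plain"
    LAYOUT = "layout"

    def __str__(self) -> str:
        # Convert the enum member to its string value.
        return self.value

    @staticmethod
    def from_str(string: str) -> "PyPDFExtractionMode":
        # Convert a string back to the matching enum member.
        try:
            return PyPDFExtractionMode(string.lower())
        except ValueError as e:
            raise ValueError(f"Unknown extraction mode: {string}") from e


# Round trip: enum -> string -> enum
assert PyPDFExtractionMode.from_str(str(PyPDFExtractionMode.PLAIN)) is PyPDFExtractionMode.PLAIN
```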
<a id="pypdf.PyPDFExtractionMode.from_str"></a>

#### PyPDFExtractionMode.from\_str

```python
@staticmethod
def from_str(string: str) -> "PyPDFExtractionMode"
```

Convert a string to a PyPDFExtractionMode enum.

<a id="pypdf.PyPDFToDocument"></a>

### PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses the PyPDF library.
You can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="pypdf.PyPDFToDocument.__init__"></a>

#### PyPDFToDocument.\_\_init\_\_

```python
def __init__(*,
             extraction_mode: str
             | PyPDFExtractionMode = PyPDFExtractionMode.PLAIN,
             plain_mode_orientations: tuple = (0, 90, 180, 270),
             plain_mode_space_width: float = 200.0,
             layout_mode_space_vertically: bool = True,
             layout_mode_scale_weight: float = 1.25,
             layout_mode_strip_rotated: bool = True,
             layout_mode_font_height_weight: float = 1.0,
             store_full_path: bool = False)
```

Create a PyPDFToDocument component.

**Arguments**:

- `extraction_mode`: The mode to use for extracting text from a PDF.
Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
- `plain_mode_orientations`: Tuple of orientations to look for when extracting text from a PDF in plain mode.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- `plain_mode_space_width`: Forces default space width if not extracted from font.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- `layout_mode_space_vertically`: Whether to include blank lines inferred from y distance + font height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_scale_weight`: Multiplier for string length when calculating weighted average character width.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_strip_rotated`: Layout mode does not support rotated text. Set to `False` to include rotated text anyway.
If rotated text is discovered, layout will be degraded and a warning will be logged.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_font_height_weight`: Multiplier for font height when calculating blank line height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pypdf.PyPDFToDocument.to_dict"></a>

#### PyPDFToDocument.to\_dict

```python
def to_dict()
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="pypdf.PyPDFToDocument.from_dict"></a>

#### PyPDFToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data)
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

Deserialized component.

<a id="pypdf.PyPDFToDocument.run"></a>

#### PyPDFToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts PDF files to documents.
**Arguments**:

- `sources`: List of file paths or ByteStream objects to convert.
- `meta`: Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they are zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of converted documents.

<a id="tika"></a>

## Module tika

<a id="tika.XHTMLParser"></a>

### XHTMLParser

Custom parser to extract pages from Tika XHTML content.

<a id="tika.XHTMLParser.handle_starttag"></a>

#### XHTMLParser.handle\_starttag

```python
def handle_starttag(tag: str, attrs: list[tuple])
```

Identify the start of a page div.

<a id="tika.XHTMLParser.handle_endtag"></a>

#### XHTMLParser.handle\_endtag

```python
def handle_endtag(tag: str)
```

Identify the end of a page div.

<a id="tika.XHTMLParser.handle_data"></a>

#### XHTMLParser.handle\_data

```python
def handle_data(data: str)
```

Populate the page content.

<a id="tika.TikaDocumentConverter"></a>

### TikaDocumentConverter

Converts files of different types to Documents.

This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
requires a running Tika server.
For more options on running Tika,
see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).
Usage example:
```python
from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'
```

<a id="tika.TikaDocumentConverter.__init__"></a>

#### TikaDocumentConverter.\_\_init\_\_

```python
def __init__(tika_url: str = "http://localhost:9998/tika",
             store_full_path: bool = False)
```

Create a TikaDocumentConverter component.

**Arguments**:

- `tika_url`: Tika server URL.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="tika.TikaDocumentConverter.run"></a>

#### TikaDocumentConverter.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="txt"></a>

## Module txt

<a id="txt.TextFileToDocument"></a>

### TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'
```

<a id="txt.TextFileToDocument.__init__"></a>

#### TextFileToDocument.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8", store_full_path: bool = False)
```

Creates a TextFileToDocument component.

**Arguments**:

- `encoding`: The encoding of the text files to convert.
If the encoding is specified in the metadata of a source ByteStream,
it overrides this value.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="txt.TextFileToDocument.run"></a>

#### TextFileToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[str | Path | ByteStream],
        meta: dict[str, Any] | list[dict[str, Any]] | None = None)
```

Converts text files to documents.

**Arguments**:

- `sources`: List of text file paths or ByteStream objects to convert.
- `meta`: Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources as they're zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of converted documents.

<a id="xlsx"></a>

## Module xlsx

<a id="xlsx.XLSXToDocument"></a>

### XLSXToDocument

Converts XLSX (Excel) files into Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is
created for each sheet. The content of the Document is the table, which can be saved in CSV or Markdown format.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.xlsx import XLSXToDocument

converter = XLSXToDocument()
results = converter.run(sources=["sample.xlsx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# ",A,B
# 1,col_a,col_b
# 2,1.5,test
# "
```

<a id="xlsx.XLSXToDocument.__init__"></a>

#### XLSXToDocument.\_\_init\_\_

```python
def __init__(table_format: Literal["csv", "markdown"] = "csv",
             sheet_name: str | int | list[str | int] | None = None,
             read_excel_kwargs: dict[str, Any] | None = None,
             table_format_kwargs: dict[str, Any] | None = None,
             *,
             store_full_path: bool = False)
```

Creates an XLSXToDocument component.

**Arguments**:

- `table_format`: The format to convert the Excel file to.
- `sheet_name`: The name of the sheet to read. If None, all sheets are read.
- `read_excel_kwargs`: Additional arguments to pass to `pandas.read_excel`.
See https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel
- `table_format_kwargs`: Additional keyword arguments to pass to the table format function.
  - If `table_format` is "csv", these arguments are passed to `pandas.DataFrame.to_csv`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
  - If `table_format` is "markdown", these arguments are passed to `pandas.DataFrame.to_markdown`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="xlsx.XLSXToDocument.run"></a>

#### XLSXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document]]
```

Converts XLSX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created documents
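The `meta` broadcasting and zipping rules that every `run` method above documents can be sketched in isolation. This is a plain-Python illustration of the described behavior under stated assumptions, not the library's actual helper:

```python
def normalize_metadata(meta, sources_count: int) -> list[dict]:
    # A single dict is broadcast: every source gets a copy of it.
    # A list must line up one-to-one with the sources so the two
    # can be zipped; a mismatch is an error.
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("Length of `meta` must match the number of sources.")
    return meta


print(normalize_metadata({"lang": "en"}, 2))
# [{'lang': 'en'}, {'lang': 'en'}]
```

A list input such as `[{"page": 1}, {"page": 2}]` with two sources passes through unchanged, giving each resulting Document its own metadata entry.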