converters_api.md
---
title: "Converters"
id: converters-api
description: "Various converters to transform data from one format to another."
slug: "/converters-api"
---

<a id="azure"></a>

## Module azure

<a id="azure.AzureOCRDocumentConverter"></a>

### AzureOCRDocumentConverter

Converts files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see the
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

```python
from datetime import datetime

from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(endpoint="<url>", api_key=Secret.from_token("<your-api-key>"))
results = converter.run(sources=["path/to/doc_with_images.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="azure.AzureOCRDocumentConverter.__init__"></a>

#### AzureOCRDocumentConverter.\_\_init\_\_

```python
def __init__(endpoint: str,
             api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
             model_id: str = "prebuilt-read",
             preceding_context_len: int = 3,
             following_context_len: int = 3,
             merge_multiple_column_headers: bool = True,
             page_layout: Literal["natural", "single_column"] = "natural",
             threshold_y: Optional[float] = 0.05,
             store_full_path: bool = False)
```

Creates an AzureOCRDocumentConverter component.

**Arguments**:

- `endpoint`: The endpoint of your Azure resource.
- `api_key`: The API key of your Azure resource.
- `model_id`: The ID of the model you want to use. For a list of available models, see the
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
- `preceding_context_len`: Number of lines before a table to include as preceding context
(this will be added to the metadata).
- `following_context_len`: Number of lines after a table to include as subsequent context
(this will be added to the metadata).
- `merge_multiple_column_headers`: If `True`, merges multiple column header rows into a single row.
- `page_layout`: The type of reading order to follow. Possible options:
  - `natural`: Uses the natural reading order determined by Azure.
  - `single_column`: Groups all lines with the same height on the page based on a threshold
  determined by `threshold_y`.
- `threshold_y`: Only relevant if `page_layout` is set to `single_column`.
The threshold, in inches, that determines whether two recognized PDF elements are grouped into a
single line. This is crucial for section headers or numbers, which may be spatially separated
from the remaining text on the horizontal axis.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="azure.AzureOCRDocumentConverter.run"></a>

#### AzureOCRDocumentConverter.run

```python
@component.output_types(documents=list[Document],
                        raw_azure_response=list[dict])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will be
zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: List of created Documents
- `raw_azure_response`: List of raw Azure responses used to create the Documents

<a id="azure.AzureOCRDocumentConverter.to_dict"></a>

#### AzureOCRDocumentConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="azure.AzureOCRDocumentConverter.from_dict"></a>

#### AzureOCRDocumentConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AzureOCRDocumentConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="csv"></a>

## Module csv

<a id="csv.CSVToDocument"></a>

### CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.csv import CSVToDocument

converter = CSVToDocument()
results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'col1,col2\nrow1,row1\nrow2,row2\n'
```

<a id="csv.CSVToDocument.__init__"></a>

#### CSVToDocument.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8",
             store_full_path: bool = False,
             *,
             conversion_mode: Literal["file", "row"] = "file",
             delimiter: str = ",",
             quotechar: str = '"')
```

Creates a CSVToDocument component.

**Arguments**:

- `encoding`: The encoding of the CSV files to convert.
If the encoding is specified in the metadata of a source ByteStream,
it overrides this value.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.
- `conversion_mode`: The conversion mode. Possible options:
  - `"file"` (default): One Document per CSV file, whose content is the raw CSV text.
  - `"row"`: One Document per CSV row (requires `content_column` in `run()`).
- `delimiter`: CSV delimiter used when parsing in row mode (passed to `csv.DictReader`).
- `quotechar`: CSV quote character used when parsing in row mode (passed to `csv.DictReader`).

<a id="csv.CSVToDocument.run"></a>

#### CSVToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        *,
        content_column: Optional[str] = None,
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts CSV files to a Document (file mode) or to one Document per row (row mode).

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `content_column`: **Required when** `conversion_mode="row"`.
The column name whose values become `Document.content` for each row.
The column must exist in the CSV header.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created documents

<a id="docx"></a>

## Module docx

<a id="docx.DOCXMetadata"></a>

### DOCXMetadata

Describes the metadata of a DOCX file.

**Arguments**:

- `author`: The author
- `category`: The category
- `comments`: The comments
- `content_status`: The content status
- `created`: The creation date (ISO formatted string)
- `identifier`: The identifier
- `keywords`: Available keywords
- `language`: The language of the document
- `last_modified_by`: User who last modified the document
- `last_printed`: The last printed date (ISO formatted string)
- `modified`: The last modification date (ISO formatted string)
- `revision`: The revision number
- `subject`: The subject
- `title`: The title
- `version`: The version

<a id="docx.DOCXTableFormat"></a>

### DOCXTableFormat

Supported formats for storing DOCX tabular data in a Document.

<a id="docx.DOCXTableFormat.from_str"></a>

#### DOCXTableFormat.from\_str

```python
@staticmethod
def from_str(string: str) -> "DOCXTableFormat"
```

Convert a string to a DOCXTableFormat enum.
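
The `from_str` helper follows a common Python enum idiom: match the input string against each member's value and raise a helpful error for unknown inputs. A minimal sketch of that pattern with a hypothetical `TableFormat` enum (not the actual Haystack class):

```python
from enum import Enum


class TableFormat(Enum):
    """Hypothetical enum illustrating the from_str pattern used by DOCXTableFormat."""

    MARKDOWN = "markdown"
    CSV = "csv"

    @staticmethod
    def from_str(string: str) -> "TableFormat":
        # Compare the lowercased input against each member's string value.
        for member in TableFormat:
            if member.value == string.lower():
                return member
        raise ValueError(f"Unknown table format '{string}'")


print(TableFormat.from_str("CSV"))  # TableFormat.CSV
```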

<a id="docx.DOCXLinkFormat"></a>

### DOCXLinkFormat

Supported formats for storing DOCX link information in a Document.

<a id="docx.DOCXLinkFormat.from_str"></a>

#### DOCXLinkFormat.from\_str

```python
@staticmethod
def from_str(string: str) -> "DOCXLinkFormat"
```

Convert a string to a DOCXLinkFormat enum.

<a id="docx.DOCXToDocument"></a>

### DOCXToDocument

Converts DOCX files to Documents.

Uses the `python-docx` library to convert the DOCX file to a document.
This component does not preserve page breaks in the original document.

Usage example:
```python
from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'
```

<a id="docx.DOCXToDocument.__init__"></a>

#### DOCXToDocument.\_\_init\_\_

```python
def __init__(table_format: Union[str, DOCXTableFormat] = DOCXTableFormat.CSV,
             link_format: Union[str, DOCXLinkFormat] = DOCXLinkFormat.NONE,
             store_full_path: bool = False)
```

Create a DOCXToDocument component.

**Arguments**:

- `table_format`: The format for table output. Can be either DOCXTableFormat.MARKDOWN,
DOCXTableFormat.CSV, "markdown", or "csv".
- `link_format`: The format for link output. Can be either:
DOCXLinkFormat.MARKDOWN or "markdown" to get `[text](address)`,
DOCXLinkFormat.PLAIN or "plain" to get `text (address)`,
DOCXLinkFormat.NONE or "none" to get text without links.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="docx.DOCXToDocument.to_dict"></a>

#### DOCXToDocument.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="docx.DOCXToDocument.from_dict"></a>

#### DOCXToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DOCXToDocument"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="docx.DOCXToDocument.run"></a>

#### DOCXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts DOCX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="html"></a>

## Module html

<a id="html.HTMLToDocument"></a>

### HTMLToDocument

Converts an HTML file to a Document.

Usage example:
```python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'
```

<a id="html.HTMLToDocument.__init__"></a>

#### HTMLToDocument.\_\_init\_\_

```python
def __init__(extraction_kwargs: Optional[dict[str, Any]] = None,
             store_full_path: bool = False)
```

Create an HTMLToDocument component.

**Arguments**:

- `extraction_kwargs`: A dictionary containing keyword arguments to customize the extraction process. These
are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see
the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="html.HTMLToDocument.to_dict"></a>

#### HTMLToDocument.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="html.HTMLToDocument.from_dict"></a>

#### HTMLToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HTMLToDocument"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.
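
The `to_dict`/`from_dict` pair above follows the same serialization convention used across these components: `to_dict` records the init parameters, and `from_dict` reconstructs an equivalent instance from them. A minimal sketch of that round-trip idea with a hypothetical plain Python class (not Haystack's actual implementation):

```python
from typing import Any, Optional


class ToyConverter:
    """Hypothetical component illustrating the to_dict/from_dict round trip."""

    def __init__(self, extraction_kwargs: Optional[dict[str, Any]] = None, store_full_path: bool = False):
        self.extraction_kwargs = extraction_kwargs or {}
        self.store_full_path = store_full_path

    def to_dict(self) -> dict[str, Any]:
        # Record the init parameters so an equivalent instance can be rebuilt later.
        return {
            "init_parameters": {
                "extraction_kwargs": self.extraction_kwargs,
                "store_full_path": self.store_full_path,
            }
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ToyConverter":
        return cls(**data["init_parameters"])


original = ToyConverter(extraction_kwargs={"favor_precision": True})
restored = ToyConverter.from_dict(original.to_dict())
print(restored.extraction_kwargs)  # {'favor_precision': True}
```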

<a id="html.HTMLToDocument.run"></a>

#### HTMLToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None,
        extraction_kwargs: Optional[dict[str, Any]] = None)
```

Converts a list of HTML files to Documents.

**Arguments**:

- `sources`: List of HTML file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
- `extraction_kwargs`: Additional keyword arguments to customize the extraction process.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="json"></a>

## Module json

<a id="json.JSONConverter"></a>

### JSONConverter

Converts one or more JSON files into a text document.

### Usage examples

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'
```

Optionally, you can also provide a `jq_schema` string to filter the JSON source files and `extra_meta_fields`
to extract from the filtered data:

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```

<a id="json.JSONConverter.__init__"></a>

#### JSONConverter.\_\_init\_\_

```python
def __init__(jq_schema: Optional[str] = None,
             content_key: Optional[str] = None,
             extra_meta_fields: Optional[Union[set[str], Literal["*"]]] = None,
             store_full_path: bool = False)
```

Creates a JSONConverter component.

An optional `jq_schema` can be provided to extract nested data in the JSON source files.
See the [official jq documentation](https://jqlang.github.io/jq/) for more info on the filters syntax.
If `jq_schema` is not set, whole JSON source files will be used to extract content.

Optionally, you can provide a `content_key` to specify which key in the extracted object must
be set as the document's content.

If both `jq_schema` and `content_key` are set, the component will search for the `content_key` in
the JSON object extracted by `jq_schema`. If the extracted data is not a JSON object, it will be skipped.

If only `jq_schema` is set, the extracted data must be a scalar value. If it's a JSON object or array,
it will be skipped.

If only `content_key` is set, the source JSON file must be a JSON object, else it will be skipped.

`extra_meta_fields` can either be set to a set of strings or the literal string `"*"`.
If it's a set of strings, it must specify fields in the extracted objects that must be set in
the extracted documents. If a field is not found, the meta value will be `None`.
If set to `"*"`, all fields other than `content_key` found in the filtered JSON object will
be saved as metadata.

Initialization will fail if neither `jq_schema` nor `content_key` is set.

**Arguments**:

- `jq_schema`: Optional jq filter string to extract content.
If not specified, the whole JSON object will be used to extract information.
- `content_key`: Optional key to extract document content.
If `jq_schema` is specified, the `content_key` will be extracted from that object.
- `extra_meta_fields`: An optional set of meta keys to extract from the content.
If `jq_schema` is specified, all keys will be extracted from that object.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="json.JSONConverter.to_dict"></a>

#### JSONConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="json.JSONConverter.from_dict"></a>

#### JSONConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "JSONConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="json.JSONConverter.run"></a>

#### JSONConverter.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts a list of JSON files to documents.

**Arguments**:

- `sources`: A list of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of created documents.

<a id="markdown"></a>

## Module markdown

<a id="markdown.MarkdownToDocument"></a>

### MarkdownToDocument

Converts a Markdown file into a text Document.

Usage example:
```python
from datetime import datetime

from haystack.components.converters import MarkdownToDocument

converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'
```

<a id="markdown.MarkdownToDocument.__init__"></a>

#### MarkdownToDocument.\_\_init\_\_

```python
def __init__(table_to_single_line: bool = False,
             progress_bar: bool = True,
             store_full_path: bool = False)
```

Create a MarkdownToDocument component.

**Arguments**:

- `table_to_single_line`: If True, converts table contents into a single line.
- `progress_bar`: If True, shows a progress bar when running.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="markdown.MarkdownToDocument.run"></a>

#### MarkdownToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts a list of Markdown files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: List of created Documents

<a id="msg"></a>

## Module msg

<a id="msg.MSGToDocument"></a>

### MSGToDocument

Converts Microsoft Outlook .msg files into Haystack Documents.

This component extracts email metadata (such as sender, recipients, CC, BCC, and subject) and body content from .msg
files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg
file are extracted as ByteStream objects.

### Example Usage

```python
from datetime import datetime

from haystack.components.converters.msg import MSGToDocument

converter = MSGToDocument()
results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
attachments = results["attachments"]
print(documents[0].content)
```

<a id="msg.MSGToDocument.__init__"></a>

#### MSGToDocument.\_\_init\_\_

```python
def __init__(store_full_path: bool = False) -> None
```

Creates a MSGToDocument component.

**Arguments**:

- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.
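
The `meta` argument accepted by the `run` methods throughout this document follows one rule: a single dict is shared by every produced Document, while a list must line up one-to-one with `sources`. A simplified sketch of that normalization (a hypothetical helper, not Haystack's actual code):

```python
from typing import Any, Optional, Union


def normalize_meta(
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]], num_sources: int
) -> list[dict[str, Any]]:
    """Simplified sketch of the dict-or-list meta normalization, not Haystack's real helper."""
    if meta is None:
        return [{} for _ in range(num_sources)]
    if isinstance(meta, dict):
        # A single dict is copied onto every produced document.
        return [dict(meta) for _ in range(num_sources)]
    if len(meta) != num_sources:
        raise ValueError("The length of the metadata list must match the number of sources.")
    return meta


sources = ["a.msg", "b.msg"]
print(normalize_meta({"team": "sales"}, len(sources)))
# [{'team': 'sales'}, {'team': 'sales'}]
```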

<a id="msg.MSGToDocument.run"></a>

#### MSGToDocument.run

```python
@component.output_types(documents=list[Document], attachments=list[ByteStream])
def run(
    sources: list[Union[str, Path, ByteStream]],
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
) -> dict[str, Union[list[Document], list[ByteStream]]]
```

Converts MSG files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents.
- `attachments`: Created ByteStream objects from file attachments.

<a id="multi_file_converter"></a>

## Module multi\_file\_converter

<a id="multi_file_converter.MultiFileConverter"></a>

### MultiFileConverter

A file converter that handles conversion of multiple file types.

The MultiFileConverter handles the following file types:
- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX

Usage example:
```python
from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
```

<a id="multi_file_converter.MultiFileConverter.__init__"></a>

#### MultiFileConverter.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8",
             json_content_key: str = "content") -> None
```

Initialize the MultiFileConverter.

**Arguments**:

- `encoding`: The encoding to use when reading files.
- `json_content_key`: The key to use for the content field of a document when converting JSON files.

<a id="openapi_functions"></a>

## Module openapi\_functions

<a id="openapi_functions.OpenAPIServiceToFunctions"></a>

### OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must respect OpenAPI specification 3.0.0 or higher.
It can be specified in JSON or YAML format.
Each function must have:
- a unique operationId
- a description
- a requestBody and/or parameters
- a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the [official documentation](https://github.com/OAI/OpenAPI-Specification).
For more details on OpenAI function calling, see the [official documentation](https://platform.openai.com/docs/guides/function-calling).
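
Conceptually, each OpenAPI operation becomes one function definition: `operationId` supplies the function name, `description` carries over, and the parameter schemas become the function's JSON-schema parameters. A simplified sketch of that mapping (a hypothetical helper covering query parameters only, not the component's actual code):

```python
from typing import Any


def operation_to_function(operation: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical sketch: map one OpenAPI operation to an OpenAI-style function definition."""
    # Collect each parameter's JSON schema, keyed by parameter name.
    properties = {p["name"]: p.get("schema", {}) for p in operation.get("parameters", [])}
    return {
        "name": operation["operationId"],
        "description": operation["description"],
        "parameters": {"type": "object", "properties": properties},
    }


op = {
    "operationId": "getWeather",
    "description": "Get the weather for a city.",
    "parameters": [{"name": "city", "in": "query", "schema": {"type": "string"}}],
}
print(operation_to_function(op)["name"])  # getWeather
```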

Usage example:
```python
from haystack.components.converters import OpenAPIServiceToFunctions

converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]
```

<a id="openapi_functions.OpenAPIServiceToFunctions.__init__"></a>

#### OpenAPIServiceToFunctions.\_\_init\_\_

```python
def __init__()
```

Create an OpenAPIServiceToFunctions component.

<a id="openapi_functions.OpenAPIServiceToFunctions.run"></a>

#### OpenAPIServiceToFunctions.run

```python
@component.output_types(functions=list[dict[str, Any]],
                        openapi_specs=list[dict[str, Any]])
def run(sources: list[Union[str, Path, ByteStream]]) -> dict[str, Any]
```

Converts OpenAPI definitions into OpenAI function calling format.

**Arguments**:

- `sources`: File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

**Raises**:

- `RuntimeError`: If the OpenAPI definitions cannot be downloaded or processed.
- `ValueError`: If the source type is not recognized or no functions are found in the OpenAPI definitions.

**Returns**:

A dictionary with the following keys:
- `functions`: Function definitions in JSON object format
- `openapi_specs`: OpenAPI specs in JSON/YAML object format with resolved references

<a id="output_adapter"></a>

## Module output\_adapter

<a id="output_adapter.OutputAdaptationException"></a>

### OutputAdaptationException

Exception raised when there is an error during output adaptation.

<a id="output_adapter.OutputAdapter"></a>

### OutputAdapter

Adapts output of a Component using Jinja templates.

Usage example:
```python
from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"
```

<a id="output_adapter.OutputAdapter.__init__"></a>

#### OutputAdapter.\_\_init\_\_

```python
def __init__(template: str,
             output_type: TypeAlias,
             custom_filters: Optional[dict[str, Callable]] = None,
             unsafe: bool = False)
```

Create an OutputAdapter component.

**Arguments**:

- `template`: A Jinja template that defines how to adapt the input data.
The variables in the template define the input of this instance.
For example, with this template:
```
{{ documents[0].content }}
```
the component input will be `documents`.
- `output_type`: The type of output this instance will return.
- `custom_filters`: A dictionary of custom Jinja filters used in the template.
- `unsafe`: Enable execution of arbitrary code in the Jinja template.
Only use this if you trust the source of the template, as it can lead to remote code execution.

<a id="output_adapter.OutputAdapter.run"></a>

#### OutputAdapter.run

```python
def run(**kwargs)
```

Renders the Jinja template with the provided inputs.

**Arguments**:

- `kwargs`: Must contain all variables used in the `template` string.

**Raises**:

- `OutputAdaptationException`: If template rendering fails.

**Returns**:

A dictionary with the following keys:
- `output`: Rendered Jinja template.
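
Because the template is plain Jinja, a `custom_filters` entry behaves like any Jinja filter registration. A minimal sketch using `jinja2` directly to show the mechanism (assumes Jinja2 is installed; `shout` is a made-up filter name):

```python
from jinja2 import Environment

env = Environment()
# Registering a filter here mirrors passing {"shout": ...} as custom_filters.
env.filters["shout"] = lambda value: value.upper()

template = env.from_string("{{ documents[0] | shout }}")
print(template.render(documents=["test content"]))  # TEST CONTENT
```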
<a id="output_adapter.OutputAdapter.to_dict"></a>

#### OutputAdapter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="output_adapter.OutputAdapter.from_dict"></a>

#### OutputAdapter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "OutputAdapter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="pdfminer"></a>

## Module pdfminer

<a id="pdfminer.CID_PATTERN"></a>

#### CID\_PATTERN

Regex pattern to detect CID characters.

<a id="pdfminer.PDFMinerToDocument"></a>

### PDFMinerToDocument

Converts PDF files to Documents.

Uses `pdfminer`-compatible converters to convert PDF files to Documents.
See the [pdfminer documentation](https://pdfminersix.readthedocs.io/en/latest/).

Usage example:
```python
from datetime import datetime

from haystack.components.converters.pdfminer import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="pdfminer.PDFMinerToDocument.__init__"></a>

#### PDFMinerToDocument.\_\_init\_\_

```python
def __init__(line_overlap: float = 0.5,
             char_margin: float = 2.0,
             line_margin: float = 0.5,
             word_margin: float = 0.1,
             boxes_flow: Optional[float] = 0.5,
             detect_vertical: bool = True,
             all_texts: bool = False,
             store_full_path: bool = False) -> None
```

Create a PDFMinerToDocument component.
**Arguments**:

- `line_overlap`: Determines whether two characters are considered to be on the same line
based on the amount of overlap between them.
The overlap is calculated relative to the minimum height of both characters.
- `char_margin`: Determines whether two characters are part of the same line based on the distance between them.
If the distance is less than the specified margin, the characters are considered to be on the same line.
The margin is calculated relative to the width of the character.
- `word_margin`: Determines whether two characters on the same line are part of the same word
based on the distance between them. If the distance is greater than the specified margin,
an intermediate space is added between them to make the text more readable.
The margin is calculated relative to the width of the character.
- `line_margin`: Determines whether two lines are part of the same paragraph based on
the distance between them. If the distance is less than the specified margin,
the lines are considered to be part of the same paragraph.
The margin is calculated relative to the height of a line.
- `boxes_flow`: Determines the importance of horizontal and vertical position when
ordering text boxes. Accepts a value between -1.0 and +1.0,
where -1.0 means only horizontal position matters and +1.0 means
only vertical position matters. Setting the value to `None` disables advanced
layout analysis, and text boxes are ordered by the position of their bottom-left corner.
- `detect_vertical`: Whether vertical text should be considered during layout analysis.
- `all_texts`: Whether layout analysis should be performed on text in figures.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pdfminer.PDFMinerToDocument.detect_undecoded_cid_characters"></a>

#### PDFMinerToDocument.detect\_undecoded\_cid\_characters

```python
def detect_undecoded_cid_characters(text: str) -> dict[str, Any]
```

Looks for CID character sequences, that is, characters that haven't been properly decoded from their CID format.

This is useful to detect whether the text extractor failed to extract the text correctly, for example if the PDF
uses non-standard fonts.

A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like
searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor
needs. If that map is not available, the text extractor cannot decode the CID characters and returns them
as is.

See: https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output

**Arguments**:

- `text`: The text to check for undecoded CID characters.

**Returns**:

A dictionary containing the detection results.

<a id="pdfminer.PDFMinerToDocument.run"></a>

#### PDFMinerToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts PDF files to Documents.

**Arguments**:

- `sources`: List of PDF file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="pptx"></a>

## Module pptx

<a id="pptx.PPTXToDocument"></a>

### PPTXToDocument

Converts PPTX files to Documents.

Usage example:
```python
from datetime import datetime

from haystack.components.converters.pptx import PPTXToDocument

converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'
```

<a id="pptx.PPTXToDocument.__init__"></a>

#### PPTXToDocument.\_\_init\_\_

```python
def __init__(store_full_path: bool = False)
```

Create a PPTXToDocument component.

**Arguments**:

- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pptx.PPTXToDocument.run"></a>

#### PPTXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts PPTX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="pypdf"></a>

## Module pypdf

<a id="pypdf.PyPDFExtractionMode"></a>

### PyPDFExtractionMode

The mode to use for extracting text from a PDF.

<a id="pypdf.PyPDFExtractionMode.__str__"></a>

#### PyPDFExtractionMode.\_\_str\_\_

```python
def __str__() -> str
```

Convert a PyPDFExtractionMode enum to a string.

<a id="pypdf.PyPDFExtractionMode.from_str"></a>

#### PyPDFExtractionMode.from\_str

```python
@staticmethod
def from_str(string: str) -> "PyPDFExtractionMode"
```

Convert a string to a PyPDFExtractionMode enum.

<a id="pypdf.PyPDFToDocument"></a>

### PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses the PyPDF library.
You can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

<a id="pypdf.PyPDFToDocument.__init__"></a>

#### PyPDFToDocument.\_\_init\_\_

```python
def __init__(*,
             extraction_mode: Union[
                 str, PyPDFExtractionMode] = PyPDFExtractionMode.PLAIN,
             plain_mode_orientations: tuple = (0, 90, 180, 270),
             plain_mode_space_width: float = 200.0,
             layout_mode_space_vertically: bool = True,
             layout_mode_scale_weight: float = 1.25,
             layout_mode_strip_rotated: bool = True,
             layout_mode_font_height_weight: float = 1.0,
             store_full_path: bool = False)
```

Create a PyPDFToDocument component.

**Arguments**:

- `extraction_mode`: The mode to use for extracting text from a PDF.
Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
- `plain_mode_orientations`: Tuple of orientations to look for when extracting text from a PDF in plain mode.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- `plain_mode_space_width`: Forces a default space width if it cannot be extracted from the font.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- `layout_mode_space_vertically`: Whether to include blank lines inferred from y distance plus font height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_scale_weight`: Multiplier for string length when calculating the weighted average character width.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_strip_rotated`: Layout mode does not support rotated text. Set to `False` to include rotated text anyway.
If rotated text is discovered, layout will be degraded and a warning will be logged.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `layout_mode_font_height_weight`: Multiplier for font height when calculating blank line height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="pypdf.PyPDFToDocument.to_dict"></a>

#### PyPDFToDocument.to\_dict

```python
def to_dict()
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="pypdf.PyPDFToDocument.from_dict"></a>

#### PyPDFToDocument.from\_dict

```python
@classmethod
def from_dict(cls, data)
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

Deserialized component.

<a id="pypdf.PyPDFToDocument.run"></a>

#### PyPDFToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts PDF files to documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects to convert.
- `meta`: Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they are zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of converted documents.

<a id="tika"></a>

## Module tika

<a id="tika.XHTMLParser"></a>

### XHTMLParser

Custom parser to extract pages from Tika XHTML content.
<a id="tika.XHTMLParser.handle_starttag"></a>

#### XHTMLParser.handle\_starttag

```python
def handle_starttag(tag: str, attrs: list[tuple])
```

Identify the start of a page div.

<a id="tika.XHTMLParser.handle_endtag"></a>

#### XHTMLParser.handle\_endtag

```python
def handle_endtag(tag: str)
```

Identify the end of a page div.

<a id="tika.XHTMLParser.handle_data"></a>

#### XHTMLParser.handle\_data

```python
def handle_data(data: str)
```

Populate the page content.

<a id="tika.TikaDocumentConverter"></a>

### TikaDocumentConverter

Converts files of different types to Documents.

This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
requires a running Tika server.
For more options on running Tika,
see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

Usage example:
```python
from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'
```

<a id="tika.TikaDocumentConverter.__init__"></a>

#### TikaDocumentConverter.\_\_init\_\_

```python
def __init__(tika_url: str = "http://localhost:9998/tika",
             store_full_path: bool = False)
```

Create a TikaDocumentConverter component.

**Arguments**:

- `tika_url`: Tika server URL.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.
<a id="tika.TikaDocumentConverter.run"></a>

#### TikaDocumentConverter.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created Documents

<a id="txt"></a>

## Module txt

<a id="txt.TextFileToDocument"></a>

### TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'
```

<a id="txt.TextFileToDocument.__init__"></a>

#### TextFileToDocument.\_\_init\_\_

```python
def __init__(encoding: str = "utf-8", store_full_path: bool = False)
```

Creates a TextFileToDocument component.
**Arguments**:

- `encoding`: The encoding of the text files to convert.
If the encoding is specified in the metadata of a source ByteStream,
it overrides this value.
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

<a id="txt.TextFileToDocument.run"></a>

#### TextFileToDocument.run

```python
@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)
```

Converts text files to documents.

**Arguments**:

- `sources`: List of text file paths or ByteStream objects to convert.
- `meta`: Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they're zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: A list of converted documents.

<a id="xlsx"></a>

## Module xlsx

<a id="xlsx.XLSXToDocument"></a>

### XLSXToDocument

Converts XLSX (Excel) files into Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is
created for each sheet. The content of the Document is the table, which can be saved in CSV or Markdown format.
### Usage example

```python
from datetime import datetime

from haystack.components.converters.xlsx import XLSXToDocument

converter = XLSXToDocument()
results = converter.run(sources=["sample.xlsx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# ',A,B\n1,col_a,col_b\n2,1.5,test\n'
```

<a id="xlsx.XLSXToDocument.__init__"></a>

#### XLSXToDocument.\_\_init\_\_

```python
def __init__(table_format: Literal["csv", "markdown"] = "csv",
             sheet_name: Union[str, int, list[Union[str, int]], None] = None,
             read_excel_kwargs: Optional[dict[str, Any]] = None,
             table_format_kwargs: Optional[dict[str, Any]] = None,
             *,
             store_full_path: bool = False)
```

Creates an XLSXToDocument component.

**Arguments**:

- `table_format`: The format to convert the Excel file to.
- `sheet_name`: The name of the sheet to read. If None, all sheets are read.
- `read_excel_kwargs`: Additional arguments to pass to `pandas.read_excel`.
See https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel
- `table_format_kwargs`: Additional keyword arguments to pass to the table format function.
  - If `table_format` is "csv", these arguments are passed to `pandas.DataFrame.to_csv`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
  - If `table_format` is "markdown", these arguments are passed to `pandas.DataFrame.to_markdown`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
- `store_full_path`: If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.
<a id="xlsx.XLSXToDocument.run"></a>

#### XLSXToDocument.run

```python
@component.output_types(documents=list[Document])
def run(
    sources: list[Union[str, Path, ByteStream]],
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
) -> dict[str, list[Document]]
```

Converts XLSX files to Documents.

**Arguments**:

- `sources`: List of file paths or ByteStream objects.
- `meta`: Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns**:

A dictionary with the following keys:
- `documents`: Created documents