---
title: "Converters"
id: converters-api
description: "Various converters to transform data from one format to another."
slug: "/converters-api"
---

## azure

### AzureOCRDocumentConverter

Converts files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see the
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

<!-- test-ignore -->

```python
import os
from datetime import datetime
from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(
    endpoint=os.environ["CORE_AZURE_CS_ENDPOINT"],
    api_key=Secret.from_env_var("CORE_AZURE_CS_API_KEY"),
)
results = converter.run(
    sources=["test/test_files/pdf/react_paper.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    endpoint: str,
    api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
    model_id: str = "prebuilt-read",
    preceding_context_len: int = 3,
    following_context_len: int = 3,
    merge_multiple_column_headers: bool = True,
    page_layout: Literal["natural", "single_column"] = "natural",
    threshold_y: float | None = 0.05,
    store_full_path: bool = False,
) -> None
```

Creates an AzureOCRDocumentConverter component.

**Parameters:**

- **endpoint** (<code>str</code>) – The endpoint of your Azure resource.
- **api_key** (<code>Secret</code>) – The API key of your Azure resource.
- **model_id** (<code>str</code>) – The ID of the model you want to use. For a list of available models, see the
  [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
- **preceding_context_len** (<code>int</code>) – Number of lines before a table to include as preceding context
  (this will be added to the metadata).
- **following_context_len** (<code>int</code>) – Number of lines after a table to include as subsequent context
  (this will be added to the metadata).
- **merge_multiple_column_headers** (<code>bool</code>) – If `True`, merges multiple column header rows into a single row.
- **page_layout** (<code>Literal['natural', 'single_column']</code>) – The type of reading order to follow. Possible options:
    - `natural`: Uses the natural reading order determined by Azure.
    - `single_column`: Groups all lines with the same height on the page based on a threshold
      determined by `threshold_y`.
- **threshold_y** (<code>float | None</code>) – Only relevant if `page_layout` is set to `single_column`.
  The threshold, in inches, to determine if two recognized PDF elements are grouped into a
  single line. This is crucial for section headers or numbers which may be spatially separated
  from the remaining text on the horizontal axis.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will be
  zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: List of created Documents
    - `raw_azure_response`: List of raw Azure responses used to create the Documents

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOCRDocumentConverter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>AzureOCRDocumentConverter</code> – The deserialized component.

## csv

### CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.
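Row mode (`conversion_mode="row"`, described under `__init__` below) turns each CSV row into its own Document instead of one Document per file. Conceptually, the row splitting works like this stdlib sketch; the `rows_to_contents` helper is hypothetical and only illustrates the behavior, it is not part of Haystack:

```python
import csv
import io


def rows_to_contents(csv_text: str, content_column: str, delimiter: str = ",") -> list[str]:
    """Illustrative only: mimic row mode by taking one column's value
    per row as the would-be Document content."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [row[content_column] for row in reader]


sample = "title,body\nFirst,hello world\nSecond,goodbye world\n"
print(rows_to_contents(sample, content_column="body"))
# ['hello world', 'goodbye world']
```

In the real component, each extracted value becomes the `content` of a separate Document, with the remaining columns available as metadata.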
### Usage example

```python
from haystack.components.converters.csv import CSVToDocument
from datetime import datetime

converter = CSVToDocument()
results = converter.run(
    sources=["test/test_files/csv/sample_1.csv"], meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]

print(documents[0].content)
# >> 'col1,col2\nrow1,row1\nrow2,row2\n'
```

#### __init__

```python
__init__(
    encoding: str = "utf-8",
    store_full_path: bool = False,
    *,
    conversion_mode: Literal["file", "row"] = "file",
    delimiter: str = ",",
    quotechar: str = '"'
) -> None
```

Creates a CSVToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the CSV files to convert.
  If the encoding is specified in the metadata of a source ByteStream,
  it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
- **conversion_mode** (<code>Literal['file', 'row']</code>) – The conversion mode:
    - `"file"` (default): one Document per CSV file whose content is the raw CSV text.
    - `"row"`: convert each CSV row to its own Document (requires `content_column` in `run()`).
- **delimiter** (<code>str</code>) – CSV delimiter used when parsing in row mode (passed to `csv.DictReader`).
- **quotechar** (<code>str</code>) – CSV quote character used when parsing in row mode (passed to `csv.DictReader`).

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    content_column: str | None = None,
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, Any]
```

Converts CSV files to a Document (file mode) or to one Document per row (row mode).
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **content_column** (<code>str | None</code>) – **Required when** `conversion_mode="row"`.
  The column name whose values become `Document.content` for each row.
  The column must exist in the CSV header.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created documents

## docx

### DOCXMetadata

Describes the metadata of a DOCX file.
**Parameters:**

- **author** (<code>str</code>) – The author
- **category** (<code>str</code>) – The category
- **comments** (<code>str</code>) – The comments
- **content_status** (<code>str</code>) – The content status
- **created** (<code>str | None</code>) – The creation date (ISO formatted string)
- **identifier** (<code>str</code>) – The identifier
- **keywords** (<code>str</code>) – Available keywords
- **language** (<code>str</code>) – The language of the document
- **last_modified_by** (<code>str</code>) – User who last modified the document
- **last_printed** (<code>str | None</code>) – The last printed date (ISO formatted string)
- **modified** (<code>str | None</code>) – The last modification date (ISO formatted string)
- **revision** (<code>int</code>) – The revision number
- **subject** (<code>str</code>) – The subject
- **title** (<code>str</code>) – The title
- **version** (<code>str</code>) – The version

### DOCXTableFormat

Bases: <code>Enum</code>

Supported formats for storing DOCX tabular data in a Document.

#### from_str

```python
from_str(string: str) -> DOCXTableFormat
```

Convert a string to a DOCXTableFormat enum.

### DOCXLinkFormat

Bases: <code>Enum</code>

Supported formats for storing DOCX link information in a Document.

#### from_str

```python
from_str(string: str) -> DOCXLinkFormat
```

Convert a string to a DOCXLinkFormat enum.

### DOCXToDocument

Converts DOCX files to Documents.

Uses the `python-docx` library to convert the DOCX file to a document.
This component does not preserve page breaks in the original document.
Usage example:

```python
from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat
from datetime import datetime

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(
    sources=["test/test_files/docx/sample_docx.docx"], meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]

print(documents[0].content)
# >> 'This is a text from the DOCX file.'
```

#### __init__

```python
__init__(
    table_format: str | DOCXTableFormat = DOCXTableFormat.CSV,
    link_format: str | DOCXLinkFormat = DOCXLinkFormat.NONE,
    store_full_path: bool = False,
) -> None
```

Create a DOCXToDocument component.

**Parameters:**

- **table_format** (<code>str | DOCXTableFormat</code>) – The format for table output. Can be either DOCXTableFormat.MARKDOWN,
  DOCXTableFormat.CSV, "markdown", or "csv".
- **link_format** (<code>str | DOCXLinkFormat</code>) – The format for link output. Can be either:
    - `DOCXLinkFormat.MARKDOWN` or `"markdown"` to get `[text](address)`,
    - `DOCXLinkFormat.PLAIN` or `"plain"` to get `text (address)`,
    - `DOCXLinkFormat.NONE` or `"none"` to get text without links.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DOCXToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.
**Returns:**

- <code>DOCXToDocument</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts DOCX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created Documents

## file_to_file_content

### FileToFileContent

Converts files to FileContent objects to be included in ChatMessage objects.

### Usage example

<!-- test-ignore -->

```python
from haystack.components.converters import FileToFileContent

converter = FileToFileContent()

sources = ["document.pdf", "video.mp4"]

file_contents = converter.run(sources=sources)["file_contents"]
print(file_contents)

# [FileContent(base64_data='...',
#              mime_type='application/pdf',
#              filename='document.pdf',
#              extra={}),
# ...]
```

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    extra: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[FileContent]]
```

Converts files to FileContent objects.
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **extra** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional extra information to attach to the FileContent objects. Can be used to store provider-specific
  information.
  To avoid serialization issues, values should be JSON serializable.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the extra of all produced FileContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.

**Returns:**

- <code>dict\[str, list\[FileContent\]\]</code> – A dictionary with the following keys:
    - `file_contents`: A list of FileContent objects.

## html

### HTMLToDocument

Converts an HTML file to a Document.

Usage example:

```python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["test/test_files/html/paul_graham_superlinear.html"])
documents = results["documents"]

print(documents[0].content)
# >> 'This is a text from the HTML file.'
```

#### __init__

```python
__init__(
    extraction_kwargs: dict[str, Any] | None = None,
    store_full_path: bool = False,
) -> None
```

Create an HTMLToDocument component.

**Parameters:**

- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary containing keyword arguments to customize the extraction process. These
  are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see
  the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HTMLToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>HTMLToDocument</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    extraction_kwargs: dict[str, Any] | None = None,
) -> dict[str, Any]
```

Converts a list of HTML files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of HTML file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to customize the extraction process.
**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created Documents

## image/document_to_image

### DocumentToImageContent

Converts documents sourced from PDF and image files into ImageContents.

This component processes a list of documents and extracts visual content from supported file formats, converting
them into ImageContents that can be used for multimodal AI tasks. It handles both direct image files and PDF
documents by extracting specific pages as images.

Documents are expected to have metadata containing:

- The `file_path_meta_field` key with a valid file path that exists when combined with `root_path`
- A supported image format (MIME type must be one of the supported image types)
- For PDF files, a `page_number` key specifying which page to extract

### Usage example

<!-- test-ignore -->

```python
from haystack import Document
from haystack.components.converters.image.document_to_image import DocumentToImageContent

converter = DocumentToImageContent(
    file_path_meta_field="file_path",
    root_path="/data/files",
    detail="high",
    size=(800, 600)
)

documents = [
    Document(content="Optional description of image.jpg", meta={"file_path": "image.jpg"}),
    Document(content="Text content of page 1 of doc.pdf", meta={"file_path": "doc.pdf", "page_number": 1})
]

result = converter.run(documents)
image_contents = result["image_contents"]
# [ImageContent(
#    base64_image='/9j/4A...', mime_type='image/jpeg', detail='high', meta={'file_path': 'image.jpg'}
#  ),
#  ImageContent(
#    base64_image='/9j/4A...', mime_type='image/jpeg', detail='high',
#    meta={'page_number': 1, 'file_path': 'doc.pdf'}
#  )]
```

#### __init__

```python
__init__(
    *,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> None
```

Initialize the DocumentToImageContent component.

**Parameters:**

- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(documents: list[Document]) -> dict[str, list[ImageContent | None]]
```

Convert documents with image or PDF sources into ImageContent objects.

This method processes the input documents, extracting images from supported file formats and converting them
into ImageContent objects.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to process. Each document should have metadata containing at minimum
  a 'file_path_meta_field' key. PDF documents additionally require a 'page_number' key to specify which
  page to convert.
**Returns:**

- <code>dict\[str, list\[ImageContent | None\]\]</code> – Dictionary containing one key:
    - "image_contents": ImageContents created from the processed documents. These contain base64-encoded image
      data and metadata. The order corresponds to the order of the input documents.

**Raises:**

- <code>ValueError</code> – If any document is missing the required metadata keys, has an invalid file path, or has an unsupported
  MIME type. The error message will specify which document and what information is missing or incorrect.

## image/file_to_document

### ImageFileToDocument

Converts image file references into empty Document objects with associated metadata.

This component is useful in pipelines where image file paths need to be wrapped in `Document` objects to be
processed by downstream components such as the `SentenceTransformersImageDocumentEmbedder`.

It does **not** extract any content from the image files, instead it creates `Document` objects with `None` as
their content and attaches metadata such as file path and any user-provided values.

### Usage example

```python
from haystack.components.converters.image import ImageFileToDocument

converter = ImageFileToDocument()

sources = ["image.jpg", "another_image.png"]

result = converter.run(sources=sources)
documents = result["documents"]

print(documents)

# [Document(id=..., meta: {'file_path': 'image.jpg'}),
#  Document(id=..., meta: {'file_path': 'another_image.png'})]
```

#### __init__

```python
__init__(*, store_full_path: bool = False) -> None
```

Initialize the ImageFileToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
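The effect of `store_full_path` can be sketched in plain Python (illustrative only; the `to_empty_docs` helper is hypothetical and the real component produces `Document` objects, not dicts):

```python
import os


def to_empty_docs(sources: list[str], store_full_path: bool = False) -> list[dict]:
    # Illustrative only: no image content is read; each record gets
    # None content plus file-path metadata, like the produced Documents.
    docs = []
    for src in sources:
        path = src if store_full_path else os.path.basename(src)
        docs.append({"content": None, "meta": {"file_path": path}})
    return docs


print(to_empty_docs(["/data/files/image.jpg", "another_image.png"]))
# [{'content': None, 'meta': {'file_path': 'image.jpg'}},
#  {'content': None, 'meta': {'file_path': 'another_image.png'}}]
```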
#### run

```python
run(
    *,
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document]]
```

Convert image files into empty Document objects with metadata.

This method accepts image file references (as file paths or ByteStreams) and creates `Document` objects
without content. These documents are enriched with metadata derived from the input source and optional
user-provided metadata.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they are zipped together.
  For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary containing:
    - `documents`: A list of `Document` objects with empty content and associated metadata.

## image/file_to_image

### ImageFileToImageContent

Converts image files to ImageContent objects.

### Usage example

```python
from haystack.components.converters.image import ImageFileToImageContent

converter = ImageFileToImageContent()

sources = ["image.jpg", "another_image.png"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
#               mime_type='image/jpeg',
#               detail=None,
#               meta={'file_path': 'image.jpg'}),
# ...]
```

#### __init__

```python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> None
```

Create the ImageFileToImageContent component.

**Parameters:**

- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
  If not provided, the detail level will be the one set in the constructor.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
  If not provided, the size value will be the one set in the constructor.

**Returns:**

- <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys:
    - `image_contents`: A list of ImageContent objects.

## image/pdf_to_image

### PDFToImageContent

Converts PDF files to ImageContent objects.

### Usage example

```python
from haystack.components.converters.image import PDFToImageContent

converter = PDFToImageContent()

sources = ["file.pdf", "another_file.pdf"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
#               mime_type='application/pdf',
#               detail=None,
#               meta={'file_path': 'file.pdf', 'page_number': 1}),
# ...]
```

#### __init__

```python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
) -> None
```

Create the PDFToImageContent component.

**Parameters:**

- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages)
  will be skipped with a warning. For example, page_range=[1, 3] will convert only the first and third
  pages of the document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12']
  will convert pages 1, 2, 3, 5, 8, 10, 11, 12.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
  If not provided, the detail level will be the one set in the constructor.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
  If not provided, the size value will be the one set in the constructor.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages)
  will be skipped with a warning. For example, page_range=[1, 3] will convert only the first and third
  pages of the document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12']
  will convert pages 1, 2, 3, 5, 8, 10, 11, 12.
  If not provided, the page_range value will be the one set in the constructor.

**Returns:**

- <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys:
    - `image_contents`: A list of ImageContent objects.

## json

### JSONConverter

Converts one or more JSON files into a text document.
### Usage examples

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'
```

Optionally, you can also provide a `jq_schema` string to filter the JSON source files and `extra_meta_fields`
to extract from the filtered data:

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```

#### __init__

```python
__init__(
    jq_schema: str | None = None,
    content_key: str | None = None,
    extra_meta_fields: set[str] | Literal["*"] | None = None,
    store_full_path: bool = False,
) -> None
```

Creates a JSONConverter component.

An optional `jq_schema` can be provided to extract nested data in the JSON source files.
See the [official jq documentation](https://jqlang.github.io/jq/) for more info on the filter syntax.
If `jq_schema` is not set, the whole JSON source files will be used to extract content.

Optionally, you can provide a `content_key` to specify which key in the extracted object must
be set as the document's content.

If both `jq_schema` and `content_key` are set, the component will search for the `content_key` in
the JSON object extracted by `jq_schema`. If the extracted data is not a JSON object, it will be skipped.

If only `jq_schema` is set, the extracted data must be a scalar value. If it's a JSON object or array,
it will be skipped.

If only `content_key` is set, the source JSON file must be a JSON object, else it will be skipped.

`extra_meta_fields` can either be set to a set of strings or a literal `"*"` string.
If it's a set of strings, it must specify fields in the extracted objects that must be set in
the extracted documents. If a field is not found, the meta value will be `None`.
If set to `"*"`, all fields other than `content_key` found in the filtered JSON object will
be saved as metadata.

Initialization will fail if neither `jq_schema` nor `content_key` is set.

**Parameters:**

- **jq_schema** (<code>str | None</code>) – Optional jq filter string to extract content.
If not specified, the whole JSON object will be used to extract information.
- **content_key** (<code>str | None</code>) – Optional key to extract document content.
If `jq_schema` is specified, the `content_key` will be extracted from that object.
- **extra_meta_fields** (<code>set\[str\] | Literal['\*'] | None</code>) – An optional set of meta keys to extract from the content.
If `jq_schema` is specified, all keys will be extracted from that object.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> JSONConverter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>JSONConverter</code> – Deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts a list of JSON files to documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, the length of the list must match the number of sources.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.
**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: A list of created documents.

## markdown

### MarkdownToDocument

Converts a Markdown file into a text Document.

Usage example:

```python
from haystack.components.converters import MarkdownToDocument
from datetime import datetime

converter = MarkdownToDocument()
results = converter.run(
    sources=["test/test_files/markdown/sample.md"], meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'
```

#### __init__

```python
__init__(
    table_to_single_line: bool = False,
    progress_bar: bool = True,
    store_full_path: bool = False,
) -> None
```

Create a MarkdownToDocument component.

**Parameters:**

- **table_to_single_line** (<code>bool</code>) – If True, converts table contents into a single line.
- **progress_bar** (<code>bool</code>) – If True, shows a progress bar when running.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts a list of Markdown files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: List of created Documents

## msg

### MSGToDocument

Converts Microsoft Outlook .msg files into Haystack Documents.

This component extracts email metadata (such as sender, recipients, CC, BCC, subject) and body content from .msg
files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg
file are extracted as ByteStream objects.

### Example Usage

```python
from haystack.components.converters.msg import MSGToDocument
from datetime import datetime

converter = MSGToDocument()
results = converter.run(sources=["test/test_files/msg/sample.msg"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
attachments = results["attachments"]
print(documents[0].content)
```

#### __init__

```python
__init__(store_full_path: bool = False) -> None
```

Creates an MSGToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document] | list[ByteStream]]
```

Converts MSG files to Documents.
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, list\[Document\] | list\[ByteStream\]\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents.
  - `attachments`: Created ByteStream objects from file attachments.

## multi_file_converter

### MultiFileConverter

A file converter that handles conversion of multiple file types.

The MultiFileConverter handles the following file types:

- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX

Usage example:

```python
from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test/test_files/txt/doc_1.txt", "test/test_files/pdf/sample_pdf_1.pdf"], meta={})
```

#### __init__

```python
__init__(encoding: str = 'utf-8', json_content_key: str = 'content') -> None
```

Initialize the MultiFileConverter.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding to use when reading files.
- **json_content_key** (<code>str</code>) – The key to use for the content field of a document when converting JSON files.
## openapi_functions

### OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must respect OpenAPI specification 3.0.0 or higher.
It can be specified in JSON or YAML format.
Each function must have:

- a unique operationId
- a description
- a requestBody and/or parameters
- a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the [official documentation](https://github.com/OAI/OpenAPI-Specification).
For more details on OpenAI function calling, see the [official documentation](https://platform.openai.com/docs/guides/function-calling).

Usage example:

```python
from haystack.components.converters import OpenAPIServiceToFunctions
from haystack.dataclasses.byte_stream import ByteStream

converter = OpenAPIServiceToFunctions()
spec = ByteStream.from_string(
    '{"openapi":"3.0.0","info":{"title":"API","version":"1.0.0"},"paths":{"/search":{"get":{"operationId":"search","summary":"Search","parameters":[{"name":"q","in":"query","required":true,"schema":{"type":"string"}}]}}}}'
)
result = converter.run(sources=[spec])
assert result["functions"]
```

#### __init__

```python
__init__() -> None
```

Create an OpenAPIServiceToFunctions component.

#### run

```python
run(sources: list[str | Path | ByteStream]) -> dict[str, Any]
```

Converts OpenAPI definitions to OpenAI function calling format.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).
**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `functions`: Function definitions in JSON object format
  - `openapi_specs`: OpenAPI specs in JSON/YAML object format with resolved references

**Raises:**

- <code>RuntimeError</code> – If the OpenAPI definitions cannot be downloaded or processed.
- <code>ValueError</code> – If the source type is not recognized or no functions are found in the OpenAPI definitions.

## output_adapter

### OutputAdaptationException

Bases: <code>Exception</code>

Exception raised when there is an error during output adaptation.

### OutputAdapter

Adapts output of a Component using Jinja templates.

Usage example:

```python
from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"
```

#### __init__

```python
__init__(
    template: str,
    output_type: TypeAlias,
    custom_filters: dict[str, Callable] | None = None,
    unsafe: bool = False,
) -> None
```

Create an OutputAdapter component.

**Parameters:**

- **template** (<code>str</code>) – A Jinja template that defines how to adapt the input data.
The variables in the template define the input of this instance, e.g. with this template:

  ```
  {{ documents[0].content }}
  ```

  the Component input will be `documents`.

- **output_type** (<code>TypeAlias</code>) – The type of output this instance will return.
- **custom_filters** (<code>dict\[str, Callable\] | None</code>) – A dictionary of custom Jinja filters used in the template.
- **unsafe** (<code>bool</code>) – Enable execution of arbitrary code in the Jinja template.
This should only be used if you trust the source of the template, as it can lead to remote code execution.

#### run

```python
run(**kwargs: Any) -> dict[str, Any]
```

Renders the Jinja template with the provided inputs.

**Parameters:**

- **kwargs** (<code>Any</code>) – Must contain all variables used in the `template` string.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `output`: Rendered Jinja template.

**Raises:**

- <code>OutputAdaptationException</code> – If template rendering fails.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OutputAdapter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>OutputAdapter</code> – The deserialized component.

## pdfminer

### PDFMinerToDocument

Converts PDF files to Documents.

Uses `pdfminer` compatible converters to convert PDF files to Documents.
See the [pdfminer documentation](https://pdfminersix.readthedocs.io/en/latest/) for more information.

Usage example:

```python
from haystack.components.converters.pdfminer import PDFMinerToDocument
from datetime import datetime

converter = PDFMinerToDocument()
results = converter.run(
    sources=["test/test_files/pdf/sample_pdf_1.pdf"], meta={"date_added": datetime.now().isoformat()}
)

print(results["documents"][0].content)
# >> 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    line_overlap: float = 0.5,
    char_margin: float = 2.0,
    line_margin: float = 0.5,
    word_margin: float = 0.1,
    boxes_flow: float | None = 0.5,
    detect_vertical: bool = True,
    all_texts: bool = False,
    store_full_path: bool = False,
) -> None
```

Create a PDFMinerToDocument component.

**Parameters:**

- **line_overlap** (<code>float</code>) – This parameter determines whether two characters are considered to be on
the same line based on the amount of overlap between them.
The overlap is calculated relative to the minimum height of both characters.
- **char_margin** (<code>float</code>) – Determines whether two characters are part of the same line based on the distance between them.
If the distance is less than the margin specified, the characters are considered to be on the same line.
The margin is calculated relative to the width of the character.
- **word_margin** (<code>float</code>) – Determines whether two characters on the same line are part of the same word
based on the distance between them. If the distance is greater than the margin specified,
an intermediate space will be added between them to make the text more readable.
The margin is calculated relative to the width of the character.
- **line_margin** (<code>float</code>) – This parameter determines whether two lines are part of the same paragraph based on
the distance between them. If the distance is less than the margin specified,
the lines are considered to be part of the same paragraph.
The margin is calculated relative to the height of a line.
- **boxes_flow** (<code>float | None</code>) – This parameter determines the importance of horizontal and vertical position when
determining the order of text boxes. A value between -1.0 and +1.0 can be set,
with -1.0 indicating that only horizontal position matters and +1.0 indicating
that only vertical position matters. Setting the value to 'None' will disable advanced
layout analysis, and text boxes will be ordered based on the position of their bottom left corner.
- **detect_vertical** (<code>bool</code>) – This parameter determines whether vertical text should be considered during layout analysis.
- **all_texts** (<code>bool</code>) – If layout analysis should be performed on text in figures.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### detect_undecoded_cid_characters

```python
detect_undecoded_cid_characters(text: str) -> dict[str, Any]
```

Look for CID character sequences, i.e. characters that haven't been properly decoded from their CID format.

This is useful to detect if the text extractor is not able to extract the text correctly, e.g. if the PDF uses
non-standard fonts.

A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like
searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor
needs.
If that map is not available, the text extractor cannot decode the CID characters and will return them
as is.

See the [pdfminer FAQ](https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output) for more details.

**Parameters:**

- **text** (<code>str</code>) – The text to check for undecoded CID characters.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary containing detection results.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts PDF files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of PDF file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents

## pptx

### PPTXToDocument

Converts PPTX files to Documents.
Usage example:

```python
from haystack.components.converters.pptx import PPTXToDocument
from datetime import datetime

converter = PPTXToDocument()
results = converter.run(
    sources=["test/test_files/pptx/sample_pptx.pptx"], meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]

print(documents[0].content)
# >> 'This is the text from the PPTX file.'
```

#### __init__

```python
__init__(
    store_full_path: bool = False,
    link_format: Literal["markdown", "plain", "none"] = "none",
) -> None
```

Create a PPTXToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.
- **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options:
  - `"markdown"`: `[text](url)`
  - `"plain"`: `text (url)`
  - `"none"`: Only the text is extracted, link addresses are ignored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts PPTX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents

## pypdf

### PyPDFExtractionMode

Bases: <code>Enum</code>

The mode to use for extracting text from a PDF.

#### from_str

```python
from_str(string: str) -> PyPDFExtractionMode
```

Convert a string to a PyPDFExtractionMode enum.

### PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses the PyPDF library.
You can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.pypdf import PyPDFToDocument
from datetime import datetime

converter = PyPDFToDocument()
results = converter.run(
    sources=["test/test_files/pdf/sample_pdf_1.pdf"], meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]

print(documents[0].content)
# >> 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    *,
    extraction_mode: str | PyPDFExtractionMode = PyPDFExtractionMode.PLAIN,
    plain_mode_orientations: tuple = (0, 90, 180, 270),
    plain_mode_space_width: float = 200.0,
    layout_mode_space_vertically: bool = True,
    layout_mode_scale_weight: float = 1.25,
    layout_mode_strip_rotated: bool = True,
    layout_mode_font_height_weight: float = 1.0,
    store_full_path: bool = False
) -> None
```

Create a PyPDFToDocument component.
**Parameters:**

- **extraction_mode** (<code>str | PyPDFExtractionMode</code>) – The mode to use for extracting text from a PDF.
Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
- **plain_mode_orientations** (<code>tuple</code>) – Tuple of orientations to look for when extracting text from a PDF in plain mode.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- **plain_mode_space_width** (<code>float</code>) – Forces default space width if not extracted from font.
Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- **layout_mode_space_vertically** (<code>bool</code>) – Whether to include blank lines inferred from y distance + font height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_scale_weight** (<code>float</code>) – Multiplier for string length when calculating weighted average character width.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_strip_rotated** (<code>bool</code>) – Layout mode does not support rotated text. Set to `False` to include rotated text anyway.
If rotated text is discovered, layout will be degraded and a warning will be logged.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_font_height_weight** (<code>float</code>) – Multiplier for font height when calculating blank line height.
Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.
#### from_dict

```python
from_dict(data: dict[str, Any]) -> PyPDFToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>PyPDFToDocument</code> – Deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts PDF files to documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they are zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: A list of converted documents.

## tika

### XHTMLParser

Bases: <code>HTMLParser</code>

Custom parser to extract pages from Tika XHTML content.

#### handle_starttag

```python
handle_starttag(tag: str, attrs: list[tuple[str, str | None]]) -> None
```

Identify the start of a page div.

#### handle_endtag

```python
handle_endtag(tag: str) -> None
```

Identify the end of a page div.
#### handle_data

```python
handle_data(data: str) -> None
```

Populate the page content.

### TikaDocumentConverter

Converts files of different types to Documents.

This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
requires a running Tika server.
For more options on running Tika,
see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

Usage example:

<!-- test-ignore -->

```python
from haystack.components.converters.tika import TikaDocumentConverter
from datetime import datetime

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]

print(documents[0].content)
# >> 'This is a text from the docx file.'
```

#### __init__

```python
__init__(
    tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False
) -> None
```

Create a TikaDocumentConverter component.

**Parameters:**

- **tika_url** (<code>str</code>) – Tika server URL.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, its length must match the number of sources, because the two lists are zipped together.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents

## txt

### TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["test/test_files/txt/doc_1.txt"])
documents = results["documents"]

print(documents[0].content)
# >> 'This is the content from the txt file.'
```

#### __init__

```python
__init__(encoding: str = 'utf-8', store_full_path: bool = False) -> None
```

Creates a TextFileToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the text files to convert.
  If the encoding is specified in the metadata of a source ByteStream,
  it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
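The encoding precedence described above can be sketched in plain Python. This is an illustrative sketch under the stated assumption that a per-source `encoding` entry in a ByteStream's metadata takes priority over the component-level default; `decode_source` is a hypothetical helper for illustration, not part of the Haystack API.

```python
def decode_source(data: bytes, meta: dict, default_encoding: str = "utf-8") -> str:
    """Decode raw bytes, letting per-source metadata override the default encoding."""
    # A ByteStream-level "encoding" key wins over the component default.
    encoding = meta.get("encoding", default_encoding)
    return data.decode(encoding)


# Latin-1 bytes would be mis-decoded with the UTF-8 default, so the
# per-source metadata specifies the right codec.
text = decode_source("café".encode("latin-1"), meta={"encoding": "latin-1"})
print(text)  # café
```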
#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts text files to documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of text file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they're zipped together.
  For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: A list of converted documents.

## xlsx

### XLSXToDocument

Converts XLSX (Excel) files into Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is
created for each sheet. The content of each Document is the table, which can be saved in CSV or Markdown format.
### Usage example

```python
from haystack.components.converters.xlsx import XLSXToDocument
from datetime import datetime

converter = XLSXToDocument()
results = converter.run(
    sources=["test/test_files/xlsx/basic_tables_two_sheets.xlsx"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]

print(documents[0].content)
# >> ",A,B\n1,col_a,col_b\n2,1.5,test\n"
```

#### __init__

```python
__init__(
    table_format: Literal["csv", "markdown"] = "csv",
    sheet_name: str | int | list[str | int] | None = None,
    read_excel_kwargs: dict[str, Any] | None = None,
    table_format_kwargs: dict[str, Any] | None = None,
    *,
    link_format: Literal["markdown", "plain", "none"] = "none",
    store_full_path: bool = False
) -> None
```

Creates an XLSXToDocument component.

**Parameters:**

- **table_format** (<code>Literal['csv', 'markdown']</code>) – The format to convert the Excel file to.
- **sheet_name** (<code>str | int | list\[str | int\] | None</code>) – The name of the sheet to read. If None, all sheets are read.
- **read_excel_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional arguments to pass to `pandas.read_excel`.
  See https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel
- **table_format_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to pass to the table format function.
    - If `table_format` is "csv", these arguments are passed to `pandas.DataFrame.to_csv`.
      See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
    - If `table_format` is "markdown", these arguments are passed to `pandas.DataFrame.to_markdown`.
      See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
- **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options:
    - `"markdown"`: `[text](url)`
    - `"plain"`: `text (url)`
    - `"none"`: Only the text is extracted; link addresses are ignored.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts an XLSX file to a Document.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, because the two lists are zipped together.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: Created documents
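The `meta` semantics shared by all of these converters' `run` methods (a single dictionary is broadcast to every source, while a list is zipped one-to-one with `sources`) can be sketched as follows. This is an illustrative sketch of the documented behavior, not a Haystack function; `normalize_meta` is a hypothetical helper name.

```python
def normalize_meta(sources: list, meta) -> list[dict]:
    """Expand the meta argument into one dict per source."""
    if meta is None:
        # No metadata: every source gets an empty dict.
        return [{} for _ in sources]
    if isinstance(meta, dict):
        # A single dict is broadcast: each source gets its own copy.
        return [dict(meta) for _ in sources]
    # A list must line up with the sources, because the two are zipped.
    if len(meta) != len(sources):
        raise ValueError("Length of the meta list must match the number of sources.")
    return meta


print(normalize_meta(["a.xlsx", "b.xlsx"], {"team": "docs"}))
# [{'team': 'docs'}, {'team': 'docs'}]
```

For ByteStream sources, the converters additionally merge each stream's own `meta` into the corresponding output document.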