---
title: "Converters"
id: converters-api
description: "Various converters to transform data from one format to another."
slug: "/converters-api"
---

## azure

### AzureOCRDocumentConverter

Converts files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

```python
import os
from datetime import datetime
from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(
    endpoint=os.environ["CORE_AZURE_CS_ENDPOINT"],
    api_key=Secret.from_env_var("CORE_AZURE_CS_API_KEY"),
)
results = converter.run(
    sources=["test/test_files/pdf/react_paper.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    endpoint: str,
    api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
    model_id: str = "prebuilt-read",
    preceding_context_len: int = 3,
    following_context_len: int = 3,
    merge_multiple_column_headers: bool = True,
    page_layout: Literal["natural", "single_column"] = "natural",
    threshold_y: float | None = 0.05,
    store_full_path: bool = False,
)
```

Creates an AzureOCRDocumentConverter component.

**Parameters:**

- **endpoint** (<code>str</code>) – The endpoint of your Azure resource.
- **api_key** (<code>Secret</code>) – The API key of your Azure resource.
- **model_id** (<code>str</code>) – The ID of the model you want to use. For a list of available models, see the
  [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
- **preceding_context_len** (<code>int</code>) – Number of lines before a table to include as preceding context
  (this will be added to the metadata).
- **following_context_len** (<code>int</code>) – Number of lines after a table to include as subsequent context
  (this will be added to the metadata).
- **merge_multiple_column_headers** (<code>bool</code>) – If `True`, merges multiple column header rows into a single row.
- **page_layout** (<code>Literal['natural', 'single_column']</code>) – The type of reading order to follow. Possible options:
    - `natural`: Uses the natural reading order determined by Azure.
    - `single_column`: Groups all lines with the same height on the page based on a threshold
      determined by `threshold_y`.
- **threshold_y** (<code>float | None</code>) – Only relevant if `page_layout` is set to `single_column`.
  The threshold, in inches, to determine if two recognized PDF elements are grouped into a
  single line. This is crucial for section headers or numbers which may be spatially separated
  from the remaining text on the horizontal axis.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will be
  zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- – A dictionary with the following keys:
    - `documents`: List of created Documents
    - `raw_azure_response`: List of raw Azure responses used to create the Documents

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOCRDocumentConverter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>AzureOCRDocumentConverter</code> – The deserialized component.

## csv

### CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.
### Usage example

```python
from datetime import datetime

from haystack.components.converters.csv import CSVToDocument

converter = CSVToDocument()
results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'col1,col2\nrow1,row1\nrow2,row2\n'
```

#### __init__

```python
__init__(
    encoding: str = "utf-8",
    store_full_path: bool = False,
    *,
    conversion_mode: Literal["file", "row"] = "file",
    delimiter: str = ",",
    quotechar: str = '"'
)
```

Creates a CSVToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the CSV files to convert.
  If the encoding is specified in the metadata of a source ByteStream,
  it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
- **conversion_mode** (<code>Literal['file', 'row']</code>) –
    - "file" (default): one Document per CSV file whose content is the raw CSV text.
    - "row": convert each CSV row to its own Document (requires `content_column` in `run()`).
- **delimiter** (<code>str</code>) – CSV delimiter used when parsing in row mode (passed to `csv.DictReader`).
- **quotechar** (<code>str</code>) – CSV quote character used when parsing in row mode (passed to `csv.DictReader`).

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    content_column: str | None = None,
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
)
```

Converts CSV files to a Document (file mode) or to one Document per row (row mode).

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **content_column** (<code>str | None</code>) – **Required when** `conversion_mode="row"`.
  The column name whose values become `Document.content` for each row.
  The column must exist in the CSV header.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- – A dictionary with the following keys:
    - `documents`: Created documents

## docx

### DOCXMetadata

Describes the metadata of a DOCX file.

**Parameters:**

- **author** (<code>str</code>) – The author
- **category** (<code>str</code>) – The category
- **comments** (<code>str</code>) – The comments
- **content_status** (<code>str</code>) – The content status
- **created** (<code>str | None</code>) – The creation date (ISO formatted string)
- **identifier** (<code>str</code>) – The identifier
- **keywords** (<code>str</code>) – Available keywords
- **language** (<code>str</code>) – The language of the document
- **last_modified_by** (<code>str</code>) – User who last modified the document
- **last_printed** (<code>str | None</code>) – The last printed date (ISO formatted string)
- **modified** (<code>str | None</code>) – The last modification date (ISO formatted string)
- **revision** (<code>int</code>) – The revision number
- **subject** (<code>str</code>) – The subject
- **title** (<code>str</code>) – The title
- **version** (<code>str</code>) – The version

### DOCXTableFormat

Bases: <code>Enum</code>
Supported formats for storing DOCX tabular data in a Document.

#### from_str

```python
from_str(string: str) -> DOCXTableFormat
```

Convert a string to a DOCXTableFormat enum.

### DOCXLinkFormat

Bases: <code>Enum</code>

Supported formats for storing DOCX link information in a Document.

#### from_str

```python
from_str(string: str) -> DOCXLinkFormat
```

Convert a string to a DOCXLinkFormat enum.

### DOCXToDocument

Converts DOCX files to Documents.

Uses the `python-docx` library to convert the DOCX file to a document.
This component does not preserve page breaks in the original document.

Usage example:

```python
from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'
```

#### __init__

```python
__init__(
    table_format: str | DOCXTableFormat = DOCXTableFormat.CSV,
    link_format: str | DOCXLinkFormat = DOCXLinkFormat.NONE,
    store_full_path: bool = False,
)
```

Create a DOCXToDocument component.

**Parameters:**

- **table_format** (<code>str | DOCXTableFormat</code>) – The format for table output. Can be either DOCXTableFormat.MARKDOWN,
  DOCXTableFormat.CSV, "markdown", or "csv".
- **link_format** (<code>str | DOCXLinkFormat</code>) – The format for link output. Can be either:
  DOCXLinkFormat.MARKDOWN or "markdown" to get `[text](address)`,
  DOCXLinkFormat.PLAIN or "plain" to get `text (address)`,
  DOCXLinkFormat.NONE or "none" to get text without links.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DOCXToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>DOCXToDocument</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts DOCX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- – A dictionary with the following keys:
    - `documents`: Created Documents

## file_to_file_content

### FileToFileContent

Converts files to FileContent objects to be included in ChatMessage objects.
### Usage example

```python
from haystack.components.converters import FileToFileContent

converter = FileToFileContent()

sources = ["document.pdf", "video.mp4"]

file_contents = converter.run(sources=sources)["file_contents"]
print(file_contents)

# [FileContent(base64_data='...',
#              mime_type='application/pdf',
#              filename='document.pdf',
#              extra={}),
#  ...]
```

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    extra: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[FileContent]]
```

Converts files to FileContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **extra** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional extra information to attach to the FileContent objects.
  Can be used to store provider-specific information.
  To avoid serialization issues, values should be JSON serializable.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the extra of all produced FileContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.

**Returns:**

- <code>dict\[str, list\[FileContent\]\]</code> – A dictionary with the following keys:
    - `file_contents`: A list of FileContent objects.

## html

### HTMLToDocument

Converts an HTML file to a Document.

Usage example:

```python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'
```

#### __init__

```python
__init__(
    extraction_kwargs: dict[str, Any] | None = None,
    store_full_path: bool = False,
)
```

Create an HTMLToDocument component.

**Parameters:**

- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary containing keyword arguments to customize the extraction process. These
  are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see
  the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HTMLToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>HTMLToDocument</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    extraction_kwargs: dict[str, Any] | None = None,
)
```

Converts a list of HTML files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of HTML file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to customize the extraction process.

**Returns:**

- – A dictionary with the following keys:
    - `documents`: Created Documents

## image/document_to_image

### DocumentToImageContent

Converts documents sourced from PDF and image files into ImageContents.

This component processes a list of documents and extracts visual content from supported file formats, converting
them into ImageContents that can be used for multimodal AI tasks. It handles both direct image files and PDF
documents by extracting specific pages as images.

Documents are expected to have metadata containing:

- The `file_path_meta_field` key with a valid file path that exists when combined with `root_path`
- A supported image format (MIME type must be one of the supported image types)
- For PDF files, a `page_number` key specifying which page to extract

### Usage example

```python
from haystack import Document
from haystack.components.converters.image.document_to_image import DocumentToImageContent

converter = DocumentToImageContent(
    file_path_meta_field="file_path",
    root_path="/data/files",
    detail="high",
    size=(800, 600)
)

documents = [
    Document(content="Optional description of image.jpg", meta={"file_path": "image.jpg"}),
    Document(content="Text content of page 1 of doc.pdf", meta={"file_path": "doc.pdf", "page_number": 1})
]

result = converter.run(documents)
image_contents = result["image_contents"]
# [ImageContent(
#     base64_image='/9j/4A...', mime_type='image/jpeg', detail='high', meta={'file_path': 'image.jpg'}
# ),
# ImageContent(
#     base64_image='/9j/4A...', mime_type='image/jpeg', detail='high',
#     meta={'page_number': 1, 'file_path': 'doc.pdf'}
# )]
```

#### __init__

```python
__init__(
    *,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
)
```

Initialize the DocumentToImageContent component.

**Parameters:**

- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(documents: list[Document]) -> dict[str, list[ImageContent | None]]
```

Convert documents with image or PDF sources into ImageContent objects.

This method processes the input documents, extracting images from supported file formats and converting them
into ImageContent objects.
**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to process. Each document should have metadata containing at minimum
  the key specified by `file_path_meta_field`. PDF documents additionally require a `page_number` key to specify which
  page to convert.

**Returns:**

- <code>dict\[str, list\[ImageContent | None\]\]</code> – Dictionary containing one key:
    - "image_contents": ImageContents created from the processed documents. These contain base64-encoded image
      data and metadata. The order corresponds to the order of the input documents.

**Raises:**

- <code>ValueError</code> – If any document is missing the required metadata keys, has an invalid file path, or has an unsupported
  MIME type. The error message will specify which document and what information is missing or incorrect.

## image/file_to_document

### ImageFileToDocument

Converts image file references into empty Document objects with associated metadata.

This component is useful in pipelines where image file paths need to be wrapped in `Document` objects to be
processed by downstream components such as the `SentenceTransformersImageDocumentEmbedder`.

It does **not** extract any content from the image files; instead it creates `Document` objects with `None` as
their content and attaches metadata such as file path and any user-provided values.
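The wrapping described above can be sketched with plain dictionaries standing in for Haystack `Document` objects (a simplified illustration; the real component also merges ByteStream metadata and user-provided `meta`):

```python
from pathlib import Path
from typing import Any


def image_files_to_docs(sources: list[str], store_full_path: bool = False) -> list[dict[str, Any]]:
    """Wrap each image path in a content-less 'document' dict with file metadata."""
    docs = []
    for src in sources:
        # Mirror the store_full_path flag: keep the whole path or just the file name.
        file_path = src if store_full_path else Path(src).name
        docs.append({"content": None, "meta": {"file_path": file_path}})
    return docs


print(image_files_to_docs(["/data/image.jpg"]))
# [{'content': None, 'meta': {'file_path': 'image.jpg'}}]
```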
### Usage example

```python
from haystack.components.converters.image import ImageFileToDocument

converter = ImageFileToDocument()

sources = ["image.jpg", "another_image.png"]

result = converter.run(sources=sources)
documents = result["documents"]

print(documents)

# [Document(id=..., meta: {'file_path': 'image.jpg'}),
#  Document(id=..., meta: {'file_path': 'another_image.png'})]
```

#### __init__

```python
__init__(*, store_full_path: bool = False)
```

Initialize the ImageFileToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    *,
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document]]
```

Convert image files into empty Document objects with metadata.

This method accepts image file references (as file paths or ByteStreams) and creates `Document` objects
without content. These documents are enriched with metadata derived from the input source and optional
user-provided metadata.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they are zipped together.
  For ByteStream objects, their `meta` is added to the output documents.
**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary containing:
    - `documents`: A list of `Document` objects with empty content and associated metadata.

## image/file_to_image

### ImageFileToImageContent

Converts image files to ImageContent objects.

### Usage example

```python
from haystack.components.converters.image import ImageFileToImageContent

converter = ImageFileToImageContent()

sources = ["image.jpg", "another_image.png"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
#               mime_type='image/jpeg',
#               detail=None,
#               meta={'file_path': 'image.jpg'}),
#  ...]
```

#### __init__

```python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
)
```

Create the ImageFileToImageContent component.

**Parameters:**

- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
  If not provided, the detail level will be the one set in the constructor.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
  If not provided, the size value will be the one set in the constructor.

**Returns:**

- <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys:
    - `image_contents`: A list of ImageContent objects.

## image/pdf_to_image

### PDFToImageContent

Converts PDF files to ImageContent objects.
### Usage example

```python
from haystack.components.converters.image import PDFToImageContent

converter = PDFToImageContent()

sources = ["file.pdf", "another_file.pdf"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
#               mime_type='application/pdf',
#               detail=None,
#               meta={'file_path': 'file.pdf', 'page_number': 1}),
#  ...]
```

#### __init__

```python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
)
```

Create the PDFToImageContent component.

**Parameters:**

- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages)
  will be skipped with a warning. For example, page_range=[1, 3] will convert only the first and third
  pages of the document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12']
  will convert pages 1, 2, 3, 5, 8, 10, 11, 12.
#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
  If not provided, the detail level will be the one set in the constructor.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
  If not provided, the size value will be the one set in the constructor.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted.
Pages outside the valid range (1 to number of pages) 844 will be skipped with a warning. For example, page_range=[1, 3] will convert only the first and third 845 pages of the document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] 846 will convert pages 1, 2, 3, 5, 8, 10, 11, 12. 847 If not provided, the page_range value will be the one set in the constructor. 848 849 **Returns:** 850 851 - <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys: 852 - `image_contents`: A list of ImageContent objects. 853 854 ## json 855 856 ### JSONConverter 857 858 Converts one or more JSON files into a text document. 859 860 ### Usage examples 861 862 ```python 863 import json 864 865 from haystack.components.converters import JSONConverter 866 from haystack.dataclasses import ByteStream 867 868 source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"})) 869 870 converter = JSONConverter(content_key="text") 871 results = converter.run(sources=[source]) 872 documents = results["documents"] 873 print(documents[0].content) 874 # 'This is the content of my document' 875 ``` 876 877 Optionally, you can also provide a `jq_schema` string to filter the JSON source files and `extra_meta_fields` 878 to extract from the filtered data: 879 880 ```python 881 import json 882 883 from haystack.components.converters import JSONConverter 884 from haystack.dataclasses import ByteStream 885 886 data = { 887 "laureates": [ 888 { 889 "firstname": "Enrico", 890 "surname": "Fermi", 891 "motivation": "for his demonstrations of the existence of new radioactive elements produced " 892 "by neutron irradiation, and for his related discovery of nuclear reactions brought about by" 893 " slow neutrons", 894 }, 895 { 896 "firstname": "Rita", 897 "surname": "Levi-Montalcini", 898 "motivation": "for their discoveries of growth factors", 899 }, 900 ], 901 } 902 source = ByteStream.from_string(json.dumps(data)) 903 converter 
= JSONConverter( 904 jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"} 905 ) 906 907 results = converter.run(sources=[source]) 908 documents = results["documents"] 909 print(documents[0].content) 910 # 'for his demonstrations of the existence of new radioactive elements produced by 911 # neutron irradiation, and for his related discovery of nuclear reactions brought 912 # about by slow neutrons' 913 914 print(documents[0].meta) 915 # {'firstname': 'Enrico', 'surname': 'Fermi'} 916 917 print(documents[1].content) 918 # 'for their discoveries of growth factors' 919 920 print(documents[1].meta) 921 # {'firstname': 'Rita', 'surname': 'Levi-Montalcini'} 922 ``` 923 924 #### __init__ 925 926 ```python 927 __init__( 928 jq_schema: str | None = None, 929 content_key: str | None = None, 930 extra_meta_fields: set[str] | Literal["*"] | None = None, 931 store_full_path: bool = False, 932 ) 933 ``` 934 935 Creates a JSONConverter component. 936 937 An optional `jq_schema` can be provided to extract nested data in the JSON source files. 938 See the [official jq documentation](https://jqlang.github.io/jq/) for more info on the filters syntax. 939 If `jq_schema` is not set, whole JSON source files will be used to extract content. 940 941 Optionally, you can provide a `content_key` to specify which key in the extracted object must 942 be set as the document's content. 943 944 If both `jq_schema` and `content_key` are set, the component will search for the `content_key` in 945 the JSON object extracted by `jq_schema`. If the extracted data is not a JSON object, it will be skipped. 946 947 If only `jq_schema` is set, the extracted data must be a scalar value. If it's a JSON object or array, 948 it will be skipped. 949 950 If only `content_key` is set, the source JSON file must be a JSON object, else it will be skipped. 951 952 `extra_meta_fields` can either be set to a set of strings or a literal `"*"` string. 
If it's a set of strings, it must specify fields in the extracted objects that must be set in
the extracted documents. If a field is not found, the meta value will be `None`.
If set to `"*"`, all fields that are not `content_key` found in the filtered JSON object will
be saved as metadata.

Initialization will fail if neither `jq_schema` nor `content_key` is set.

**Parameters:**

- **jq_schema** (<code>str | None</code>) – Optional jq filter string to extract content.
  If not specified, the whole JSON object will be used to extract information.
- **content_key** (<code>str | None</code>) – Optional key to extract document content.
  If `jq_schema` is specified, the `content_key` will be extracted from that object.
- **extra_meta_fields** (<code>set\[str\] | Literal['\*'] | None</code>) – An optional set of meta keys to extract from the content.
  If `jq_schema` is specified, all keys will be extracted from that object.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> JSONConverter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>JSONConverter</code> – Deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts a list of JSON files to documents.
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, the length of the list must match the number of sources.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: A list of created documents.

## markdown

### MarkdownToDocument

Converts a Markdown file into a text Document.

Usage example:

```python
from datetime import datetime

from haystack.components.converters import MarkdownToDocument

converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'
```

#### __init__

```python
__init__(
    table_to_single_line: bool = False,
    progress_bar: bool = True,
    store_full_path: bool = False,
)
```

Create a MarkdownToDocument component.

**Parameters:**

- **table_to_single_line** (<code>bool</code>) – If True, converts table contents into a single line.
- **progress_bar** (<code>bool</code>) – If True, shows a progress bar when running.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts a list of Markdown files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: List of created Documents

## msg

### MSGToDocument

Converts Microsoft Outlook .msg files into Haystack Documents.

This component extracts email metadata (such as sender, recipients, CC, BCC, subject) and body content from .msg
files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg
file are extracted as ByteStream objects.

### Example Usage

```python
from datetime import datetime

from haystack.components.converters.msg import MSGToDocument

converter = MSGToDocument()
results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
attachments = results["attachments"]
print(documents[0].content)
```

#### __init__

```python
__init__(store_full_path: bool = False) -> None
```

Creates an MSGToDocument component.
**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document] | list[ByteStream]]
```

Converts MSG files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, list\[Document\] | list\[ByteStream\]\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents.
  - `attachments`: Created ByteStream objects from file attachments.

## multi_file_converter

### MultiFileConverter

A file converter that handles conversion of multiple file types.
The MultiFileConverter handles the following file types:

- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX

Usage example:

```python
from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
```

#### __init__

```python
__init__(encoding: str = 'utf-8', json_content_key: str = 'content') -> None
```

Initialize the MultiFileConverter.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding to use when reading files.
- **json_content_key** (<code>str</code>) – The key to use as the content field in a document when converting JSON files.

## openapi_functions

### OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must respect OpenAPI specification 3.0.0 or higher.
It can be specified in JSON or YAML format.
Each function must have:

- a unique operationId
- a description
- a requestBody and/or parameters
- a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the [official documentation](https://github.com/OAI/OpenAPI-Specification).
For more details on OpenAI function calling, see the [official documentation](https://platform.openai.com/docs/guides/function-calling).

Usage example:

```python
from haystack.components.converters import OpenAPIServiceToFunctions

converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]
```

#### __init__

```python
__init__()
```

Create an OpenAPIServiceToFunctions component.
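For illustration, each entry in the component's `functions` output follows the OpenAI function-calling schema. A sketch with a hypothetical `getWeather` operation (the name and fields here are illustrative, not taken from a real definition):

```python
# Hypothetical example of one converted function definition:
function_definition = {
    "name": "getWeather",  # taken from the operation's unique operationId
    "description": "Get the current weather for a city",
    "parameters": {  # JSON Schema built from the requestBody and/or parameters
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
```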
#### run

```python
run(sources: list[str | Path | ByteStream]) -> dict[str, Any]
```

Converts OpenAPI definitions into OpenAI function calling format.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `functions`: Function definitions in JSON object format
  - `openapi_specs`: OpenAPI specs in JSON/YAML object format with resolved references

**Raises:**

- <code>RuntimeError</code> – If the OpenAPI definitions cannot be downloaded or processed.
- <code>ValueError</code> – If the source type is not recognized or no functions are found in the OpenAPI definitions.

## output_adapter

### OutputAdaptationException

Bases: <code>Exception</code>

Exception raised when there is an error during output adaptation.

### OutputAdapter

Adapts output of a Component using Jinja templates.

Usage example:

```python
from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"
```

#### __init__

```python
__init__(
    template: str,
    output_type: TypeAlias,
    custom_filters: dict[str, Callable] | None = None,
    unsafe: bool = False,
) -> None
```

Create an OutputAdapter component.

**Parameters:**

- **template** (<code>str</code>) – A Jinja template that defines how to adapt the input data.
  The variables in the template define the input of this instance.
  For example, with this template:

  ```
  {{ documents[0].content }}
  ```

  the Component input will be `documents`.

- **output_type** (<code>TypeAlias</code>) – The type of output this instance will return.
- **custom_filters** (<code>dict\[str, Callable\] | None</code>) – A dictionary of custom Jinja filters used in the template.
- **unsafe** (<code>bool</code>) – Enable execution of arbitrary code in the Jinja template.
  This should only be used if you trust the source of the template, as it can lead to remote code execution.

#### run

```python
run(**kwargs)
```

Renders the Jinja template with the provided inputs.

**Parameters:**

- **kwargs** – Must contain all variables used in the `template` string.

**Returns:**

- – A dictionary with the following keys:
  - `output`: Rendered Jinja template.

**Raises:**

- <code>OutputAdaptationException</code> – If template rendering fails.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OutputAdapter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>OutputAdapter</code> – The deserialized component.

## pdfminer

### PDFMinerToDocument

Converts PDF files to Documents.

Uses `pdfminer` compatible converters to convert PDF files to Documents.
https://pdfminersix.readthedocs.io/en/latest/

Usage example:

```python
from datetime import datetime

from haystack.components.converters.pdfminer import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    line_overlap: float = 0.5,
    char_margin: float = 2.0,
    line_margin: float = 0.5,
    word_margin: float = 0.1,
    boxes_flow: float | None = 0.5,
    detect_vertical: bool = True,
    all_texts: bool = False,
    store_full_path: bool = False,
) -> None
```

Create a PDFMinerToDocument component.

**Parameters:**

- **line_overlap** (<code>float</code>) – This parameter determines whether two characters are considered to be on
  the same line based on the amount of overlap between them.
  The overlap is calculated relative to the minimum height of both characters.
- **char_margin** (<code>float</code>) – Determines whether two characters are part of the same line based on the distance between them.
  If the distance is less than the margin specified, the characters are considered to be on the same line.
  The margin is calculated relative to the width of the character.
- **word_margin** (<code>float</code>) – Determines whether two characters on the same line are part of the same word
  based on the distance between them. If the distance is greater than the margin specified,
  an intermediate space will be added between them to make the text more readable.
  The margin is calculated relative to the width of the character.
- **line_margin** (<code>float</code>) – This parameter determines whether two lines are part of the same paragraph based on
  the distance between them.
  If the distance is less than the margin specified,
  the lines are considered to be part of the same paragraph.
  The margin is calculated relative to the height of a line.
- **boxes_flow** (<code>float | None</code>) – This parameter determines the importance of horizontal and vertical position when
  determining the order of text boxes. A value between -1.0 and +1.0 can be set,
  with -1.0 indicating that only horizontal position matters and +1.0 indicating
  that only vertical position matters. Setting the value to `None` will disable advanced
  layout analysis, and text boxes will be ordered based on the position of their bottom left corner.
- **detect_vertical** (<code>bool</code>) – This parameter determines whether vertical text should be considered during layout analysis.
- **all_texts** (<code>bool</code>) – If layout analysis should be performed on text in figures.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### detect_undecoded_cid_characters

```python
detect_undecoded_cid_characters(text: str) -> dict[str, Any]
```

Look for CID character sequences, that is, characters that haven't been properly decoded from their CID format.

This is useful to detect if the text extractor is not able to extract the text correctly, e.g. if the PDF uses
non-standard fonts.

A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like
searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor
needs. If that map is not available, the text extractor cannot decode the CID characters and will return them
as is.
See: https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output

**Parameters:**

- **text** (<code>str</code>) – The text to check for undecoded CID characters.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary containing detection results.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts PDF files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of PDF file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: Created Documents

## pptx

### PPTXToDocument

Converts PPTX files to Documents.

Usage example:

```python
from datetime import datetime

from haystack.components.converters.pptx import PPTXToDocument

converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'
```

#### __init__

```python
__init__(
    store_full_path: bool = False,
    link_format: Literal["markdown", "plain", "none"] = "none",
)
```

Create a PPTXToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
- **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options:
  - `"markdown"`: `[text](url)`
  - `"plain"`: `text (url)`
  - `"none"`: Only the text is extracted, link addresses are ignored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts PPTX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
**Returns:**

- – A dictionary with the following keys:
  - `documents`: Created Documents

## pypdf

### PyPDFExtractionMode

Bases: <code>Enum</code>

The mode to use for extracting text from a PDF.

#### from_str

```python
from_str(string: str) -> PyPDFExtractionMode
```

Convert a string to a PyPDFExtractionMode enum.

### PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses the PyPDF library.
You can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    *,
    extraction_mode: str | PyPDFExtractionMode = PyPDFExtractionMode.PLAIN,
    plain_mode_orientations: tuple = (0, 90, 180, 270),
    plain_mode_space_width: float = 200.0,
    layout_mode_space_vertically: bool = True,
    layout_mode_scale_weight: float = 1.25,
    layout_mode_strip_rotated: bool = True,
    layout_mode_font_height_weight: float = 1.0,
    store_full_path: bool = False
)
```

Create a PyPDFToDocument component.

**Parameters:**

- **extraction_mode** (<code>str | PyPDFExtractionMode</code>) – The mode to use for extracting text from a PDF.
  Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
- **plain_mode_orientations** (<code>tuple</code>) – Tuple of orientations to look for when extracting text from a PDF in plain mode.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- **plain_mode_space_width** (<code>float</code>) – Forces default space width if not extracted from font.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`.
- **layout_mode_space_vertically** (<code>bool</code>) – Whether to include blank lines inferred from y distance + font height.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_scale_weight** (<code>float</code>) – Multiplier for string length when calculating weighted average character width.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_strip_rotated** (<code>bool</code>) – Layout mode does not support rotated text. Set to `False` to include rotated text anyway.
  If rotated text is discovered, layout will be degraded and a warning will be logged.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **layout_mode_font_height_weight** (<code>float</code>) – Multiplier for font height when calculating blank line height.
  Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict()
```

Serializes the component to a dictionary.

**Returns:**

- – Dictionary with serialized data.

#### from_dict

```python
from_dict(data)
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** – Dictionary with serialized data.

**Returns:**

- – Deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts PDF files to documents.
**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they are zipped together.
  For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: A list of converted documents.

## tika

### XHTMLParser

Bases: <code>HTMLParser</code>

Custom parser to extract pages from Tika XHTML content.

#### handle_starttag

```python
handle_starttag(tag: str, attrs: list[tuple])
```

Identify the start of a page div.

#### handle_endtag

```python
handle_endtag(tag: str)
```

Identify the end of a page div.

#### handle_data

```python
handle_data(data: str)
```

Populate the page content.

### TikaDocumentConverter

Converts files of different types to Documents.

This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
requires a running Tika server.
For more options on running Tika,
see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).
Usage example:

```python
from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'
```

#### __init__

```python
__init__(
    tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False
)
```

Create a TikaDocumentConverter component.

**Parameters:**

- **tika_url** (<code>str</code>) – Tika server URL.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
**Returns:**

- – A dictionary with the following keys:
  - `documents`: Created Documents

## txt

### TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'
```

#### __init__

```python
__init__(encoding: str = 'utf-8', store_full_path: bool = False)
```

Creates a TextFileToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the text files to convert.
  If the encoding is specified in the metadata of a source ByteStream,
  it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)
```

Converts text files to documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of text file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they are zipped together.
  For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: A list of converted documents.

## xlsx

### XLSXToDocument

Converts XLSX (Excel) files into Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is
created for each sheet. The content of the Document is the table, which can be saved in CSV or Markdown format.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.xlsx import XLSXToDocument

converter = XLSXToDocument()
results = converter.run(sources=["sample.xlsx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# ",A,B
# 1,col_a,col_b
# 2,1.5,test
# "
```

#### __init__

```python
__init__(
    table_format: Literal["csv", "markdown"] = "csv",
    sheet_name: str | int | list[str | int] | None = None,
    read_excel_kwargs: dict[str, Any] | None = None,
    table_format_kwargs: dict[str, Any] | None = None,
    *,
    link_format: Literal["markdown", "plain", "none"] = "none",
    store_full_path: bool = False
)
```

Creates an XLSXToDocument component.

**Parameters:**

- **table_format** (<code>Literal['csv', 'markdown']</code>) – The format to convert the Excel file to.
- **sheet_name** (<code>str | int | list\[str | int\] | None</code>) – The name of the sheet to read. If None, all sheets are read.
- **read_excel_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional arguments to pass to `pandas.read_excel`.
  See https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel
- **table_format_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to pass to the table format function.
  - If `table_format` is "csv", these arguments are passed to `pandas.DataFrame.to_csv`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
  - If `table_format` is "markdown", these arguments are passed to `pandas.DataFrame.to_markdown`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
- **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options:
  - `"markdown"`: `[text](url)`
  - `"plain"`: `text (url)`
  - `"none"`: Only the text is extracted; link addresses are ignored.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts an XLSX file to a Document.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: Created documents
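All of the `run` methods above apply the same rule to `meta`: a single dictionary is copied onto every produced document, while a list is zipped one-to-one with `sources`. The following plain-Python sketch illustrates that pairing rule only — `pair_meta` is a hypothetical helper, not part of the Haystack API:

```python
def pair_meta(sources, meta):
    """Sketch of the shared meta-handling rule described above."""
    if meta is None:
        # No metadata: each source gets an empty dict.
        return [{} for _ in sources]
    if isinstance(meta, dict):
        # A single dict applies to every source.
        return [dict(meta) for _ in sources]
    # A list must line up with the sources, element by element.
    if len(meta) != len(sources):
        raise ValueError("The length of the metadata list must match the number of sources.")
    return [dict(m) for m in meta]


sources = ["a.txt", "b.txt", "c.txt"]  # hypothetical file names
print(pair_meta(sources, {"lang": "en"}))
# [{'lang': 'en'}, {'lang': 'en'}, {'lang': 'en'}]
print(pair_meta(sources, [{"id": 1}, {"id": 2}, {"id": 3}]))
# [{'id': 1}, {'id': 2}, {'id': 3}]
```

This is why a metadata list of the wrong length raises an error: there would be no unambiguous way to assign the leftover entries.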