---
title: "Converters"
id: converters-api
description: "Various converters to transform data from one format to another."
slug: "/converters-api"
---

## azure

### AzureOCRDocumentConverter

Converts files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account
and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see
[Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).

### Usage example

```python
import os
from datetime import datetime
from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(
    endpoint=os.environ["CORE_AZURE_CS_ENDPOINT"],
    api_key=Secret.from_env_var("CORE_AZURE_CS_API_KEY"),
)
results = converter.run(
    sources=["test/test_files/pdf/react_paper.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    endpoint: str,
    api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
    model_id: str = "prebuilt-read",
    preceding_context_len: int = 3,
    following_context_len: int = 3,
    merge_multiple_column_headers: bool = True,
    page_layout: Literal["natural", "single_column"] = "natural",
    threshold_y: float | None = 0.05,
    store_full_path: bool = False,
) -> None
```

Creates an AzureOCRDocumentConverter component.

**Parameters:**

- **endpoint** (<code>str</code>) – The endpoint of your Azure resource.
- **api_key** (<code>Secret</code>) – The API key of your Azure resource.
- **model_id** (<code>str</code>) – The ID of the model you want to use. For a list of available models, see the
  [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
- **preceding_context_len** (<code>int</code>) – Number of lines before a table to include as preceding context
  (this will be added to the metadata).
- **following_context_len** (<code>int</code>) – Number of lines after a table to include as subsequent context
  (this will be added to the metadata).
- **merge_multiple_column_headers** (<code>bool</code>) – If `True`, merges multiple column header rows into a single row.
- **page_layout** (<code>Literal['natural', 'single_column']</code>) – The type of reading order to follow. Possible options:
  - `natural`: Uses the natural reading order determined by Azure.
  - `single_column`: Groups all lines with the same height on the page based on a threshold
    determined by `threshold_y`.
- **threshold_y** (<code>float | None</code>) – Only relevant if `page_layout` is set to `single_column`.
  The threshold, in inches, to determine if two recognized PDF elements are grouped into a
  single line. This is crucial for section headers or numbers which may be spatially separated
  from the remaining text on the horizontal axis.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Convert a list of files to Documents using Azure's Document Intelligence service.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will be
  zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: List of created Documents
    - `raw_azure_response`: List of raw Azure responses used to create the Documents

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOCRDocumentConverter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>AzureOCRDocumentConverter</code> – The deserialized component.

## csv

### CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime
from haystack.components.converters.csv import CSVToDocument

converter = CSVToDocument()
results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'col1,col2\nrow1,row1\nrow2,row2\n'
```

#### __init__

```python
__init__(
    encoding: str = "utf-8",
    store_full_path: bool = False,
    *,
    conversion_mode: Literal["file", "row"] = "file",
    delimiter: str = ",",
    quotechar: str = '"'
) -> None
```

Creates a CSVToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the CSV files to convert.
  If the encoding is specified in the metadata of a source ByteStream,
  it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
- **conversion_mode** (<code>Literal['file', 'row']</code>) –
  - "file" (default): one Document per CSV file whose content is the raw CSV text.
  - "row": convert each CSV row to its own Document (requires `content_column` in `run()`).
- **delimiter** (<code>str</code>) – CSV delimiter used when parsing in row mode (passed to `csv.DictReader`).
- **quotechar** (<code>str</code>) – CSV quote character used when parsing in row mode (passed to `csv.DictReader`).

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    content_column: str | None = None,
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, Any]
```

Converts CSV files to a Document (file mode) or to one Document per row (row mode).
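In row mode, each file is parsed with `csv.DictReader` and each row's Document content comes from the column named by `content_column`. A minimal, stdlib-only sketch of that behavior (the sample CSV and column names here are illustrative assumptions, not the component's actual code):

```python
import csv
import io

# Hedged sketch of "row" conversion mode: each CSV row becomes one
# document whose content is taken from content_column ("body" here).
raw_csv = "title,body\nFirst,Hello\nSecond,World\n"
reader = csv.DictReader(io.StringIO(raw_csv), delimiter=",", quotechar='"')
rows = list(reader)

content_column = "body"
contents = [row[content_column] for row in rows]
print(contents)
# ['Hello', 'World']

# The remaining columns would typically end up in the document metadata.
metas = [{k: v for k, v in row.items() if k != content_column} for row in rows]
print(metas)
# [{'title': 'First'}, {'title': 'Second'}]
```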

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **content_column** (<code>str | None</code>) – **Required when** `conversion_mode="row"`.
  The column name whose values become `Document.content` for each row.
  The column must exist in the CSV header.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created documents

## docx

### DOCXMetadata

Describes the metadata of a DOCX file.

**Parameters:**

- **author** (<code>str</code>) – The author
- **category** (<code>str</code>) – The category
- **comments** (<code>str</code>) – The comments
- **content_status** (<code>str</code>) – The content status
- **created** (<code>str | None</code>) – The creation date (ISO formatted string)
- **identifier** (<code>str</code>) – The identifier
- **keywords** (<code>str</code>) – Available keywords
- **language** (<code>str</code>) – The language of the document
- **last_modified_by** (<code>str</code>) – User who last modified the document
- **last_printed** (<code>str | None</code>) – The last printed date (ISO formatted string)
- **modified** (<code>str | None</code>) – The last modification date (ISO formatted string)
- **revision** (<code>int</code>) – The revision number
- **subject** (<code>str</code>) – The subject
- **title** (<code>str</code>) – The title
- **version** (<code>str</code>) – The version

### DOCXTableFormat

Bases: <code>Enum</code>

Supported formats for storing DOCX tabular data in a Document.

#### from_str

```python
from_str(string: str) -> DOCXTableFormat
```

Convert a string to a DOCXTableFormat enum.

### DOCXLinkFormat

Bases: <code>Enum</code>

Supported formats for storing DOCX link information in a Document.

#### from_str

```python
from_str(string: str) -> DOCXLinkFormat
```

Convert a string to a DOCXLinkFormat enum.

### DOCXToDocument

Converts DOCX files to Documents.

Uses the `python-docx` library to convert the DOCX file to a document.
This component does not preserve page breaks in the original document.

Usage example:

```python
from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'
```

#### __init__

```python
__init__(
    table_format: str | DOCXTableFormat = DOCXTableFormat.CSV,
    link_format: str | DOCXLinkFormat = DOCXLinkFormat.NONE,
    store_full_path: bool = False,
) -> None
```

Create a DOCXToDocument component.

**Parameters:**

- **table_format** (<code>str | DOCXTableFormat</code>) – The format for table output. Can be either DOCXTableFormat.MARKDOWN,
  DOCXTableFormat.CSV, "markdown", or "csv".
- **link_format** (<code>str | DOCXLinkFormat</code>) – The format for link output. Can be either:
  DOCXLinkFormat.MARKDOWN or "markdown" to get `[text](address)`,
  DOCXLinkFormat.PLAIN or "plain" to get `text (address)`,
  DOCXLinkFormat.NONE or "none" to get text without links.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DOCXToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>DOCXToDocument</code> – The deserialized component.
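The `to_dict`/`from_dict` pair above round-trips the component's configuration. A stdlib-only sketch of that pattern (the `type`/`init_parameters` field names are assumptions for illustration, not the exact output of `DOCXToDocument.to_dict`):

```python
# Hedged sketch of the to_dict/from_dict round trip used by
# serializable components; key names here are illustrative.
def to_dict(component_type: str, init_parameters: dict) -> dict:
    return {"type": component_type, "init_parameters": dict(init_parameters)}

def from_dict(data: dict) -> dict:
    # A real from_dict re-instantiates the component class;
    # here we just recover the constructor arguments.
    return dict(data["init_parameters"])

data = to_dict("DOCXToDocument", {"table_format": "csv", "store_full_path": False})
params = from_dict(data)
print(params["table_format"])
# csv
```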

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts DOCX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created Documents

## file_to_file_content

### FileToFileContent

Converts files to FileContent objects to be included in ChatMessage objects.

### Usage example

```python
from haystack.components.converters import FileToFileContent

converter = FileToFileContent()

sources = ["document.pdf", "video.mp4"]

file_contents = converter.run(sources=sources)["file_contents"]
print(file_contents)

# [FileContent(base64_data='...',
#              mime_type='application/pdf',
#              filename='document.pdf',
#              extra={}),
#  ...]
```

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    *,
    extra: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[FileContent]]
```

Converts files to FileContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **extra** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional extra information to attach to the FileContent
  objects. Can be used to store provider-specific information.
  To avoid serialization issues, values should be JSON serializable.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the extra of all produced FileContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.

**Returns:**

- <code>dict\[str, list\[FileContent\]\]</code> – A dictionary with the following keys:
    - `file_contents`: A list of FileContent objects.

## html

### HTMLToDocument

Converts an HTML file to a Document.

Usage example:

```python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'
```

#### __init__

```python
__init__(
    extraction_kwargs: dict[str, Any] | None = None,
    store_full_path: bool = False,
) -> None
```

Create an HTMLToDocument component.

**Parameters:**

- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary containing keyword arguments to customize the extraction process. These
  are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see
  the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HTMLToDocument
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>HTMLToDocument</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    extraction_kwargs: dict[str, Any] | None = None,
) -> dict[str, Any]
```

Converts a list of HTML files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of HTML file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
- **extraction_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to customize the extraction process.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: Created Documents

## image/document_to_image

### DocumentToImageContent

Converts documents sourced from PDF and image files into ImageContents.

This component processes a list of documents and extracts visual content from supported file formats, converting
them into ImageContents that can be used for multimodal AI tasks. It handles both direct image files and PDF
documents by extracting specific pages as images.

Documents are expected to have metadata containing:

- The `file_path_meta_field` key with a valid file path that exists when combined with `root_path`
- A supported image format (MIME type must be one of the supported image types)
- For PDF files, a `page_number` key specifying which page to extract

### Usage example

```python
from haystack import Document
from haystack.components.converters.image.document_to_image import DocumentToImageContent

converter = DocumentToImageContent(
    file_path_meta_field="file_path",
    root_path="/data/files",
    detail="high",
    size=(800, 600)
)

documents = [
    Document(content="Optional description of image.jpg", meta={"file_path": "image.jpg"}),
    Document(content="Text content of page 1 of doc.pdf", meta={"file_path": "doc.pdf", "page_number": 1})
]

result = converter.run(documents)
image_contents = result["image_contents"]
# [ImageContent(
#      base64_image='/9j/4A...', mime_type='image/jpeg', detail='high', meta={'file_path': 'image.jpg'}
#  ),
#  ImageContent(
#      base64_image='/9j/4A...', mime_type='image/jpeg', detail='high',
#      meta={'page_number': 1, 'file_path': 'doc.pdf'}
#  )]
```

#### __init__

```python
__init__(
    *,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> None
```

Initialize the DocumentToImageContent component.

**Parameters:**

- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(documents: list[Document]) -> dict[str, list[ImageContent | None]]
```

Convert documents with image or PDF sources into ImageContent objects.

This method processes the input documents, extracting images from supported file formats and converting them
into ImageContent objects.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to process. Each document should have metadata containing at minimum
  the `file_path_meta_field` key. PDF documents additionally require a `page_number` key to specify which
  page to convert.
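The metadata requirements above can be summarized in a small stdlib sketch (the checks and error messages are illustrative assumptions, not the component's actual code):

```python
def validate_doc_meta(meta: dict, file_path_meta_field: str = "file_path") -> str:
    # Every document needs the configured file-path field.
    if file_path_meta_field not in meta:
        raise ValueError(f"missing '{file_path_meta_field}' in document metadata")
    path = meta[file_path_meta_field]
    # PDFs additionally need a page_number to know which page to render.
    if path.lower().endswith(".pdf") and "page_number" not in meta:
        raise ValueError("PDF documents require a 'page_number' metadata key")
    return path

print(validate_doc_meta({"file_path": "doc.pdf", "page_number": 1}))
# doc.pdf
```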

**Returns:**

- <code>dict\[str, list\[ImageContent | None\]\]</code> – Dictionary containing one key:
    - `image_contents`: ImageContents created from the processed documents. These contain base64-encoded image
      data and metadata. The order corresponds to the order of the input documents.

**Raises:**

- <code>ValueError</code> – If any document is missing the required metadata keys, has an invalid file path, or has an unsupported
  MIME type. The error message will specify which document and what information is missing or incorrect.

## image/file_to_document

### ImageFileToDocument

Converts image file references into empty Document objects with associated metadata.

This component is useful in pipelines where image file paths need to be wrapped in `Document` objects to be
processed by downstream components such as the `SentenceTransformersImageDocumentEmbedder`.

It does **not** extract any content from the image files; instead, it creates `Document` objects with `None` as
their content and attaches metadata such as file path and any user-provided values.

### Usage example

```python
from haystack.components.converters.image import ImageFileToDocument

converter = ImageFileToDocument()

sources = ["image.jpg", "another_image.png"]

result = converter.run(sources=sources)
documents = result["documents"]

print(documents)

# [Document(id=..., meta: {'file_path': 'image.jpg'}),
#  Document(id=..., meta: {'file_path': 'another_image.png'})]
```

#### __init__

```python
__init__(*, store_full_path: bool = False) -> None
```

Initialize the ImageFileToDocument component.

**Parameters:**

- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
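A small stdlib sketch of the `store_full_path` behavior described above (illustrative only; the real component attaches more metadata than this):

```python
import os

def file_path_meta(path: str, store_full_path: bool = False) -> dict:
    # With the default store_full_path=False, only the file name is kept.
    return {"file_path": path if store_full_path else os.path.basename(path)}

print(file_path_meta("/data/images/cat.jpg"))
# {'file_path': 'cat.jpg'}
print(file_path_meta("/data/images/cat.jpg", store_full_path=True))
# {'file_path': '/data/images/cat.jpg'}
```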

#### run

```python
run(
    *,
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document]]
```

Convert image files into empty Document objects with metadata.

This method accepts image file references (as file paths or ByteStreams) and creates `Document` objects
without content. These documents are enriched with metadata derived from the input source and optional
user-provided metadata.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced documents.
  If it's a list, its length must match the number of sources, as they are zipped together.
  For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary containing:
    - `documents`: A list of `Document` objects with empty content and associated metadata.

## image/file_to_image

### ImageFileToImageContent

Converts image files to ImageContent objects.

### Usage example

```python
from haystack.components.converters.image import ImageFileToImageContent

converter = ImageFileToImageContent()

sources = ["image.jpg", "another_image.png"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
#               mime_type='image/jpeg',
#               detail=None,
#               meta={'file_path': 'image.jpg'}),
#  ...]
```

#### __init__

```python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> None
```

Create the ImageFileToImageContent component.

**Parameters:**

- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI).
One of "auto", "high", or "low". 750 This will be passed to the created ImageContent objects. 751 If not provided, the detail level will be the one set in the constructor. 752 - **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while 753 maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial 754 when working with models that have resolution constraints or when transmitting images to remote services. 755 If not provided, the size value will be the one set in the constructor. 756 757 **Returns:** 758 759 - <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys: 760 - `image_contents`: A list of ImageContent objects. 761 762 ## image/pdf_to_image 763 764 ### PDFToImageContent 765 766 Converts PDF files to ImageContent objects. 767 768 ### Usage example 769 770 ```python 771 from haystack.components.converters.image import PDFToImageContent 772 773 converter = PDFToImageContent() 774 775 sources = ["file.pdf", "another_file.pdf"] 776 777 image_contents = converter.run(sources=sources)["image_contents"] 778 print(image_contents) 779 780 # [ImageContent(base64_image='...', 781 # mime_type='application/pdf', 782 # detail=None, 783 # meta={'file_path': 'file.pdf', 'page_number': 1}), 784 # ...] 785 ``` 786 787 #### __init__ 788 789 ```python 790 __init__( 791 *, 792 detail: Literal["auto", "high", "low"] | None = None, 793 size: tuple[int, int] | None = None, 794 page_range: list[str | int] | None = None 795 ) -> None 796 ``` 797 798 Create the PDFToImageContent component. 799 800 **Parameters:** 801 802 - **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low". 803 This will be passed to the created ImageContent objects. 
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages)
  will be skipped with a warning. For example, `page_range=[1, 3]` will convert only the first and third
  pages of the document. It also accepts printable range strings, for example, `['1-3', '5', '8', '10-12']`
  will convert pages 1, 2, 3, 5, 8, 10, 11, and 12.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
) -> dict[str, list[ImageContent]]
```

Converts files to ImageContent objects.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the ImageContent objects.
  This value can be a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects.
  If it's a list, its length must match the number of sources as they're zipped together.
  For ByteStream objects, their `meta` is added to the output ImageContent objects.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
  This will be passed to the created ImageContent objects.
  If not provided, the detail level will be the one set in the constructor.
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within the specified dimensions (width, height) while
  maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
  when working with models that have resolution constraints or when transmitting images to remote services.
  If not provided, the size value will be the one set in the constructor.
- **page_range** (<code>list\[str | int\] | None</code>) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1.
  If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages)
  will be skipped with a warning. For example, `page_range=[1, 3]` will convert only the first and third
  pages of the document. It also accepts printable range strings, for example, `['1-3', '5', '8', '10-12']`
  will convert pages 1, 2, 3, 5, 8, 10, 11, and 12.
  If not provided, the page_range value will be the one set in the constructor.

**Returns:**

- <code>dict\[str, list\[ImageContent\]\]</code> – A dictionary with the following keys:
    - `image_contents`: A list of ImageContent objects.

## json

### JSONConverter

Converts one or more JSON files into a text document.
860 861 ### Usage examples 862 863 ```python 864 import json 865 866 from haystack.components.converters import JSONConverter 867 from haystack.dataclasses import ByteStream 868 869 source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"})) 870 871 converter = JSONConverter(content_key="text") 872 results = converter.run(sources=[source]) 873 documents = results["documents"] 874 print(documents[0].content) 875 # 'This is the content of my document' 876 ``` 877 878 Optionally, you can also provide a `jq_schema` string to filter the JSON source files and `extra_meta_fields` 879 to extract from the filtered data: 880 881 ```python 882 import json 883 884 from haystack.components.converters import JSONConverter 885 from haystack.dataclasses import ByteStream 886 887 data = { 888 "laureates": [ 889 { 890 "firstname": "Enrico", 891 "surname": "Fermi", 892 "motivation": "for his demonstrations of the existence of new radioactive elements produced " 893 "by neutron irradiation, and for his related discovery of nuclear reactions brought about by" 894 " slow neutrons", 895 }, 896 { 897 "firstname": "Rita", 898 "surname": "Levi-Montalcini", 899 "motivation": "for their discoveries of growth factors", 900 }, 901 ], 902 } 903 source = ByteStream.from_string(json.dumps(data)) 904 converter = JSONConverter( 905 jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"} 906 ) 907 908 results = converter.run(sources=[source]) 909 documents = results["documents"] 910 print(documents[0].content) 911 # 'for his demonstrations of the existence of new radioactive elements produced by 912 # neutron irradiation, and for his related discovery of nuclear reactions brought 913 # about by slow neutrons' 914 915 print(documents[0].meta) 916 # {'firstname': 'Enrico', 'surname': 'Fermi'} 917 918 print(documents[1].content) 919 # 'for their discoveries of growth factors' 920 921 print(documents[1].meta) 922 # {'firstname': 
'Rita', 'surname': 'Levi-Montalcini'} 923 ``` 924 925 #### __init__ 926 927 ```python 928 __init__( 929 jq_schema: str | None = None, 930 content_key: str | None = None, 931 extra_meta_fields: set[str] | Literal["*"] | None = None, 932 store_full_path: bool = False, 933 ) -> None 934 ``` 935 936 Creates a JSONConverter component. 937 938 An optional `jq_schema` can be provided to extract nested data in the JSON source files. 939 See the [official jq documentation](https://jqlang.github.io/jq/) for more info on the filters syntax. 940 If `jq_schema` is not set, whole JSON source files will be used to extract content. 941 942 Optionally, you can provide a `content_key` to specify which key in the extracted object must 943 be set as the document's content. 944 945 If both `jq_schema` and `content_key` are set, the component will search for the `content_key` in 946 the JSON object extracted by `jq_schema`. If the extracted data is not a JSON object, it will be skipped. 947 948 If only `jq_schema` is set, the extracted data must be a scalar value. If it's a JSON object or array, 949 it will be skipped. 950 951 If only `content_key` is set, the source JSON file must be a JSON object, else it will be skipped. 952 953 `extra_meta_fields` can either be set to a set of strings or a literal `"*"` string. 954 If it's a set of strings, it must specify fields in the extracted objects that must be set in 955 the extracted documents. If a field is not found, the meta value will be `None`. 956 If set to `"*"`, all fields that are not `content_key` found in the filtered JSON object will 957 be saved as metadata. 958 959 Initialization will fail if neither `jq_schema` nor `content_key` are set. 960 961 **Parameters:** 962 963 - **jq_schema** (<code>str | None</code>) – Optional jq filter string to extract content. 964 If not specified, whole JSON object will be used to extract information. 965 - **content_key** (<code>str | None</code>) – Optional key to extract document content. 
966 If `jq_schema` is specified, the `content_key` will be extracted from that object. 967 - **extra_meta_fields** (<code>set\[str\] | Literal['\*'] | None</code>) – An optional set of meta keys to extract from the content. 968 If `jq_schema` is specified, all keys will be extracted from that object. 969 - **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document. 970 If False, only the file name is stored. 971 972 #### to_dict 973 974 ```python 975 to_dict() -> dict[str, Any] 976 ``` 977 978 Serializes the component to a dictionary. 979 980 **Returns:** 981 982 - <code>dict\[str, Any\]</code> – Dictionary with serialized data. 983 984 #### from_dict 985 986 ```python 987 from_dict(data: dict[str, Any]) -> JSONConverter 988 ``` 989 990 Deserializes the component from a dictionary. 991 992 **Parameters:** 993 994 - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from. 995 996 **Returns:** 997 998 - <code>JSONConverter</code> – Deserialized component. 999 1000 #### run 1001 1002 ```python 1003 run( 1004 sources: list[str | Path | ByteStream], 1005 meta: dict[str, Any] | list[dict[str, Any]] | None = None, 1006 ) -> dict[str, Any] 1007 ``` 1008 1009 Converts a list of JSON files to documents. 1010 1011 **Parameters:** 1012 1013 - **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of file paths or ByteStream objects. 1014 - **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents. 1015 This value can be either a list of dictionaries or a single dictionary. 1016 If it's a single dictionary, its content is added to the metadata of all produced documents. 1017 If it's a list, the length of the list must match the number of sources. 1018 If `sources` contain ByteStream objects, their `meta` will be added to the output documents. 
1019 1020 **Returns:** 1021 1022 - <code>dict\[str, Any\]</code> – A dictionary with the following keys: 1023 - `documents`: A list of created documents. 1024 1025 ## markdown 1026 1027 ### MarkdownToDocument 1028 1029 Converts a Markdown file into a text Document. 1030 1031 Usage example: 1032 1033 ```python 1034 from haystack.components.converters import MarkdownToDocument 1035 from datetime import datetime 1036 1037 converter = MarkdownToDocument() 1038 results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()}) 1039 documents = results["documents"] 1040 print(documents[0].content) 1041 # 'This is a text from the markdown file.' 1042 ``` 1043 1044 #### __init__ 1045 1046 ```python 1047 __init__( 1048 table_to_single_line: bool = False, 1049 progress_bar: bool = True, 1050 store_full_path: bool = False, 1051 ) -> None 1052 ``` 1053 1054 Create a MarkdownToDocument component. 1055 1056 **Parameters:** 1057 1058 - **table_to_single_line** (<code>bool</code>) – If True converts table contents into a single line. 1059 - **progress_bar** (<code>bool</code>) – If True shows a progress bar when running. 1060 - **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document. 1061 If False, only the file name is stored. 1062 1063 #### run 1064 1065 ```python 1066 run( 1067 sources: list[str | Path | ByteStream], 1068 meta: dict[str, Any] | list[dict[str, Any]] | None = None, 1069 ) -> dict[str, Any] 1070 ``` 1071 1072 Converts a list of Markdown files to Documents. 1073 1074 **Parameters:** 1075 1076 - **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects. 1077 - **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents. 1078 This value can be either a list of dictionaries or a single dictionary. 
1079 If it's a single dictionary, its content is added to the metadata of all produced Documents. 1080 If it's a list, the length of the list must match the number of sources, because the two lists will 1081 be zipped. 1082 If `sources` contains ByteStream objects, their `meta` will be added to the output Documents. 1083 1084 **Returns:** 1085 1086 - <code>dict\[str, Any\]</code> – A dictionary with the following keys: 1087 - `documents`: List of created Documents 1088 1089 ## msg 1090 1091 ### MSGToDocument 1092 1093 Converts Microsoft Outlook .msg files into Haystack Documents. 1094 1095 This component extracts email metadata (such as sender, recipients, CC, BCC, subject) and body content from .msg 1096 files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg 1097 file are extracted as ByteStream objects. 1098 1099 ### Example Usage 1100 1101 ```python 1102 from haystack.components.converters.msg import MSGToDocument 1103 from datetime import datetime 1104 1105 converter = MSGToDocument() 1106 results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()}) 1107 documents = results["documents"] 1108 attachments = results["attachments"] 1109 print(documents[0].content) 1110 ``` 1111 1112 #### __init__ 1113 1114 ```python 1115 __init__(store_full_path: bool = False) -> None 1116 ``` 1117 1118 Creates a MSGToDocument component. 1119 1120 **Parameters:** 1121 1122 - **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document. 1123 If False, only the file name is stored. 1124 1125 #### run 1126 1127 ```python 1128 run( 1129 sources: list[str | Path | ByteStream], 1130 meta: dict[str, Any] | list[dict[str, Any]] | None = None, 1131 ) -> dict[str, list[Document] | list[ByteStream]] 1132 ``` 1133 1134 Converts MSG files to Documents. 

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, list\[Document\] | list\[ByteStream\]\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents.
  - `attachments`: Created ByteStream objects from file attachments.

## multi_file_converter

### MultiFileConverter

A file converter that handles conversion of multiple file types.

The MultiFileConverter handles the following file types:

- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX

Usage example:

```python
from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
```

#### __init__

```python
__init__(encoding: str = 'utf-8', json_content_key: str = 'content') -> None
```

Initialize the MultiFileConverter.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding to use when reading files.
- **json_content_key** (<code>str</code>) – The key to use for the content field of a Document when converting JSON files.
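The component routes each source to the converter that matches its file type and merges the results. As a rough illustration of that dispatch idea, here is a minimal pure-Python sketch; the converter names in the mapping are stand-ins for the dedicated components documented on this page, and the real component routes on MIME type rather than raw extensions:

```python
# Illustrative sketch of extension-based dispatch, not the actual
# MultiFileConverter implementation.
from pathlib import Path

# Hypothetical mapping from file extension to the converter that handles it.
CONVERTER_BY_EXTENSION = {
    ".txt": "TextFileToDocument",
    ".md": "MarkdownToDocument",
    ".pdf": "PyPDFToDocument",
    ".json": "JSONConverter",
    ".docx": "DOCXToDocument",
}

def route(sources):
    """Group sources by the converter that would handle them."""
    routed: dict[str, list[str]] = {}
    for source in sources:
        converter = CONVERTER_BY_EXTENSION.get(Path(source).suffix.lower())
        if converter is not None:
            routed.setdefault(converter, []).append(source)
    return routed

print(route(["test.txt", "test.pdf"]))
# {'TextFileToDocument': ['test.txt'], 'PyPDFToDocument': ['test.pdf']}
```

Internally, the component wires the dedicated converters together in a pipeline and joins their outputs into a single list of Documents.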

## openapi_functions

### OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must respect OpenAPI specification 3.0.0 or higher.
It can be specified in JSON or YAML format.
Each function must have:

- a unique operationId
- a description
- a requestBody and/or parameters
- a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the [official documentation](https://github.com/OAI/OpenAPI-Specification).
For more details on OpenAI function calling, see the [official documentation](https://platform.openai.com/docs/guides/function-calling).

Usage example:

```python
from haystack.components.converters import OpenAPIServiceToFunctions
from haystack.dataclasses.byte_stream import ByteStream

converter = OpenAPIServiceToFunctions()
spec = ByteStream.from_string(
    '{"openapi":"3.0.0","info":{"title":"API","version":"1.0.0"},"paths":{"/search":{"get":{"operationId":"search","summary":"Search","parameters":[{"name":"q","in":"query","required":true,"schema":{"type":"string"}}]}}}}'
)
result = converter.run(sources=[spec])
assert result["functions"]
```

#### __init__

```python
__init__() -> None
```

Create an OpenAPIServiceToFunctions component.

#### run

```python
run(sources: list[str | Path | ByteStream]) -> dict[str, Any]
```

Converts OpenAPI definitions into OpenAI function calling format.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - functions: Function definitions in JSON object format
  - openapi_specs: OpenAPI specs in JSON/YAML object format with resolved references

**Raises:**

- <code>RuntimeError</code> – If the OpenAPI definitions cannot be downloaded or processed.
- <code>ValueError</code> – If the source type is not recognized or no functions are found in the OpenAPI definitions.

## output_adapter

### OutputAdaptationException

Bases: <code>Exception</code>

Exception raised when there is an error during output adaptation.

### OutputAdapter

Adapts output of a Component using Jinja templates.

Usage example:

```python
from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"
```

#### __init__

```python
__init__(
    template: str,
    output_type: TypeAlias,
    custom_filters: dict[str, Callable] | None = None,
    unsafe: bool = False,
) -> None
```

Create an OutputAdapter component.

**Parameters:**

- **template** (<code>str</code>) – A Jinja template that defines how to adapt the input data.
The variables in the template define the input of this instance.
For example, with this template:

```
{{ documents[0].content }}
```

the component input will be `documents`.

- **output_type** (<code>TypeAlias</code>) – The type of output this instance will return.
- **custom_filters** (<code>dict\[str, Callable\] | None</code>) – A dictionary of custom Jinja filters used in the template.
- **unsafe** (<code>bool</code>) – Enable execution of arbitrary code in the Jinja template.
This should only be used if you trust the source of the template, as it can lead to remote code execution.

#### run

```python
run(**kwargs: Any) -> dict[str, Any]
```

Renders the Jinja template with the provided inputs.

**Parameters:**

- **kwargs** (<code>Any</code>) – Must contain all variables used in the `template` string.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `output`: Rendered Jinja template.

**Raises:**

- <code>OutputAdaptationException</code> – If template rendering fails.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OutputAdapter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>OutputAdapter</code> – The deserialized component.

## pdfminer

### PDFMinerToDocument

Converts PDF files to Documents.

Uses `pdfminer` compatible converters to convert PDF files to Documents.
For more information, see the [pdfminer documentation](https://pdfminersix.readthedocs.io/en/latest/).

Usage example:

```python
from datetime import datetime

from haystack.components.converters.pdfminer import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    line_overlap: float = 0.5,
    char_margin: float = 2.0,
    line_margin: float = 0.5,
    word_margin: float = 0.1,
    boxes_flow: float | None = 0.5,
    detect_vertical: bool = True,
    all_texts: bool = False,
    store_full_path: bool = False,
) -> None
```

Create a PDFMinerToDocument component.

**Parameters:**

- **line_overlap** (<code>float</code>) – This parameter determines whether two characters are considered to be on
the same line based on the amount of overlap between them.
The overlap is calculated relative to the minimum height of both characters.
- **char_margin** (<code>float</code>) – Determines whether two characters are part of the same line based on the distance between them.
If the distance is less than the margin specified, the characters are considered to be on the same line.
The margin is calculated relative to the width of the character.
- **word_margin** (<code>float</code>) – Determines whether two characters on the same line are part of the same word
based on the distance between them. If the distance is greater than the margin specified,
an intermediate space will be added between them to make the text more readable.
The margin is calculated relative to the width of the character.
- **line_margin** (<code>float</code>) – This parameter determines whether two lines are part of the same paragraph based on
the distance between them.
If the distance is less than the margin specified,
the lines are considered to be part of the same paragraph.
The margin is calculated relative to the height of a line.
- **boxes_flow** (<code>float | None</code>) – This parameter determines the importance of horizontal and vertical position when
determining the order of text boxes. A value between -1.0 and +1.0 can be set,
with -1.0 indicating that only horizontal position matters and +1.0 indicating
that only vertical position matters. Setting the value to `None` will disable advanced
layout analysis, and text boxes will be ordered based on the position of their bottom left corner.
- **detect_vertical** (<code>bool</code>) – This parameter determines whether vertical text should be considered during layout analysis.
- **all_texts** (<code>bool</code>) – Whether layout analysis should be performed on text in figures.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### detect_undecoded_cid_characters

```python
detect_undecoded_cid_characters(text: str) -> dict[str, Any]
```

Look for character sequences of CID, i.e.: characters that haven't been properly decoded from their CID format.

This is useful to detect if the text extractor is not able to extract the text correctly, e.g. if the PDF uses
non-standard fonts.

A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like
searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor
needs. If that map is not available, the text extractor cannot decode the CID characters and will return them
as is.

See: https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output

**Parameters:**

- **text** (<code>str</code>) – The text to check for undecoded CID characters.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary containing detection results.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
```

Converts PDF files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of PDF file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents

## pptx

### PPTXToDocument

Converts PPTX files to Documents.

Usage example:

```python
from datetime import datetime

from haystack.components.converters.pptx import PPTXToDocument

converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'
1489 ``` 1490 1491 #### __init__ 1492 1493 ```python 1494 __init__( 1495 store_full_path: bool = False, 1496 link_format: Literal["markdown", "plain", "none"] = "none", 1497 ) -> None 1498 ``` 1499 1500 Create a PPTXToDocument component. 1501 1502 **Parameters:** 1503 1504 - **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document. 1505 If False, only the file name is stored. 1506 - **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options: 1507 - `"markdown"`: `[text](url)` 1508 - `"plain"`: `text (url)` 1509 - `"none"`: Only the text is extracted, link addresses are ignored. 1510 1511 #### to_dict 1512 1513 ```python 1514 to_dict() -> dict[str, Any] 1515 ``` 1516 1517 Serializes the component to a dictionary. 1518 1519 **Returns:** 1520 1521 - <code>dict\[str, Any\]</code> – Dictionary with serialized data. 1522 1523 #### run 1524 1525 ```python 1526 run( 1527 sources: list[str | Path | ByteStream], 1528 meta: dict[str, Any] | list[dict[str, Any]] | None = None, 1529 ) -> dict[str, Any] 1530 ``` 1531 1532 Converts PPTX files to Documents. 1533 1534 **Parameters:** 1535 1536 - **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects. 1537 - **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents. 1538 This value can be either a list of dictionaries or a single dictionary. 1539 If it's a single dictionary, its content is added to the metadata of all produced Documents. 1540 If it's a list, the length of the list must match the number of sources, because the two lists will 1541 be zipped. 1542 If `sources` contains ByteStream objects, their `meta` will be added to the output Documents. 

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents

## pypdf

### PyPDFExtractionMode

Bases: <code>Enum</code>

The mode to use for extracting text from a PDF.

#### from_str

```python
from_str(string: str) -> PyPDFExtractionMode
```

Convert a string to a PyPDFExtractionMode enum.

### PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses the PyPDF library.
You can attach metadata to the resulting documents.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
```

#### __init__

```python
__init__(
    *,
    extraction_mode: str | PyPDFExtractionMode = PyPDFExtractionMode.PLAIN,
    plain_mode_orientations: tuple = (0, 90, 180, 270),
    plain_mode_space_width: float = 200.0,
    layout_mode_space_vertically: bool = True,
    layout_mode_scale_weight: float = 1.25,
    layout_mode_strip_rotated: bool = True,
    layout_mode_font_height_weight: float = 1.0,
    store_full_path: bool = False
) -> None
```

Create a PyPDFToDocument component.

**Parameters:**

- **extraction_mode** (<code>str | PyPDFExtractionMode</code>) – The mode to use for extracting text from a PDF.
Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
- **plain_mode_orientations** (<code>tuple</code>) – Tuple of orientations to look for when extracting text from a PDF in plain mode.
1607 Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`. 1608 - **plain_mode_space_width** (<code>float</code>) – Forces default space width if not extracted from font. 1609 Ignored if `extraction_mode` is `PyPDFExtractionMode.LAYOUT`. 1610 - **layout_mode_space_vertically** (<code>bool</code>) – Whether to include blank lines inferred from y distance + font height. 1611 Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`. 1612 - **layout_mode_scale_weight** (<code>float</code>) – Multiplier for string length when calculating weighted average character width. 1613 Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`. 1614 - **layout_mode_strip_rotated** (<code>bool</code>) – Layout mode does not support rotated text. Set to `False` to include rotated text anyway. 1615 If rotated text is discovered, layout will be degraded and a warning will be logged. 1616 Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`. 1617 - **layout_mode_font_height_weight** (<code>float</code>) – Multiplier for font height when calculating blank line height. 1618 Ignored if `extraction_mode` is `PyPDFExtractionMode.PLAIN`. 1619 - **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document. 1620 If False, only the file name is stored. 1621 1622 #### to_dict 1623 1624 ```python 1625 to_dict() -> dict[str, Any] 1626 ``` 1627 1628 Serializes the component to a dictionary. 1629 1630 **Returns:** 1631 1632 - <code>dict\[str, Any\]</code> – Dictionary with serialized data. 1633 1634 #### from_dict 1635 1636 ```python 1637 from_dict(data: dict[str, Any]) -> PyPDFToDocument 1638 ``` 1639 1640 Deserializes the component from a dictionary. 1641 1642 **Parameters:** 1643 1644 - **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data. 1645 1646 **Returns:** 1647 1648 - <code>PyPDFToDocument</code> – Deserialized component. 
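As with the other converters, `to_dict`/`from_dict` round-trip the component through a dictionary of its constructor arguments. A minimal pure-Python sketch of that pattern (an illustrative stand-in with hypothetical names, not Haystack's actual implementation, which delegates to shared serialization utilities):

```python
# Illustrative stand-in for the component serialization round trip.
class ToyConverter:
    def __init__(self, extraction_mode: str = "plain", store_full_path: bool = False):
        self.extraction_mode = extraction_mode
        self.store_full_path = store_full_path

    def to_dict(self) -> dict:
        # Serialized form: import path of the class plus its init parameters.
        return {
            "type": f"{type(self).__module__}.{type(self).__qualname__}",
            "init_parameters": {
                "extraction_mode": self.extraction_mode,
                "store_full_path": self.store_full_path,
            },
        }

    @classmethod
    def from_dict(cls, data: dict) -> "ToyConverter":
        # Rebuild the component from the stored init parameters.
        return cls(**data["init_parameters"])

restored = ToyConverter.from_dict(ToyConverter(extraction_mode="layout").to_dict())
assert restored.extraction_mode == "layout"
```

This round trip is what allows a pipeline containing the component to be saved to YAML and reloaded with the same configuration.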
1649 1650 #### run 1651 1652 ```python 1653 run( 1654 sources: list[str | Path | ByteStream], 1655 meta: dict[str, Any] | list[dict[str, Any]] | None = None, 1656 ) -> dict[str, list[Document]] 1657 ``` 1658 1659 Converts PDF files to documents. 1660 1661 **Parameters:** 1662 1663 - **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects to convert. 1664 - **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents. 1665 This value can be a list of dictionaries or a single dictionary. 1666 If it's a single dictionary, its content is added to the metadata of all produced documents. 1667 If it's a list, its length must match the number of sources, as they are zipped together. 1668 For ByteStream objects, their `meta` is added to the output documents. 1669 1670 **Returns:** 1671 1672 - <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys: 1673 - `documents`: A list of converted documents. 1674 1675 ## tika 1676 1677 ### XHTMLParser 1678 1679 Bases: <code>HTMLParser</code> 1680 1681 Custom parser to extract pages from Tika XHTML content. 1682 1683 #### handle_starttag 1684 1685 ```python 1686 handle_starttag(tag: str, attrs: list[tuple[str, str | None]]) -> None 1687 ``` 1688 1689 Identify the start of a page div. 1690 1691 #### handle_endtag 1692 1693 ```python 1694 handle_endtag(tag: str) -> None 1695 ``` 1696 1697 Identify the end of a page div. 1698 1699 #### handle_data 1700 1701 ```python 1702 handle_data(data: str) -> None 1703 ``` 1704 1705 Populate the page content. 1706 1707 ### TikaDocumentConverter 1708 1709 Converts files of different types to Documents. 1710 1711 This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore, 1712 requires a running Tika server. 
For more options on running Tika,
see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

Usage example:

```python
from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'
```

#### __init__

```python
__init__(
    tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False
) -> None
```

Create a TikaDocumentConverter component.

**Parameters:**

- **tika_url** (<code>str</code>) – Tika server URL.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of sources, because the two lists will
be zipped.
If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: A list of converted Documents.

## txt

### TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but
you can also set a custom encoding.
It can attach metadata to the resulting documents.

### Usage example

```python
from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'
```

#### __init__

```python
__init__(encoding: str = 'utf-8', store_full_path: bool = False) -> None
```

Creates a TextFileToDocument component.

**Parameters:**

- **encoding** (<code>str</code>) – The encoding of the text files to convert.
If the encoding is specified in the metadata of a source ByteStream,
it overrides this value.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts text files to documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of text file paths or ByteStream objects to convert.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
This value can be a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they're zipped together.
For ByteStream objects, their `meta` is added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: A list of converted documents.

## xlsx

### XLSXToDocument

Converts XLSX (Excel) files into Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is
created for each sheet. The content of the Document is the table, which can be saved in CSV or Markdown format.

### Usage example

```python
from datetime import datetime

from haystack.components.converters.xlsx import XLSXToDocument

converter = XLSXToDocument()
results = converter.run(sources=["sample.xlsx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# ",A,B\n1,col_a,col_b\n2,1.5,test\n"
```

#### __init__

```python
__init__(
    table_format: Literal["csv", "markdown"] = "csv",
    sheet_name: str | int | list[str | int] | None = None,
    read_excel_kwargs: dict[str, Any] | None = None,
    table_format_kwargs: dict[str, Any] | None = None,
    *,
    link_format: Literal["markdown", "plain", "none"] = "none",
    store_full_path: bool = False
) -> None
```

Creates an XLSXToDocument component.

**Parameters:**

- **table_format** (<code>Literal['csv', 'markdown']</code>) – The format to convert the Excel file to.
- **sheet_name** (<code>str | int | list\[str | int\] | None</code>) – The name of the sheet to read. If None, all sheets are read.
- **read_excel_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional arguments to pass to `pandas.read_excel`.
See https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel
- **table_format_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to pass to the table format function.
    - If `table_format` is "csv", these arguments are passed to `pandas.DataFrame.to_csv`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
    - If `table_format` is "markdown", these arguments are passed to `pandas.DataFrame.to_markdown`.
    See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
- **link_format** (<code>Literal['markdown', 'plain', 'none']</code>) – The format for link output. Possible options:
    - `"markdown"`: `[text](url)`
    - `"plain"`: `text (url)`
    - `"none"`: Only the text is extracted; link addresses are ignored.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
If False, only the file name is stored.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Converts XLSX files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced documents.
If it's a list, its length must match the number of sources, as they are zipped together.
If `sources` contains ByteStream objects, their `meta` will be added to the output documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: A list of converted documents.
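
Every `run` method above applies the same `meta` rules: a single dictionary is copied into the metadata of every produced document, while a list is zipped one-to-one with `sources`. A minimal stdlib sketch of that logic (the helper name `normalize_meta` is hypothetical and for illustration only; it is not part of the Haystack API):

```python
def normalize_meta(sources, meta=None):
    """Return one metadata dict per source, mirroring the documented rules.

    Hypothetical helper for illustration; not part of the Haystack API.
    """
    if meta is None:
        # No metadata given: each document starts with an empty dict.
        return [{} for _ in sources]
    if isinstance(meta, dict):
        # A single dict is copied into the metadata of every document.
        return [dict(meta) for _ in sources]
    if len(meta) != len(sources):
        # A list must line up one-to-one with the sources before zipping.
        raise ValueError("The length of the metadata list must match the number of sources.")
    return [dict(m) for m in meta]


print(normalize_meta(["a.txt", "b.txt"], {"lang": "en"}))
# [{'lang': 'en'}, {'lang': 'en'}]
```

For ByteStream sources, each stream's own `meta` is additionally merged into the corresponding document's metadata, as noted in each `run` reference above.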