---
title: "PreProcessors"
id: preprocessors-api
description: "Preprocess your Documents and texts. Clean, split, and more."
slug: "/preprocessors-api"
---

<a id="csv_document_cleaner"></a>

## Module csv\_document\_cleaner

<a id="csv_document_cleaner.CSVDocumentCleaner"></a>

### CSVDocumentCleaner

A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents, allowing
for the optional ignoring of a specified number of rows and columns before performing
the cleaning operation. Additionally, it provides options to keep document IDs and
control whether empty rows and columns should be removed.

<a id="csv_document_cleaner.CSVDocumentCleaner.__init__"></a>

#### CSVDocumentCleaner.\_\_init\_\_

```python
def __init__(*,
             ignore_rows: int = 0,
             ignore_columns: int = 0,
             remove_empty_rows: bool = True,
             remove_empty_columns: bool = True,
             keep_id: bool = False) -> None
```

Initializes the CSVDocumentCleaner component.

**Arguments**:

- `ignore_rows`: Number of rows to ignore from the top of the CSV table before processing.
- `ignore_columns`: Number of columns to ignore from the left of the CSV table before processing.
- `remove_empty_rows`: Whether to remove rows that are entirely empty.
- `remove_empty_columns`: Whether to remove columns that are entirely empty.
- `keep_id`: Whether to retain the original document ID in the output document.

Rows and columns ignored using `ignore_rows` and `ignore_columns` are preserved in the final output,
meaning they are not considered when removing empty rows and columns.

<a id="csv_document_cleaner.CSVDocumentCleaner.run"></a>

#### CSVDocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans CSV documents by removing empty rows and columns while preserving specified ignored rows and columns.

**Arguments**:

- `documents`: List of Documents containing CSV-formatted content.

**Returns**:

A dictionary with a list of cleaned Documents under the key "documents".

Processing steps:
1. Reads each document's content as a CSV table.
2. Retains the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
3. Drops any rows and columns that are entirely empty (if enabled by `remove_empty_rows` and
`remove_empty_columns`).
4. Reattaches the ignored rows and columns to maintain their original positions.
5. Returns the cleaned CSV content as a new `Document` object, with an option to retain the original
document ID.

<a id="csv_document_splitter"></a>

## Module csv\_document\_splitter

<a id="csv_document_splitter.CSVDocumentSplitter"></a>

### CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:
- identify consecutive empty rows or columns that exceed a given threshold
and use them as delimiters to segment the document into smaller tables.
- split each row into a separate sub-table, represented as a Document.
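
### Usage example

A minimal sketch of the threshold mode. The CSV content and threshold values below are illustrative assumptions; the component name and `run` API come from this page.

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentSplitter

# Two tables stacked in one CSV, separated by two fully empty rows
csv_content = "A,B\n1,2\n,\n,\nC,D\n3,4\n"
doc = Document(content=csv_content)

# In the default "threshold" mode, two consecutive empty rows trigger a split;
# passing None for column_split_threshold disables column-based splitting
splitter = CSVDocumentSplitter(row_split_threshold=2, column_split_threshold=None)
result = splitter.run(documents=[doc])
# result["documents"] should hold one Document per detected sub-table
```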

<a id="csv_document_splitter.CSVDocumentSplitter.__init__"></a>

#### CSVDocumentSplitter.\_\_init\_\_

```python
def __init__(row_split_threshold: int | None = 2,
             column_split_threshold: int | None = 2,
             read_csv_kwargs: dict[str, Any] | None = None,
             split_mode: SplitMode = "threshold") -> None
```

Initializes the CSVDocumentSplitter component.

**Arguments**:

- `row_split_threshold`: The minimum number of consecutive empty rows required to trigger a split.
- `column_split_threshold`: The minimum number of consecutive empty columns required to trigger a split.
- `read_csv_kwargs`: Additional keyword arguments to pass to `pandas.read_csv`.
By default, the component reads CSV content with the options:
  - `header=None`
  - `skip_blank_lines=False` to preserve blank lines
  - `dtype=object` to prevent type inference (for example, converting numbers to floats).

  See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
- `split_mode`: If `threshold`, the component splits the document based on the number of
consecutive empty rows or columns that exceed the `row_split_threshold` or `column_split_threshold`.
If `row-wise`, the component splits each row into a separate sub-table.

<a id="csv_document_splitter.CSVDocumentSplitter.run"></a>

#### CSVDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Processes and splits a list of CSV documents into multiple sub-tables.

**Splitting Process:**
1. Applies a row-based split if `row_split_threshold` is provided.
2. Applies a column-based split if `column_split_threshold` is provided.
3. If both thresholds are specified, performs a recursive split by rows first, then columns, ensuring
further fragmentation of any sub-tables that still contain empty sections.
4. Sorts the resulting sub-tables based on their original positions within the document.

**Arguments**:

- `documents`: A list of Documents containing CSV-formatted content.
Each document is assumed to contain one or more tables separated by empty rows or columns.

**Returns**:

A dictionary with a key `"documents"`, mapping to a list of new `Document` objects,
each representing an extracted sub-table from the original CSV.
The metadata of each document includes:
- A field `source_id` to track the original document.
- A field `row_idx_start` to indicate the starting row index of the sub-table in the original table.
- A field `col_idx_start` to indicate the starting column index of the sub-table in the original table.
- A field `split_id` to indicate the order of the split in the original document.
- All other metadata copied from the original document.

Notes:
- If a document cannot be processed, it is returned unchanged.
- The `meta` field from the original document is preserved in the split documents.

<a id="document_cleaner"></a>

## Module document\_cleaner

<a id="document_cleaner.DocumentCleaner"></a>

### DocumentCleaner

Cleans the text in the documents.

It removes extra whitespaces,
empty lines, specified substrings, regexes,
and page headers and footers (in this order).

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This is a document to clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```

<a id="document_cleaner.DocumentCleaner.__init__"></a>

#### DocumentCleaner.\_\_init\_\_

```python
def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"]
             | None = None,
             ascii_only: bool = False,
             strip_whitespaces: bool = False,
             replace_regexes: dict[str, str] | None = None)
```

Initialize DocumentCleaner.

**Arguments**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings (headers and footers) from pages.
Pages must be separated by a form feed character "\f",
which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- `remove_substrings`: List of substrings to remove from the text.
- `remove_regex`: Regex to match and replace substrings with "".
- `keep_id`: If `True`, keeps the IDs of the original documents.
- `unicode_normalization`: Unicode normalization form to apply to the text.
Note: This will run before any other steps.
- `ascii_only`: Whether to convert the text to ASCII only.
Will remove accents from characters and replace them with ASCII characters.
Other non-ASCII characters will be removed.
Note: This will run before any pattern matching or removal.
- `strip_whitespaces`: If `True`, removes leading and trailing whitespace from the document content
using Python's `str.strip()`. Unlike `remove_extra_whitespaces`, this only affects the beginning
and end of the text, preserving internal whitespace (useful for Markdown formatting).
- `replace_regexes`: A dictionary mapping regex patterns to their replacement strings.
For example, `{r'\n\n+': '\n'}` replaces multiple consecutive newlines with a single newline.
This is applied after `remove_regex` and allows custom replacements instead of just removal.

<a id="document_cleaner.DocumentCleaner.run"></a>

#### DocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Cleans up the documents.

**Arguments**:

- `documents`: List of Documents to clean.

**Raises**:

- `TypeError`: if documents is not a list of Documents.

**Returns**:

A dictionary with the following key:
- `documents`: List of cleaned Documents.

<a id="document_preprocessor"></a>

## Module document\_preprocessor

<a id="document_preprocessor.DocumentPreprocessor"></a>

### DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component consists of a DocumentSplitter followed by a DocumentCleaner in a single pipeline.
It takes a list of documents as input and returns a processed list of documents.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```

<a id="document_preprocessor.DocumentPreprocessor.__init__"></a>

#### DocumentPreprocessor.\_\_init\_\_

```python
def __init__(*,
             split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 250,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"]
             | None = None,
             ascii_only: bool = False) -> None
```

Initialize a DocumentPreprocessor that first splits and then cleans documents.

**Splitter Parameters**:

**Arguments**:

- `split_by`: The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
- `split_length`: The maximum number of units (words, lines, pages, and so on) in each split.
- `split_overlap`: The number of overlapping units between consecutive splits.
- `split_threshold`: The minimum number of units per split. If a split is smaller than this, it's merged
with the previous split.
- `splitting_function`: A custom function for splitting if `split_by="function"`.
- `respect_sentence_boundary`: If `True`, splits by words but tries not to break inside a sentence.
- `language`: Language used by the sentence tokenizer if `split_by="sentence"` or
`respect_sentence_boundary=True`.
- `use_split_rules`: Whether to apply additional splitting heuristics for the sentence splitter.
- `extend_abbreviations`: Whether to extend the sentence splitter with curated abbreviations for certain
languages.

**Cleaner Parameters**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings like headers/footers across pages.
- `keep_id`: If `True`, keeps the original document IDs.
- `remove_substrings`: A list of strings to remove from the document content.
- `remove_regex`: A regex pattern whose matches will be removed from the document content.
- `unicode_normalization`: Unicode normalization form to apply to the text, for example `"NFC"`.
- `ascii_only`: If `True`, converts text to ASCII only.

<a id="document_preprocessor.DocumentPreprocessor.to_dict"></a>

#### DocumentPreprocessor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serialize SuperComponent to a dictionary.

**Returns**:

Dictionary with serialized data.
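
A minimal serialization round-trip sketch, using `to_dict` together with `from_dict` documented below (the parameter values are arbitrary examples):

```python
from haystack.components.preprocessors import DocumentPreprocessor

preprocessor = DocumentPreprocessor(split_by="word", split_length=100)
data = preprocessor.to_dict()  # a plain dict, suitable for JSON/YAML storage
restored = DocumentPreprocessor.from_dict(data)  # rebuilds an equivalent component
```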

<a id="document_preprocessor.DocumentPreprocessor.from_dict"></a>

#### DocumentPreprocessor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentPreprocessor"
```

Deserializes the SuperComponent from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized SuperComponent.

<a id="document_splitter"></a>

## Module document\_splitter

<a id="document_splitter.DocumentSplitter"></a>

### DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations
and prevents exceeding language model context limits.

The DocumentSplitter is compatible with the following DocumentStores:
- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) limited support, overlapping information is
not stored
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) limited support, overlapping
information is not stored
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
```

<a id="document_splitter.DocumentSplitter.__init__"></a>

#### DocumentSplitter.\_\_init\_\_

```python
def __init__(split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             *,
             skip_empty_documents: bool = True)
```

Initialize DocumentSplitter.

**Arguments**:

- `split_by`: The unit for splitting your documents. Choose from:
  - `word` for splitting by spaces (" ")
  - `period` for splitting by periods (".")
  - `page` for splitting by form feed ("\f")
  - `passage` for splitting by double line breaks ("\n\n")
  - `line` for splitting each line ("\n")
  - `sentence` for splitting by the NLTK sentence tokenizer
- `split_length`: The maximum number of units in each split.
- `split_overlap`: The number of overlapping units for each split.
- `split_threshold`: The minimum number of units per split. If a split has fewer units
than the threshold, it's attached to the previous split.
- `splitting_function`: Necessary when `split_by` is set to "function".
This is a function which must accept a single `str` as input and return a `list` of `str` as output,
representing the chunks after splitting.
- `respect_sentence_boundary`: Choose whether to respect sentence boundaries when splitting by "word".
If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
- `language`: Choose the language for the NLTK tokenizer. The default is English ("en").
- `use_split_rules`: Choose whether to use additional split rules when splitting by `sentence`.
- `extend_abbreviations`: Choose whether to extend NLTK's PunktTokenizer abbreviations with a list
of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
- `skip_empty_documents`: Choose whether to skip documents with empty content. Default is True.
Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
from non-textual documents.

<a id="document_splitter.DocumentSplitter.warm_up"></a>

#### DocumentSplitter.warm\_up

```python
def warm_up()
```

Warm up the DocumentSplitter by loading the sentence tokenizer.

<a id="document_splitter.DocumentSplitter.run"></a>

#### DocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Split documents into smaller parts.

Splits documents by the unit expressed in `split_by`, with a length of `split_length`
and an overlap of `split_overlap`.

**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `TypeError`: if the input is not a list of Documents.
- `ValueError`: if the content of a document is None.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

<a id="document_splitter.DocumentSplitter.to_dict"></a>

#### DocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

<a id="document_splitter.DocumentSplitter.from_dict"></a>

#### DocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentSplitter"
```

Deserializes the component from a dictionary.

<a id="embedding_based_document_splitter"></a>

## Module embedding\_based\_document\_splitter

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter"></a>

### EmbeddingBasedDocumentSplitter

Splits documents based on embedding similarity using cosine distances between sequential sentence groups.

This component first splits text into sentences, optionally groups them, calculates embeddings for each group,
and then uses cosine distance between sequential embeddings to determine split points. Any distance above
the specified percentile is treated as a break point. The component also tracks page numbers based on form feed
characters ("\f") in the original document.

This component is inspired by [5 Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)
by Greg Kamradt.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,  # Group 2 sentences before calculating embeddings
    percentile=0.95,        # Split when cosine distance exceeds the 95th percentile
    min_length=50,          # Merge splits shorter than 50 characters
    max_length=1000         # Further split chunks longer than 1000 characters
)
splitter.warm_up()
result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk
# Each split document includes metadata: source_id, split_id, and page_number
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")
```

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.__init__"></a>

#### EmbeddingBasedDocumentSplitter.\_\_init\_\_

```python
def __init__(*,
             document_embedder: DocumentEmbedder,
             sentences_per_group: int = 3,
             percentile: float = 0.95,
             min_length: int = 50,
             max_length: int = 1000,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True)
```

Initialize EmbeddingBasedDocumentSplitter.

**Arguments**:

- `document_embedder`: The DocumentEmbedder to use for calculating embeddings.
- `sentences_per_group`: Number of sentences to group together before embedding.
- `percentile`: Percentile threshold for cosine distance. Distances above this percentile
are treated as break points.
- `min_length`: Minimum length of splits in characters. Splits below this length will be merged.
- `max_length`: Maximum length of splits in characters. Splits above this length will be recursively split.
- `language`: Language for sentence tokenization.
- `use_split_rules`: Whether to use additional split rules for sentence tokenization. Applies additional
split rules from SentenceSplitter to the sentence spans.
- `extend_abbreviations`: If True, the abbreviations used by NLTK's PunktTokenizer are extended by a list
of curated abbreviations. Currently supported languages are: en, de.
If False, the default abbreviations are used.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.warm_up"></a>

#### EmbeddingBasedDocumentSplitter.warm\_up

```python
def warm_up() -> None
```

Warm up the component by initializing the sentence splitter.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.run"></a>

#### EmbeddingBasedDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Split documents based on embedding similarity.

**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `RuntimeError`: If the component wasn't warmed up.
- `TypeError`: If the input is not a list of Documents.
- `ValueError`: If the document content is None or empty.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `split_id` to track the split number.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.to_dict"></a>

#### EmbeddingBasedDocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Serialized dictionary representation of the component.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.from_dict"></a>

#### EmbeddingBasedDocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "EmbeddingBasedDocumentSplitter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize and create the component.

**Returns**:

The deserialized component.

<a id="hierarchical_document_splitter"></a>

## Module hierarchical\_document\_splitter

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter"></a>

### HierarchicalDocumentSplitter

Splits documents into blocks of different sizes, building a hierarchical tree structure.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between
are connected such that the smaller blocks are children of the larger parent blocks.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.__init__"></a>

#### HierarchicalDocumentSplitter.\_\_init\_\_

```python
def __init__(block_sizes: set[int],
             split_overlap: int = 0,
             split_by: Literal["word", "sentence", "page",
                               "passage"] = "word")
```

Initialize HierarchicalDocumentSplitter.

**Arguments**:

- `block_sizes`: Set of block sizes to split the document into. The blocks are split in descending order.
- `split_overlap`: The number of overlapping units for each split.
- `split_by`: The unit for splitting your documents.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.run"></a>

#### HierarchicalDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Builds a hierarchical document structure for each document in a list of documents.

**Arguments**:

- `documents`: List of Documents to split into hierarchical blocks.

**Returns**:

List of HierarchicalDocument

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.build_hierarchy_from_doc"></a>

#### HierarchicalDocumentSplitter.build\_hierarchy\_from\_doc

```python
def build_hierarchy_from_doc(document: Document) -> list[Document]
```

Build a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes, represented
as HierarchicalDocument objects.

**Arguments**:

- `document`: Document to split into hierarchical blocks.

**Returns**:

List of HierarchicalDocument

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.to_dict"></a>

#### HierarchicalDocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Returns a dictionary representation of the component.

**Returns**:

Serialized dictionary representation of the component.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.from_dict"></a>

#### HierarchicalDocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HierarchicalDocumentSplitter"
```

Deserialize this component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize and create the component.

**Returns**:

The deserialized component.

<a id="markdown_header_splitter"></a>

## Module markdown\_header\_splitter

<a id="markdown_header_splitter.MarkdownHeaderSplitter"></a>

### MarkdownHeaderSplitter

Split documents at ATX-style Markdown headers (#), with optional secondary splitting.

This component processes text documents by:
- Splitting them into chunks at Markdown headers (e.g., '#', '##', etc.), preserving header hierarchy as metadata.
- Optionally applying a secondary split (by word, passage, period, or line) to each chunk
(using Haystack's DocumentSplitter).
- Preserving and propagating metadata such as parent headers, page numbers, and split IDs.

<a id="markdown_header_splitter.MarkdownHeaderSplitter.__init__"></a>

#### MarkdownHeaderSplitter.\_\_init\_\_

```python
def __init__(*,
             page_break_character: str = "\f",
             keep_headers: bool = True,
             secondary_split: Literal["word", "passage", "period", "line"]
             | None = None,
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0,
             skip_empty_documents: bool = True)
```

Initialize the MarkdownHeaderSplitter.

**Arguments**:

- `page_break_character`: Character used to identify page breaks. Defaults to form feed ("\f").
- `keep_headers`: If True, headers are kept in the content. If False, headers are moved to metadata.
Defaults to True.
- `secondary_split`: Optional secondary split condition after header splitting.
Options are None, "word", "passage", "period", "line". Defaults to None.
- `split_length`: The maximum number of units in each split when using secondary splitting. Defaults to 200.
- `split_overlap`: The number of overlapping units for each split when using secondary splitting.
Defaults to 0.
- `split_threshold`: The minimum number of units per split when using secondary splitting. Defaults to 0.
- `skip_empty_documents`: Choose whether to skip documents with empty content. Default is True.
Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
from non-textual documents.

<a id="markdown_header_splitter.MarkdownHeaderSplitter.warm_up"></a>

#### MarkdownHeaderSplitter.warm\_up

```python
def warm_up()
```

Warm up the MarkdownHeaderSplitter.
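
Before calling `run` (documented next), warm the component up. A minimal end-to-end sketch; the document content is invented for illustration, while the class name, parameters, and methods come from this page:

```python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

doc = Document(content="# Intro\nWelcome text.\n\n## Details\nMore text here.")

# keep_headers=True leaves the headers in each chunk's content
splitter = MarkdownHeaderSplitter(keep_headers=True)
splitter.warm_up()
result = splitter.run(documents=[doc])
# Expect one Document per header section, with header hierarchy tracked in meta
```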

<a id="markdown_header_splitter.MarkdownHeaderSplitter.run"></a>

#### MarkdownHeaderSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Run the Markdown header splitter with optional secondary splitting.

**Arguments**:

- `documents`: List of documents to split.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `page_number` to track the original page number.
  - A metadata field `split_id` to identify the split chunk index within its parent document.
  - All other metadata copied from the original document.

<a id="recursive_splitter"></a>

## Module recursive\_splitter

<a id="recursive_splitter.RecursiveDocumentSplitter"></a>

### RecursiveDocumentSplitter

Recursively chunk text into smaller chunks.

This component splits text by recursively applying a list of separators to it.
The separators are applied in the order they are provided, with the last separator
typically being the most specific one.

Each separator is applied to the text, and the component then checks each of the resulting chunks.
Chunks that fit within the `split_length` are kept; for chunks larger than the `split_length`,
the next separator in the list is applied to the remaining text.

This is repeated until all chunks are smaller than the `split_length` parameter.

**Example**:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science.
Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

<a id="recursive_splitter.RecursiveDocumentSplitter.__init__"></a>

#### RecursiveDocumentSplitter.\_\_init\_\_

```python
def __init__(*,
             split_length: int = 200,
             split_overlap: int = 0,
             split_unit: Literal["word", "char", "token"] = "word",
             separators: list[str] | None = None,
             sentence_splitter_params: dict[str, Any] | None = None)
```

Initializes a RecursiveDocumentSplitter.

**Arguments**:

- `split_length`: The maximum length of each chunk, by default in words, but it can be in characters or tokens.
See the `split_unit` parameter.
- `split_overlap`: The number of characters to overlap between consecutive chunks.
- `split_unit`: The unit of the `split_length` parameter. It can be either "word", "char", or "token".
If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base).
- `separators`: An optional list of separator strings to use for splitting the text. The string
separators will be treated as regular expressions unless the separator is "sentence", in which case the
text will be split into sentences using a custom sentence tokenizer based on NLTK.
See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter.
If no separators are provided, the default separators ["\n\n", "sentence", "\n", " "] are used.
- `sentence_splitter_params`: Optional parameters to pass to the sentence tokenizer.
See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.

**Raises**:

- `ValueError`: If the overlap is greater than or equal to the chunk size, if the overlap is negative, or
if any separator is not a string.

<a id="recursive_splitter.RecursiveDocumentSplitter.warm_up"></a>

#### RecursiveDocumentSplitter.warm\_up

```python
def warm_up() -> None
```

Warm up the sentence tokenizer and the tiktoken tokenizer if needed.

<a id="recursive_splitter.RecursiveDocumentSplitter.run"></a>

#### RecursiveDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Split a list of documents into documents with smaller chunks of text.

**Arguments**:

- `documents`: List of Documents to split.

**Returns**:

A dictionary containing a key "documents" with a list of Documents with smaller chunks of text corresponding
to the input documents.

<a id="text_cleaner"></a>

## Module text\_cleaner

<a id="text_cleaner.TextCleaner"></a>

### TextCleaner

Cleans text strings.

It can remove substrings matching a list of regular expressions, convert text to lowercase,
remove punctuation, and remove numbers.
Use it to clean up text data before evaluation.

### Usage example

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
```

<a id="text_cleaner.TextCleaner.__init__"></a>

#### TextCleaner.\_\_init\_\_

```python
def __init__(remove_regexps: list[str] | None = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)
```

Initializes the TextCleaner component.

**Arguments**:

- `remove_regexps`: A list of regex patterns to remove matching substrings from the text.
- `convert_to_lowercase`: If `True`, converts all characters to lowercase.
- `remove_punctuation`: If `True`, removes punctuation from the text.
- `remove_numbers`: If `True`, removes numerical digits from the text.

<a id="text_cleaner.TextCleaner.run"></a>

#### TextCleaner.run

```python
@component.output_types(texts=list[str])
def run(texts: list[str]) -> dict[str, Any]
```

Cleans up the given list of strings.

**Arguments**:

- `texts`: List of strings to clean.

**Returns**:

A dictionary with the following key:
- `texts`: the cleaned list of strings.
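
For instance, a short sketch combining `remove_regexps` with lowercasing; the input string and pattern are invented for illustration:

```python
from haystack.components.preprocessors import TextCleaner

# Remove runs of digits via regex, and normalize case
cleaner = TextCleaner(remove_regexps=[r"\d+"], convert_to_lowercase=True)
result = cleaner.run(texts=["Chapter 12 THE END"])
print(result["texts"])  # expected: ['chapter  the end']
```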