---
title: "PreProcessors"
id: preprocessors-api
description: "Preprocess your Documents and texts. Clean, split, and more."
slug: "/preprocessors-api"
---

<a id="csv_document_cleaner"></a>

## Module csv\_document\_cleaner

<a id="csv_document_cleaner.CSVDocumentCleaner"></a>

### CSVDocumentCleaner

A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents, allowing
for the optional ignoring of a specified number of rows and columns before performing
the cleaning operation. Additionally, it provides options to keep document IDs and
control whether empty rows and columns should be removed.

<a id="csv_document_cleaner.CSVDocumentCleaner.__init__"></a>

#### CSVDocumentCleaner.\_\_init\_\_

```python
def __init__(*,
             ignore_rows: int = 0,
             ignore_columns: int = 0,
             remove_empty_rows: bool = True,
             remove_empty_columns: bool = True,
             keep_id: bool = False) -> None
```

Initializes the CSVDocumentCleaner component.

**Arguments**:

- `ignore_rows`: Number of rows to ignore from the top of the CSV table before processing.
- `ignore_columns`: Number of columns to ignore from the left of the CSV table before processing.
- `remove_empty_rows`: Whether to remove rows that are entirely empty.
- `remove_empty_columns`: Whether to remove columns that are entirely empty.
- `keep_id`: Whether to retain the original document ID in the output document.

Rows and columns ignored using these parameters are preserved in the final output, meaning
they are not considered when removing empty rows and columns.

<a id="csv_document_cleaner.CSVDocumentCleaner.run"></a>

#### CSVDocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans CSV documents by removing empty rows and columns while preserving specified ignored rows and columns.

Processing steps:

1. Reads each document's content as a CSV table.
2. Retains the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
3. Drops any rows and columns that are entirely empty (if enabled by `remove_empty_rows` and
   `remove_empty_columns`).
4. Reattaches the ignored rows and columns to maintain their original positions.
5. Returns the cleaned CSV content as a new `Document` object, with an option to retain the original
   document ID.

**Arguments**:

- `documents`: List of Documents containing CSV-formatted content.

**Returns**:

A dictionary with a list of cleaned Documents under the key `"documents"`.
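### Usage example

A minimal sketch of cleaning a CSV document; the sample table below (with an entirely empty third row and an entirely empty last column) is illustrative:

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentCleaner

# The third row and the last column are entirely empty (illustrative content)
csv_content = "col1,col2,\nvalue1,value2,\n,,\nvalue4,value5,\n"
doc = Document(content=csv_content)

cleaner = CSVDocumentCleaner(remove_empty_rows=True, remove_empty_columns=True)
result = cleaner.run(documents=[doc])
print(result["documents"][0].content)  # cleaned CSV without the empty row and column
```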
<a id="csv_document_splitter"></a>

## Module csv\_document\_splitter

<a id="csv_document_splitter.CSVDocumentSplitter"></a>

### CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:
- identify consecutive empty rows or columns that exceed a given threshold
  and use them as delimiters to segment the document into smaller tables.
- split each row into a separate sub-table, represented as a Document.

<a id="csv_document_splitter.CSVDocumentSplitter.__init__"></a>

#### CSVDocumentSplitter.\_\_init\_\_

```python
def __init__(row_split_threshold: int | None = 2,
             column_split_threshold: int | None = 2,
             read_csv_kwargs: dict[str, Any] | None = None,
             split_mode: SplitMode = "threshold") -> None
```

Initializes the CSVDocumentSplitter component.

**Arguments**:

- `row_split_threshold`: The minimum number of consecutive empty rows required to trigger a split.
- `column_split_threshold`: The minimum number of consecutive empty columns required to trigger a split.
- `read_csv_kwargs`: Additional keyword arguments to pass to `pandas.read_csv`.
By default, the component calls `pandas.read_csv` with these options:
  - `header=None`
  - `skip_blank_lines=False` to preserve blank lines
  - `dtype=object` to prevent type inference (for example, converting numbers to floats).
See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
- `split_mode`: If `threshold`, the component splits the document based on the number of
consecutive empty rows or columns that exceed the `row_split_threshold` or `column_split_threshold`.
If `row-wise`, the component splits each row into a separate sub-table.

<a id="csv_document_splitter.CSVDocumentSplitter.run"></a>

#### CSVDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Processes and splits a list of CSV documents into multiple sub-tables.

**Splitting Process:**

1. Applies a row-based split if `row_split_threshold` is provided.
2. Applies a column-based split if `column_split_threshold` is provided.
3. If both thresholds are specified, performs a recursive split by rows first, then columns, ensuring
   further fragmentation of any sub-tables that still contain empty sections.
4. Sorts the resulting sub-tables based on their original positions within the document.

**Arguments**:

- `documents`: A list of Documents containing CSV-formatted content.
Each document is assumed to contain one or more tables separated by empty rows or columns.

**Returns**:

A dictionary with a key `"documents"`, mapping to a list of new `Document` objects,
each representing an extracted sub-table from the original CSV.
The metadata of each document includes:
- A field `source_id` to track the original document.
- A field `row_idx_start` to indicate the starting row index of the sub-table in the original table.
- A field `col_idx_start` to indicate the starting column index of the sub-table in the original table.
- A field `split_id` to indicate the order of the split in the original document.
- All other metadata copied from the original document.

If a document cannot be processed, it is returned unchanged.
The `meta` field from the original document is preserved in the split documents.
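### Usage example

A minimal sketch of threshold-based splitting; the two sub-tables below, separated by two consecutive empty rows, are illustrative:

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentSplitter

# Two tables separated by two consecutive empty rows (illustrative content)
csv_content = "A,B\n1,2\n,\n,\nC,D\n3,4\n"
doc = Document(content=csv_content)

# Only split on empty rows; disable column-based splitting
splitter = CSVDocumentSplitter(row_split_threshold=2, column_split_threshold=None)
result = splitter.run(documents=[doc])
for sub_table in result["documents"]:
    print(sub_table.meta.get("row_idx_start"), repr(sub_table.content))
```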
<a id="document_cleaner"></a>

## Module document\_cleaner

<a id="document_cleaner.DocumentCleaner"></a>

### DocumentCleaner

Cleans the text in the documents.

It removes extra whitespaces, empty lines, specified substrings, substrings matching regexes,
and page headers and footers (in this order).

### Usage example:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This is a document to clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```

<a id="document_cleaner.DocumentCleaner.__init__"></a>

#### DocumentCleaner.\_\_init\_\_

```python
def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"]
             | None = None,
             ascii_only: bool = False)
```

Initialize DocumentCleaner.

**Arguments**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings (headers and footers) from pages.
Pages must be separated by a form feed character "\f",
which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- `remove_substrings`: List of substrings to remove from the text.
- `remove_regex`: Regex to match and replace substrings with "".
- `keep_id`: If `True`, keeps the IDs of the original documents.
- `unicode_normalization`: Unicode normalization form to apply to the text.
Note: This runs before any other steps.
- `ascii_only`: Whether to convert the text to ASCII only.
Removes accents from characters and replaces them with ASCII characters.
Other non-ASCII characters are removed.
Note: This runs before any pattern matching or removal.

<a id="document_cleaner.DocumentCleaner.run"></a>

#### DocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Cleans up the documents.

**Arguments**:

- `documents`: List of Documents to clean.

**Raises**:

- `TypeError`: if documents is not a list of Documents.

**Returns**:

A dictionary with the following key:
- `documents`: List of cleaned Documents.

<a id="document_preprocessor"></a>

## Module document\_preprocessor

<a id="document_preprocessor.DocumentPreprocessor"></a>

### DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component consists of a DocumentSplitter followed by a DocumentCleaner in a single pipeline.
It takes a list of documents as input and returns a processed list of documents.
Usage example:
```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```

<a id="document_preprocessor.DocumentPreprocessor.__init__"></a>

#### DocumentPreprocessor.\_\_init\_\_

```python
def __init__(*,
             split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 250,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"]
             | None = None,
             ascii_only: bool = False) -> None
```

Initialize a DocumentPreprocessor that first splits and then cleans documents.

**Arguments**:

**Splitter Parameters**:

- `split_by`: The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
- `split_length`: The maximum number of units (words, lines, pages, and so on) in each split.
- `split_overlap`: The number of overlapping units between consecutive splits.
- `split_threshold`: The minimum number of units per split. If a split is smaller than this, it's merged
with the previous split.
- `splitting_function`: A custom function for splitting if `split_by="function"`.
- `respect_sentence_boundary`: If `True`, splits by words but tries not to break inside a sentence.
- `language`: Language used by the sentence tokenizer if `split_by="sentence"` or
`respect_sentence_boundary=True`.
- `use_split_rules`: Whether to apply additional splitting heuristics for the sentence splitter.
- `extend_abbreviations`: Whether to extend the sentence splitter with curated abbreviations for certain
languages.

**Cleaner Parameters**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings like headers/footers across pages.
- `keep_id`: If `True`, keeps the original document IDs.
- `remove_substrings`: A list of strings to remove from the document content.
- `remove_regex`: A regex pattern whose matches will be removed from the document content.
- `unicode_normalization`: Unicode normalization form to apply to the text, for example `"NFC"`.
- `ascii_only`: If `True`, converts text to ASCII only.
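A sketch of a non-default configuration that combines splitter and cleaner parameters; the values and the `"CONFIDENTIAL"` marker are illustrative:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

preprocessor = DocumentPreprocessor(
    split_by="sentence",            # split on sentence boundaries
    split_length=5,                 # at most 5 sentences per chunk
    split_overlap=1,                # 1 sentence of overlap between consecutive chunks
    remove_extra_whitespaces=True,
    remove_substrings=["CONFIDENTIAL"],  # strip a boilerplate marker (illustrative)
)
doc = Document(content="First sentence.  Second sentence. CONFIDENTIAL Third sentence.")
result = preprocessor.run(documents=[doc])
print([d.content for d in result["documents"]])
```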
<a id="document_preprocessor.DocumentPreprocessor.to_dict"></a>

#### DocumentPreprocessor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serialize SuperComponent to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="document_preprocessor.DocumentPreprocessor.from_dict"></a>

#### DocumentPreprocessor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentPreprocessor"
```

Deserializes the SuperComponent from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized SuperComponent.

<a id="document_splitter"></a>

## Module document\_splitter

<a id="document_splitter.DocumentSplitter"></a>

### DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations
and prevents exceeding language model context limits.

The DocumentSplitter is compatible with the following DocumentStores:
- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) (limited support: overlapping information is
not stored)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) (limited support: overlapping
information is not stored)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
```

<a id="document_splitter.DocumentSplitter.__init__"></a>

#### DocumentSplitter.\_\_init\_\_

```python
def __init__(split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             *,
             skip_empty_documents: bool = True)
```

Initialize DocumentSplitter.

**Arguments**:

- `split_by`: The unit for splitting your documents. Choose from:
  - `word` for splitting by spaces (" ")
  - `period` for splitting by periods (".")
  - `page` for splitting by form feed ("\f")
  - `passage` for splitting by double line breaks ("\n\n")
  - `line` for splitting each line ("\n")
  - `sentence` for splitting by the NLTK sentence tokenizer
- `split_length`: The maximum number of units in each split.
- `split_overlap`: The number of overlapping units for each split.
- `split_threshold`: The minimum number of units per split. If a split has fewer units
than the threshold, it's attached to the previous split.
- `splitting_function`: Necessary when `split_by` is set to "function".
This is a function which must accept a single `str` as input and return a `list` of `str` as output,
representing the chunks after splitting.
- `respect_sentence_boundary`: Choose whether to respect sentence boundaries when splitting by "word".
If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
- `language`: Choose the language for the NLTK tokenizer. The default is English ("en").
- `use_split_rules`: Choose whether to use additional split rules when splitting by `sentence`.
- `extend_abbreviations`: Choose whether to extend NLTK's PunktTokenizer abbreviations with a list
of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
- `skip_empty_documents`: Choose whether to skip documents with empty content. Default is True.
Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
from non-textual documents.

<a id="document_splitter.DocumentSplitter.warm_up"></a>

#### DocumentSplitter.warm\_up

```python
def warm_up()
```

Warm up the DocumentSplitter by loading the sentence tokenizer.

<a id="document_splitter.DocumentSplitter.run"></a>

#### DocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Split documents into smaller parts.

Splits documents by the unit expressed in `split_by`, with a length of `split_length`
and an overlap of `split_overlap`.

**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `TypeError`: if the input is not a list of Documents.
- `ValueError`: if the content of a document is None.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

<a id="document_splitter.DocumentSplitter.to_dict"></a>

#### DocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

<a id="document_splitter.DocumentSplitter.from_dict"></a>

#### DocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentSplitter"
```

Deserializes the component from a dictionary.
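A sketch of word-based splitting that keeps sentences intact; the parameter values are illustrative:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter(
    split_by="word",
    split_length=20,
    split_overlap=4,
    respect_sentence_boundary=True,  # avoid breaking a chunk mid-sentence
)
splitter.warm_up()  # loads the NLTK sentence tokenizer

doc = Document(content="Moonlight shimmered softly. Wolves howled nearby. Night enveloped everything.")
result = splitter.run(documents=[doc])
for chunk in result["documents"]:
    print(chunk.meta["source_id"], repr(chunk.content))
```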
<a id="embedding_based_document_splitter"></a>

## Module embedding\_based\_document\_splitter

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter"></a>

### EmbeddingBasedDocumentSplitter

Splits documents based on embedding similarity using cosine distances between sequential sentence groups.

This component first splits text into sentences, optionally groups them, calculates embeddings for each group,
and then uses cosine distance between sequential embeddings to determine split points. Any distance above
the specified percentile is treated as a break point. The component also tracks page numbers based on form feed
characters ("\f") in the original document.

This component is inspired by [5 Levels of Text Splitting](
https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb
) by Greg Kamradt.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,  # Group 2 sentences before calculating embeddings
    percentile=0.95,        # Split when cosine distance exceeds the 95th percentile
    min_length=50,          # Merge splits shorter than 50 characters
    max_length=1000,        # Further split chunks longer than 1000 characters
)
splitter.warm_up()
result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk
# Each split document includes metadata: source_id, split_id, and page_number
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")
```

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.__init__"></a>

#### EmbeddingBasedDocumentSplitter.\_\_init\_\_

```python
def __init__(*,
             document_embedder: DocumentEmbedder,
             sentences_per_group: int = 3,
             percentile: float = 0.95,
             min_length: int = 50,
             max_length: int = 1000,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True)
```

Initialize EmbeddingBasedDocumentSplitter.

**Arguments**:

- `document_embedder`: The DocumentEmbedder to use for calculating embeddings.
- `sentences_per_group`: Number of sentences to group together before embedding.
- `percentile`: Percentile threshold for cosine distance. Distances above this percentile
are treated as break points.
- `min_length`: Minimum length of splits in characters. Splits below this length will be merged.
- `max_length`: Maximum length of splits in characters. Splits above this length will be recursively split.
- `language`: Language for sentence tokenization.
- `use_split_rules`: Whether to use additional split rules for sentence tokenization. Applies additional
split rules from SentenceSplitter to the sentence spans.
- `extend_abbreviations`: If True, the abbreviations used by NLTK's PunktTokenizer are extended by a list
of curated abbreviations. Currently supported languages are: en, de.
If False, the default abbreviations are used.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.warm_up"></a>

#### EmbeddingBasedDocumentSplitter.warm\_up

```python
def warm_up() -> None
```

Warm up the component by initializing the sentence splitter.
<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.run"></a>

#### EmbeddingBasedDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Split documents based on embedding similarity.

**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `RuntimeError`: If the component wasn't warmed up.
- `TypeError`: If the input is not a list of Documents.
- `ValueError`: If the document content is None or empty.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `split_id` to track the split number.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.to_dict"></a>

#### EmbeddingBasedDocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Serialized dictionary representation of the component.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.from_dict"></a>

#### EmbeddingBasedDocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "EmbeddingBasedDocumentSplitter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize and create the component.

**Returns**:

The deserialized component.

<a id="hierarchical_document_splitter"></a>

## Module hierarchical\_document\_splitter

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter"></a>

### HierarchicalDocumentSplitter

Splits documents into blocks of different sizes, building a hierarchical tree structure.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in
between are connected such that the smaller blocks are children of the larger parent blocks.
### Usage example
```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.__init__"></a>

#### HierarchicalDocumentSplitter.\_\_init\_\_

```python
def __init__(block_sizes: set[int],
             split_overlap: int = 0,
             split_by: Literal["word", "sentence", "page",
                               "passage"] = "word")
```

Initialize HierarchicalDocumentSplitter.

**Arguments**:

- `block_sizes`: Set of block sizes to split the document into. The blocks are split in descending order.
- `split_overlap`: The number of overlapping units for each split.
- `split_by`: The unit for splitting your documents.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.run"></a>

#### HierarchicalDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Builds a hierarchical document structure for each document in a list of documents.

**Arguments**:

- `documents`: List of Documents to split into hierarchical blocks.

**Returns**:

A dictionary with a key `"documents"`, containing the list of hierarchical Documents.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.build_hierarchy_from_doc"></a>

#### HierarchicalDocumentSplitter.build\_hierarchy\_from\_doc

```python
def build_hierarchy_from_doc(document: Document) -> list[Document]
```

Build a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes represented
as HierarchicalDocument objects.

**Arguments**:

- `document`: Document to split into hierarchical blocks.
**Returns**:

List of HierarchicalDocument

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.to_dict"></a>

#### HierarchicalDocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Returns a dictionary representation of the component.

**Returns**:

Serialized dictionary representation of the component.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.from_dict"></a>

#### HierarchicalDocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HierarchicalDocumentSplitter"
```

Deserialize this component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize and create the component.

**Returns**:

The deserialized component.

<a id="recursive_splitter"></a>

## Module recursive\_splitter

<a id="recursive_splitter.RecursiveDocumentSplitter"></a>

### RecursiveDocumentSplitter

Recursively chunks text into smaller chunks.

This component splits text by recursively applying a list of separators. The separators are applied in the
order they are provided, with the last separator being the most specific one.

Each separator is applied to the text in turn. Chunks that fit within `split_length` are kept; chunks that are
still larger than `split_length` are split again with the next separator in the list. This continues until all
chunks are smaller than the `split_length` parameter.

**Example**:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science.
Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

<a id="recursive_splitter.RecursiveDocumentSplitter.__init__"></a>

#### RecursiveDocumentSplitter.\_\_init\_\_

```python
def __init__(*,
             split_length: int = 200,
             split_overlap: int = 0,
             split_unit: Literal["word", "char", "token"] = "word",
             separators: list[str] | None = None,
             sentence_splitter_params: dict[str, Any] | None = None)
```

Initializes a RecursiveDocumentSplitter.

**Arguments**:

- `split_length`: The maximum length of each chunk, by default in words, but can be in characters or tokens.
See the `split_unit` parameter.
- `split_overlap`: The number of characters to overlap between consecutive chunks.
- `split_unit`: The unit of the `split_length` parameter. It can be either "word", "char", or "token".
If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base).
- `separators`: An optional list of separator strings to use for splitting the text. The string
separators will be treated as regular expressions unless the separator is "sentence", in which case the
text will be split into sentences using a custom sentence tokenizer based on NLTK.
See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter.
If no separators are provided, the default separators ["\n\n", "sentence", "\n", " "] are used.
- `sentence_splitter_params`: Optional parameters to pass to the sentence tokenizer.
See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.

**Raises**:

- `ValueError`: If the overlap is greater than or equal to the chunk size, if the overlap is negative, or
if any separator is not a string.
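A sketch of character-based chunking with the default separators; the length values are illustrative:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

# Measure chunk lengths in characters instead of words
chunker = RecursiveDocumentSplitter(split_length=80, split_overlap=10, split_unit="char")
chunker.warm_up()  # required because the default separators include "sentence"

doc = Document(content="AI is intelligence exhibited by machines. It is widely used in industry, government, and science.")
result = chunker.run([doc])
for chunk in result["documents"]:
    print(repr(chunk.content))
```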
<a id="recursive_splitter.RecursiveDocumentSplitter.warm_up"></a>

#### RecursiveDocumentSplitter.warm\_up

```python
def warm_up() -> None
```

Warm up the sentence tokenizer and tiktoken tokenizer if needed.

<a id="recursive_splitter.RecursiveDocumentSplitter.run"></a>

#### RecursiveDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Split a list of documents into documents with smaller chunks of text.

**Arguments**:

- `documents`: List of Documents to split.

**Returns**:

A dictionary containing a key "documents" with a list of Documents with smaller chunks of text corresponding
to the input documents.

<a id="text_cleaner"></a>

## Module text\_cleaner

<a id="text_cleaner.TextCleaner"></a>

### TextCleaner

Cleans text strings.

It can remove substrings matching a list of regular expressions, convert text to lowercase,
remove punctuation, and remove numbers.
Use it to clean up text data before evaluation.

### Usage example

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
```

<a id="text_cleaner.TextCleaner.__init__"></a>

#### TextCleaner.\_\_init\_\_

```python
def __init__(remove_regexps: list[str] | None = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)
```

Initializes the TextCleaner component.

**Arguments**:

- `remove_regexps`: A list of regex patterns to remove matching substrings from the text.
- `convert_to_lowercase`: If `True`, converts all characters to lowercase.
- `remove_punctuation`: If `True`, removes punctuation from the text.
- `remove_numbers`: If `True`, removes numerical digits from the text.

<a id="text_cleaner.TextCleaner.run"></a>

#### TextCleaner.run

```python
@component.output_types(texts=list[str])
def run(texts: list[str]) -> dict[str, Any]
```

Cleans up the given list of strings.

**Arguments**:

- `texts`: List of strings to clean.

**Returns**:

A dictionary with the following key:
- `texts`: the cleaned list of strings.
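A sketch of regex-based cleaning; the citation-marker pattern below is illustrative:

```python
from haystack.components.preprocessors import TextCleaner

# Strip bracketed citation markers such as "[1]" before evaluation
cleaner = TextCleaner(remove_regexps=[r"\[\d+\]"], convert_to_lowercase=True)
result = cleaner.run(texts=["Wolves howled nearby [1]. Night enveloped everything [2]."])
print(result["texts"])
```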