---
title: "PreProcessors"
id: preprocessors-api
description: "Preprocess your Documents and texts. Clean, split, and more."
slug: "/preprocessors-api"
---

<a id="csv_document_cleaner"></a>

## Module csv\_document\_cleaner

<a id="csv_document_cleaner.CSVDocumentCleaner"></a>

### CSVDocumentCleaner

A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents, allowing for the optional ignoring of a specified number of rows and columns before performing the cleaning operation. Additionally, it provides options to keep document IDs and control whether empty rows and columns should be removed.

<a id="csv_document_cleaner.CSVDocumentCleaner.__init__"></a>

#### CSVDocumentCleaner.\_\_init\_\_

```python
def __init__(*,
             ignore_rows: int = 0,
             ignore_columns: int = 0,
             remove_empty_rows: bool = True,
             remove_empty_columns: bool = True,
             keep_id: bool = False) -> None
```

Initializes the CSVDocumentCleaner component.

**Arguments**:

- `ignore_rows`: Number of rows to ignore from the top of the CSV table before processing.
- `ignore_columns`: Number of columns to ignore from the left of the CSV table before processing.
- `remove_empty_rows`: Whether to remove rows that are entirely empty.
- `remove_empty_columns`: Whether to remove columns that are entirely empty.
- `keep_id`: Whether to retain the original document ID in the output document.

Rows and columns ignored using these parameters are preserved in the final output, meaning they are not considered when removing empty rows and columns.
<a id="csv_document_cleaner.CSVDocumentCleaner.run"></a>

#### CSVDocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans CSV documents by removing empty rows and columns while preserving specified ignored rows and columns.

**Arguments**:

- `documents`: List of Documents containing CSV-formatted content.

**Returns**:

A dictionary with a list of cleaned Documents under the key `"documents"`.

Processing steps:
1. Reads each document's content as a CSV table.
2. Retains the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
3. Drops any rows and columns that are entirely empty (if enabled by `remove_empty_rows` and `remove_empty_columns`).
4. Reattaches the ignored rows and columns to maintain their original positions.
5. Returns the cleaned CSV content as a new `Document` object, with an option to retain the original document ID.

<a id="csv_document_splitter"></a>

## Module csv\_document\_splitter

<a id="csv_document_splitter.CSVDocumentSplitter"></a>

### CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:
- identify consecutive empty rows or columns that exceed a given threshold and use them as delimiters to segment the document into smaller tables.
- split each row into a separate sub-table, represented as a Document.

<a id="csv_document_splitter.CSVDocumentSplitter.__init__"></a>

#### CSVDocumentSplitter.\_\_init\_\_

```python
def __init__(row_split_threshold: int | None = 2,
             column_split_threshold: int | None = 2,
             read_csv_kwargs: dict[str, Any] | None = None,
             split_mode: SplitMode = "threshold") -> None
```

Initializes the CSVDocumentSplitter component.
**Arguments**:

- `row_split_threshold`: The minimum number of consecutive empty rows required to trigger a split.
- `column_split_threshold`: The minimum number of consecutive empty columns required to trigger a split.
- `read_csv_kwargs`: Additional keyword arguments to pass to `pandas.read_csv`.
By default, the component uses the following options:
  - `header=None`
  - `skip_blank_lines=False` to preserve blank lines
  - `dtype=object` to prevent type inference (e.g., converting numbers to floats).
See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
- `split_mode`: If `threshold`, the component will split the document based on the number of consecutive empty rows or columns that exceeds the `row_split_threshold` or `column_split_threshold`. If `row-wise`, the component will split each row into a separate sub-table.

<a id="csv_document_splitter.CSVDocumentSplitter.run"></a>

#### CSVDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Processes and splits a list of CSV documents into multiple sub-tables.

**Splitting Process:**
1. Applies a row-based split if `row_split_threshold` is provided.
2. Applies a column-based split if `column_split_threshold` is provided.
3. If both thresholds are specified, performs a recursive split by rows first, then columns, ensuring further fragmentation of any sub-tables that still contain empty sections.
4. Sorts the resulting sub-tables based on their original positions within the document.

**Arguments**:

- `documents`: A list of Documents containing CSV-formatted content. Each document is assumed to contain one or more tables separated by empty rows or columns.
**Returns**:

A dictionary with a key `"documents"`, mapping to a list of new `Document` objects, each representing an extracted sub-table from the original CSV.
The metadata of each document includes:
- A field `source_id` to track the original document.
- A field `row_idx_start` to indicate the starting row index of the sub-table in the original table.
- A field `col_idx_start` to indicate the starting column index of the sub-table in the original table.
- A field `split_id` to indicate the order of the split in the original document.
- All other metadata copied from the original document.

Notes:
- If a document cannot be processed, it is returned unchanged.
- The `meta` field from the original document is preserved in the split documents.

<a id="document_cleaner"></a>

## Module document\_cleaner

<a id="document_cleaner.DocumentCleaner"></a>

### DocumentCleaner

Cleans the text in the documents.

It removes extra whitespaces, empty lines, specified substrings, regexes, page headers and footers (in this order).
### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This is a document to clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```

<a id="document_cleaner.DocumentCleaner.__init__"></a>

#### DocumentCleaner.\_\_init\_\_

```python
def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
             ascii_only: bool = False)
```

Initialize DocumentCleaner.

**Arguments**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings (headers and footers) from pages. Pages must be separated by a form feed character "\f", which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- `remove_substrings`: List of substrings to remove from the text.
- `remove_regex`: Regex to match substrings and replace them with an empty string.
- `keep_id`: If `True`, keeps the IDs of the original documents.
- `unicode_normalization`: Unicode normalization form to apply to the text. Note: This will run before any other steps.
- `ascii_only`: Whether to convert the text to ASCII only. Will remove accents from characters and replace them with ASCII characters. Other non-ASCII characters will be removed. Note: This will run before any pattern matching or removal.
<a id="document_cleaner.DocumentCleaner.run"></a>

#### DocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Cleans up the documents.

**Arguments**:

- `documents`: List of Documents to clean.

**Raises**:

- `TypeError`: if documents is not a list of Documents.

**Returns**:

A dictionary with the following key:
- `documents`: List of cleaned Documents.

<a id="document_preprocessor"></a>

## Module document\_preprocessor

<a id="document_preprocessor.DocumentPreprocessor"></a>

### DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component consists of a DocumentSplitter followed by a DocumentCleaner in a single pipeline. It takes a list of documents as input and returns a processed list of documents.

Usage example:
```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```

<a id="document_preprocessor.DocumentPreprocessor.__init__"></a>

#### DocumentPreprocessor.\_\_init\_\_

```python
def __init__(*,
             split_by: Literal["function", "page", "passage", "period", "word", "line", "sentence"] = "word",
             split_length: int = 250,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
             ascii_only: bool = False) -> None
```

Initialize a DocumentPreProcessor that first splits and then cleans documents.

**Splitter Parameters**:

**Arguments**:

- `split_by`: The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
- `split_length`: The maximum number of units (words, lines, pages, and so on) in each split.
- `split_overlap`: The number of overlapping units between consecutive splits.
- `split_threshold`: The minimum number of units per split. If a split is smaller than this, it's merged with the previous split.
- `splitting_function`: A custom function for splitting if `split_by="function"`.
- `respect_sentence_boundary`: If `True`, splits by words but tries not to break inside a sentence.
- `language`: Language used by the sentence tokenizer if `split_by="sentence"` or `respect_sentence_boundary=True`.
- `use_split_rules`: Whether to apply additional splitting heuristics for the sentence splitter.
- `extend_abbreviations`: Whether to extend the sentence splitter with curated abbreviations for certain languages.

**Cleaner Parameters**:
- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings like headers/footers across pages.
- `keep_id`: If `True`, keeps the original document IDs.
- `remove_substrings`: A list of strings to remove from the document content.
- `remove_regex`: A regex pattern whose matches will be removed from the document content.
- `unicode_normalization`: Unicode normalization form to apply to the text, for example `"NFC"`.
- `ascii_only`: If `True`, converts text to ASCII only.

<a id="document_preprocessor.DocumentPreprocessor.to_dict"></a>

#### DocumentPreprocessor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serialize SuperComponent to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="document_preprocessor.DocumentPreprocessor.from_dict"></a>

#### DocumentPreprocessor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentPreprocessor"
```

Deserializes the SuperComponent from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized SuperComponent.

<a id="document_splitter"></a>

## Module document\_splitter

<a id="document_splitter.DocumentSplitter"></a>

### DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations and prevents exceeding language model context limits.
The DocumentSplitter is compatible with the following DocumentStores:
- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) (limited support: overlapping information is not stored)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) (limited support: overlapping information is not stored)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
```

<a id="document_splitter.DocumentSplitter.__init__"></a>

#### DocumentSplitter.\_\_init\_\_

```python
def __init__(split_by: Literal["function", "page", "passage", "period", "word", "line", "sentence"] = "word",
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             *,
             skip_empty_documents: bool = True)
```

Initialize DocumentSplitter.

**Arguments**:

- `split_by`: The unit for splitting your documents.
Choose from:
  - `word` for splitting by spaces (" ")
  - `period` for splitting by periods (".")
  - `page` for splitting by form feed ("\f")
  - `passage` for splitting by double line breaks ("\n\n")
  - `line` for splitting each line ("\n")
  - `sentence` for splitting by the NLTK sentence tokenizer
- `split_length`: The maximum number of units in each split.
- `split_overlap`: The number of overlapping units for each split.
- `split_threshold`: The minimum number of units per split. If a split has fewer units than the threshold, it's attached to the previous split.
- `splitting_function`: Necessary when `split_by` is set to "function". This is a function which must accept a single `str` as input and return a `list` of `str` as output, representing the chunks after splitting.
- `respect_sentence_boundary`: Choose whether to respect sentence boundaries when splitting by "word". If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
- `language`: Choose the language for the NLTK tokenizer. The default is English ("en").
- `use_split_rules`: Choose whether to use additional split rules when splitting by `sentence`.
- `extend_abbreviations`: Choose whether to extend NLTK's PunktTokenizer abbreviations with a list of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
- `skip_empty_documents`: Choose whether to skip documents with empty content. Default is True. Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents.

<a id="document_splitter.DocumentSplitter.warm_up"></a>

#### DocumentSplitter.warm\_up

```python
def warm_up()
```

Warm up the DocumentSplitter by loading the sentence tokenizer.
<a id="document_splitter.DocumentSplitter.run"></a>

#### DocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Split documents into smaller parts.

Splits documents by the unit expressed in `split_by`, with a length of `split_length` and an overlap of `split_overlap`.

**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `TypeError`: if the input is not a list of Documents.
- `ValueError`: if the content of a document is None.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

<a id="document_splitter.DocumentSplitter.to_dict"></a>

#### DocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

<a id="document_splitter.DocumentSplitter.from_dict"></a>

#### DocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentSplitter"
```

Deserializes the component from a dictionary.

<a id="hierarchical_document_splitter"></a>

## Module hierarchical\_document\_splitter

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter"></a>

### HierarchicalDocumentSplitter

Splits documents into blocks of different sizes, building a hierarchical tree structure.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between are connected such that the smaller blocks are children of the larger parent blocks.
## Usage example
```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.__init__"></a>

#### HierarchicalDocumentSplitter.\_\_init\_\_

```python
def __init__(block_sizes: set[int],
             split_overlap: int = 0,
             split_by: Literal["word", "sentence", "page", "passage"] = "word")
```

Initialize HierarchicalDocumentSplitter.

**Arguments**:

- `block_sizes`: Set of block sizes to split the document into. The blocks are split in descending order.
- `split_overlap`: The number of overlapping units for each split.
- `split_by`: The unit for splitting your documents.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.run"></a>

#### HierarchicalDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Builds a hierarchical document structure for each document in a list of documents.

**Arguments**:

- `documents`: List of Documents to split into hierarchical blocks.

**Returns**:

List of HierarchicalDocument

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.build_hierarchy_from_doc"></a>

#### HierarchicalDocumentSplitter.build\_hierarchy\_from\_doc

```python
def build_hierarchy_from_doc(document: Document) -> list[Document]
```

Build a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes, represented as HierarchicalDocument objects.

**Arguments**:

- `document`: Document to split into hierarchical blocks.

**Returns**:

List of HierarchicalDocument

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.to_dict"></a>

#### HierarchicalDocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Returns a dictionary representation of the component.

**Returns**:

Serialized dictionary representation of the component.
<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.from_dict"></a>

#### HierarchicalDocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HierarchicalDocumentSplitter"
```

Deserialize this component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize and create the component.

**Returns**:

The deserialized component.

<a id="recursive_splitter"></a>

## Module recursive\_splitter

<a id="recursive_splitter.RecursiveDocumentSplitter"></a>

### RecursiveDocumentSplitter

Recursively chunks text into smaller chunks.

This component splits text into smaller chunks by recursively applying a list of separators to the text.

The separators are applied in the order they are provided, with the last separator typically being the most specific one.

Each separator is applied to the text, and the component then checks each of the resulting chunks. Chunks that fit within the `split_length` are kept; chunks that are larger than the `split_length` are split further using the next separator in the list.

This is done until all chunks are smaller than the `split_length` parameter.

**Example**:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science.
Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

<a id="recursive_splitter.RecursiveDocumentSplitter.__init__"></a>

#### RecursiveDocumentSplitter.\_\_init\_\_

```python
def __init__(*,
             split_length: int = 200,
             split_overlap: int = 0,
             split_unit: Literal["word", "char", "token"] = "word",
             separators: list[str] | None = None,
             sentence_splitter_params: dict[str, Any] | None = None)
```

Initializes a RecursiveDocumentSplitter.

**Arguments**:

- `split_length`: The maximum length of each chunk, by default in words, but it can be in characters or tokens. See the `split_unit` parameter.
- `split_overlap`: The number of characters to overlap between consecutive chunks.
- `split_unit`: The unit of the `split_length` parameter.
It can be either "word", "char", or "token". If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base).
- `separators`: An optional list of separator strings to use for splitting the text. The string separators will be treated as regular expressions, unless the separator is "sentence", in which case the text will be split into sentences using a custom sentence tokenizer based on NLTK. See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter. If no separators are provided, the default separators ["\n\n", "sentence", "\n", " "] are used.
- `sentence_splitter_params`: Optional parameters to pass to the sentence tokenizer. See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.

**Raises**:

- `ValueError`: If the overlap is greater than or equal to the chunk size, if the overlap is negative, or if any separator is not a string.

<a id="recursive_splitter.RecursiveDocumentSplitter.warm_up"></a>

#### RecursiveDocumentSplitter.warm\_up

```python
def warm_up() -> None
```

Warm up the sentence tokenizer and the tiktoken tokenizer if needed.

<a id="recursive_splitter.RecursiveDocumentSplitter.run"></a>

#### RecursiveDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Split a list of documents into documents with smaller chunks of text.

**Arguments**:

- `documents`: List of Documents to split.

**Returns**:

A dictionary containing a key "documents" with a list of Documents with smaller chunks of text corresponding to the input documents.

<a id="text_cleaner"></a>

## Module text\_cleaner

<a id="text_cleaner.TextCleaner"></a>

### TextCleaner

Cleans text strings.
It can remove substrings matching a list of regular expressions, convert text to lowercase, remove punctuation, and remove numbers.
Use it to clean up text data before evaluation.

### Usage example

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
```

<a id="text_cleaner.TextCleaner.__init__"></a>

#### TextCleaner.\_\_init\_\_

```python
def __init__(remove_regexps: list[str] | None = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)
```

Initializes the TextCleaner component.

**Arguments**:

- `remove_regexps`: A list of regex patterns to remove matching substrings from the text.
- `convert_to_lowercase`: If `True`, converts all characters to lowercase.
- `remove_punctuation`: If `True`, removes punctuation from the text.
- `remove_numbers`: If `True`, removes numerical digits from the text.

<a id="text_cleaner.TextCleaner.run"></a>

#### TextCleaner.run

```python
@component.output_types(texts=list[str])
def run(texts: list[str]) -> dict[str, Any]
```

Cleans up the given list of strings.

**Arguments**:

- `texts`: List of strings to clean.

**Returns**:

A dictionary with the following key:
- `texts`: the cleaned list of strings.