---
title: "RecursiveDocumentSplitter"
id: recursivesplitter
slug: "/recursivesplitter"
description: "This component recursively breaks down text into smaller chunks by applying a given list of separators to the text."
---

# RecursiveDocumentSplitter

This component recursively breaks down text into smaller chunks by applying a given list of separators to the text.

<div className="key-value-table">

| | |
| --- | --- |
| Most common position in a pipeline | In indexing pipelines after [Converters](../converters.mdx) and [`DocumentCleaner`](documentcleaner.mdx), before [Classifiers](../classifiers.mdx) |
| Mandatory run variables | `documents`: A list of documents |
| Output variables | `documents`: A list of documents |
| API reference | [PreProcessors](/reference/preprocessors-api) |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/recursive_splitter.py |

</div>

## Overview

The `RecursiveDocumentSplitter` expects a list of documents as input and returns a list of documents with split texts. You can set the following parameters when initializing the component:

- `split_length`: The maximum length of each chunk, in words by default. See the `split_unit` parameter to change the unit.
- `split_overlap`: The number of characters or words that overlap between consecutive chunks.
- `split_unit`: The unit of the `split_length` parameter. Can be either `"word"`, `"char"`, or `"token"`.
- `separators`: An optional list of separator strings to use for splitting the text. If you don't provide any separators, the defaults are `["\n\n", "sentence", "\n", " "]`. String separators are treated as regular expressions. The special separator `"sentence"` splits the text into sentences using a custom sentence tokenizer based on NLTK. See the [SentenceSplitter](https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/sentence_tokenizer.py#L116) code for more information.
- `sentence_splitter_params`: Optional parameters to pass to the [SentenceSplitter](https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/sentence_tokenizer.py#L116).

The separators are applied in the order they are defined in the list. The first separator is applied to the text; any resulting chunk that is within the specified `split_length` is kept. For chunks that exceed `split_length`, the next separator in the list is applied. If all separators have been applied and a chunk still exceeds `split_length`, a hard split occurs at `split_length`, counting in the configured unit (words or characters). This process repeats until every chunk is within `split_length`. The sketch below illustrates the cascade.
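Here is a minimal sketch of the cascade, using illustrative text and parameter values (not taken from the API reference): with `split_unit="char"` and a small `split_length`, chunks that are still too long after splitting on `"\n\n"` fall through to `"\n"` and then `" "`.

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

# Illustrative values: any chunk longer than 40 characters falls through
# the separator list "\n\n" -> "\n" -> " " before a hard split is applied.
splitter = RecursiveDocumentSplitter(
    split_length=40,
    split_overlap=0,
    split_unit="char",
    separators=["\n\n", "\n", " "],
)
splitter.warm_up()

text = "Short paragraph.\n\nA much longer paragraph that has to be split again on spaces."
chunks = splitter.run(documents=[Document(content=text)])["documents"]
for chunk in chunks:
    print(repr(chunk.content))
```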
## Usage

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

### In a pipeline

Here's how you can use `RecursiveDocumentSplitter` in an indexing pipeline:

```python
from pathlib import Path

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import RecursiveDocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(
    instance=RecursiveDocumentSplitter(
        split_length=400,
        split_overlap=0,
        split_unit="char",
        separators=["\n\n", "\n", "sentence", " "],
        sentence_splitter_params={
            "language": "en",
            "use_split_rules": True,
            "keep_white_spaces": False,
        },
    ),
    name="recursive_splitter",
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "recursive_splitter.documents")
p.connect("recursive_splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})
```
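As a quick sanity check after indexing, you can list what the splitter wrote to the store. This is a minimal sketch, assuming the pipeline above ran over real files:

```python
# filter_documents() with no filters returns every document in the
# InMemoryDocumentStore, so you can inspect the resulting chunks.
for doc in document_store.filter_documents():
    print(doc.meta.get("split_id"), repr(doc.content[:60]))
```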