---
title: "RecursiveDocumentSplitter"
id: recursivesplitter
slug: "/recursivesplitter"
description: "This component recursively breaks down text into smaller chunks by applying a given list of separators to the text."
---

# RecursiveDocumentSplitter

This component recursively breaks down text into smaller chunks by applying a given list of separators to the text.

<div className="key-value-table">

| | |
| --- | --- |
| Most common position in a pipeline | In indexing pipelines after [Converters](../converters.mdx) and [`DocumentCleaner`](documentcleaner.mdx), before [Classifiers](../classifiers.mdx) |
| Mandatory run variables | `documents`: A list of documents |
| Output variables | `documents`: A list of documents |
| API reference | [PreProcessors](/reference/preprocessors-api) |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/recursive_splitter.py |

</div>

## Overview

The `RecursiveDocumentSplitter` expects a list of documents as input and returns a list of documents with split texts. You can set the following parameters when initializing the component:

- `split_length`: The maximum length of each chunk, in words by default. See the `split_unit` parameter to change the unit.
- `split_overlap`: The number of characters or words that overlap between consecutive chunks.
- `split_unit`: The unit of the `split_length` parameter. Can be either `"word"`, `"char"`, or `"token"`.
- `separators`: An optional list of separator strings to use for splitting the text. If you don't provide any separators, the defaults are `["\n\n", "sentence", "\n", " "]`. String separators are treated as regular expressions. The special separator `"sentence"` splits the text into sentences using a custom sentence tokenizer based on NLTK. See the [SentenceSplitter](https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/sentence_tokenizer.py#L116) code for more information.
- `sentence_splitter_params`: Optional parameters to pass to the [SentenceSplitter](https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/sentence_tokenizer.py#L116).

The separators are applied in the order they are defined in the list. The first separator is applied to the text; any resulting chunk that is within the specified `split_length` is kept. For chunks that exceed `split_length`, the next separator in the list is applied. If all separators have been applied and a chunk still exceeds `split_length`, a hard split occurs at `split_length`, counting in the configured unit (words or characters). This process repeats until every chunk is within `split_length`. The sketch below illustrates the cascade.
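Here is a minimal sketch of the cascade, using illustrative text and parameter values (not taken from the API reference): with `split_unit="char"` and a small `split_length`, chunks that are still too long after splitting on `"\n\n"` fall through to `"\n"` and then `" "`.

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

# Illustrative values: any chunk longer than 40 characters falls through
# the separator list "\n\n" -> "\n" -> " " before a hard split is applied.
splitter = RecursiveDocumentSplitter(
    split_length=40,
    split_overlap=0,
    split_unit="char",
    separators=["\n\n", "\n", " "],
)
splitter.warm_up()

text = "Short paragraph.\n\nA much longer paragraph that has to be split again on spaces."
chunks = splitter.run(documents=[Document(content=text)])["documents"]
for chunk in chunks:
    print(repr(chunk.content))
```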
## Usage

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

### In a pipeline

Here's how you can use `RecursiveDocumentSplitter` in an indexing pipeline:

```python
from pathlib import Path

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import RecursiveDocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(
    instance=RecursiveDocumentSplitter(
        split_length=400,
        split_overlap=0,
        split_unit="char",
        separators=["\n\n", "\n", "sentence", " "],
        sentence_splitter_params={
            "language": "en",
            "use_split_rules": True,
            "keep_white_spaces": False,
        },
    ),
    name="recursive_splitter",
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "recursive_splitter.documents")
p.connect("recursive_splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})
```
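As a quick sanity check after indexing, you can list what the splitter wrote to the store. This is a minimal sketch, assuming the pipeline above ran over real files:

```python
# filter_documents() with no filters returns every document in the
# InMemoryDocumentStore, so you can inspect the resulting chunks.
for doc in document_store.filter_documents():
    print(doc.meta.get("split_id"), repr(doc.content[:60]))
```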