---
title: "PreProcessors"
id: preprocessors-api
description: "Preprocess your Documents and texts. Clean, split, and more."
slug: "/preprocessors-api"
---

<a id="csv_document_cleaner"></a>

## Module csv\_document\_cleaner

<a id="csv_document_cleaner.CSVDocumentCleaner"></a>

### CSVDocumentCleaner

A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents, optionally ignoring a
specified number of rows and columns before performing the cleaning operation. It also
provides options to keep document IDs and to control whether empty rows and columns
are removed.

<a id="csv_document_cleaner.CSVDocumentCleaner.__init__"></a>

#### CSVDocumentCleaner.\_\_init\_\_

```python
def __init__(*,
             ignore_rows: int = 0,
             ignore_columns: int = 0,
             remove_empty_rows: bool = True,
             remove_empty_columns: bool = True,
             keep_id: bool = False) -> None
```

Initializes the CSVDocumentCleaner component.

**Arguments**:

- `ignore_rows`: Number of rows to ignore from the top of the CSV table before processing.
- `ignore_columns`: Number of columns to ignore from the left of the CSV table before processing.
- `remove_empty_rows`: Whether to remove rows that are entirely empty.
- `remove_empty_columns`: Whether to remove columns that are entirely empty.
- `keep_id`: Whether to retain the original document ID in the output document.

Rows and columns ignored using these parameters are preserved in the final output, meaning
they are not considered when removing empty rows and columns.

<a id="csv_document_cleaner.CSVDocumentCleaner.run"></a>

#### CSVDocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans CSV documents by removing empty rows and columns while preserving specified ignored rows and columns.

**Arguments**:

- `documents`: List of Documents containing CSV-formatted content.

**Returns**:

A dictionary with a list of cleaned Documents under the key "documents".

Processing steps:
1. Reads each document's content as a CSV table.
2. Retains the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
3. Drops any rows and columns that are entirely empty (if enabled by `remove_empty_rows` and
   `remove_empty_columns`).
4. Reattaches the ignored rows and columns to maintain their original positions.
5. Returns the cleaned CSV content as a new `Document` object, with an option to retain the original
   document ID.
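
The steps above can be sketched with the standard library alone. This is an illustrative approximation, not the component's actual implementation (which operates on pandas DataFrames); `clean_csv` is a hypothetical helper name.

```python
import csv
import io


def clean_csv(content: str, ignore_rows: int = 0, ignore_columns: int = 0) -> str:
    """Drop fully empty rows and columns, preserving the ignored leading rows/columns."""
    rows = list(csv.reader(io.StringIO(content)))
    header_rows, body = rows[:ignore_rows], rows[ignore_rows:]

    # A body row survives if any cell beyond the ignored columns is non-empty.
    body = [r for r in body if any(cell.strip() for cell in r[ignore_columns:])]

    # A column survives if it is ignored or non-empty in at least one body row.
    width = max((len(r) for r in rows), default=0)
    keep = [i for i in range(width)
            if i < ignore_columns or any(i < len(r) and r[i].strip() for r in body)]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for r in header_rows + body:
        writer.writerow([r[i] if i < len(r) else "" for i in keep])
    return out.getvalue()
```

For example, with `ignore_rows=1` and `ignore_columns=1`, the header row and first column are kept verbatim while an all-empty data column is dropped.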

<a id="csv_document_splitter"></a>

## Module csv\_document\_splitter

<a id="csv_document_splitter.CSVDocumentSplitter"></a>

### CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:
- `threshold`: identifies consecutive empty rows or columns that exceed a given threshold
  and uses them as delimiters to segment the document into smaller tables.
- `row-wise`: splits each row into a separate sub-table, represented as a Document.

<a id="csv_document_splitter.CSVDocumentSplitter.__init__"></a>

#### CSVDocumentSplitter.\_\_init\_\_

```python
def __init__(row_split_threshold: int | None = 2,
             column_split_threshold: int | None = 2,
             read_csv_kwargs: dict[str, Any] | None = None,
             split_mode: SplitMode = "threshold") -> None
```

Initializes the CSVDocumentSplitter component.

**Arguments**:

- `row_split_threshold`: The minimum number of consecutive empty rows required to trigger a split.
- `column_split_threshold`: The minimum number of consecutive empty columns required to trigger a split.
- `read_csv_kwargs`: Additional keyword arguments to pass to `pandas.read_csv`.
By default, the component uses the following options:
  - `header=None`
  - `skip_blank_lines=False` to preserve blank lines
  - `dtype=object` to prevent type inference (for example, converting numbers to floats).
See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
- `split_mode`: If `threshold`, the component splits the document based on the number of
consecutive empty rows or columns that exceeds `row_split_threshold` or `column_split_threshold`.
If `row-wise`, the component splits each row into a separate sub-table.

<a id="csv_document_splitter.CSVDocumentSplitter.run"></a>

#### CSVDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Processes and splits a list of CSV documents into multiple sub-tables.

**Splitting Process:**
1. Applies a row-based split if `row_split_threshold` is provided.
2. Applies a column-based split if `column_split_threshold` is provided.
3. If both thresholds are specified, performs a recursive split by rows first, then columns, ensuring
   further fragmentation of any sub-tables that still contain empty sections.
4. Sorts the resulting sub-tables based on their original positions within the document.
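
The row-based split in step 1 can be illustrated with a small standard-library sketch. It is a simplification: the real component works on pandas DataFrames, also splits by columns, and attaches metadata; `split_rows` is a hypothetical name, and delimiter rows below the threshold are simply dropped here.

```python
def split_rows(rows: list[list[str]], threshold: int = 2) -> list[list[list[str]]]:
    """Split a table wherever at least `threshold` consecutive empty rows occur."""
    tables: list[list[list[str]]] = []
    current: list[list[str]] = []
    empty_run = 0
    for row in rows:
        if any(cell.strip() for cell in row):
            if empty_run >= threshold and current:
                tables.append(current)  # the empty run delimits a new sub-table
                current = []
            current.append(row)
            empty_run = 0
        else:
            empty_run += 1
    if current:
        tables.append(current)
    return tables
```

A run of two empty rows splits the table in two, while a single empty row (below the default threshold) does not trigger a split.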

**Arguments**:

- `documents`: A list of Documents containing CSV-formatted content.
Each document is assumed to contain one or more tables separated by empty rows or columns.

**Returns**:

A dictionary with a key `"documents"`, mapping to a list of new `Document` objects,
each representing an extracted sub-table from the original CSV.
The metadata of each document includes:
- A field `source_id` to track the original document.
- A field `row_idx_start` to indicate the starting row index of the sub-table in the original table.
- A field `col_idx_start` to indicate the starting column index of the sub-table in the original table.
- A field `split_id` to indicate the order of the split in the original document.
- All other metadata copied from the original document.

If a document cannot be processed, it is returned unchanged.
The `meta` field from the original document is preserved in the split documents.

<a id="document_cleaner"></a>

## Module document\_cleaner

<a id="document_cleaner.DocumentCleaner"></a>

### DocumentCleaner

Cleans the text in the documents.

It removes extra whitespaces, empty lines, specified substrings, regex matches,
and page headers and footers (in this order).

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This   is  a  document  to  clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```

<a id="document_cleaner.DocumentCleaner.__init__"></a>

#### DocumentCleaner.\_\_init\_\_

```python
def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"]
             | None = None,
             ascii_only: bool = False)
```

Initialize DocumentCleaner.

**Arguments**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings (headers and footers) from pages.
Pages must be separated by a form feed character "\f",
which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- `remove_substrings`: List of substrings to remove from the text.
- `remove_regex`: Regex pattern whose matches are replaced with an empty string.
- `keep_id`: If `True`, keeps the IDs of the original documents.
- `unicode_normalization`: Unicode normalization form to apply to the text.
Note: This runs before any other steps.
- `ascii_only`: Whether to convert the text to ASCII only.
Removes accents from characters and replaces them with ASCII characters.
Other non-ASCII characters are removed.
Note: This runs before any pattern matching or removal.
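
The `ascii_only` behavior can be approximated with the standard library's `unicodedata`: normalize to a decomposed form, then drop everything that does not encode as ASCII. This is an illustrative sketch rather than the component's exact code, and `to_ascii` is a hypothetical helper name.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Decompose accented characters, then drop combining marks and other non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")
```

Accented characters lose their accents, while characters with no ASCII equivalent are removed entirely.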

<a id="document_cleaner.DocumentCleaner.run"></a>

#### DocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Cleans up the documents.

**Arguments**:

- `documents`: List of Documents to clean.

**Raises**:

- `TypeError`: If `documents` is not a list of Documents.

**Returns**:

A dictionary with the following key:
- `documents`: List of cleaned Documents.

<a id="document_preprocessor"></a>

## Module document\_preprocessor

<a id="document_preprocessor.DocumentPreprocessor"></a>

### DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component consists of a DocumentSplitter followed by a DocumentCleaner in a single pipeline.
It takes a list of documents as input and returns a processed list of documents.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```

<a id="document_preprocessor.DocumentPreprocessor.__init__"></a>

#### DocumentPreprocessor.\_\_init\_\_

```python
def __init__(*,
             split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 250,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"]
             | None = None,
             ascii_only: bool = False) -> None
```

Initialize a DocumentPreprocessor that first splits and then cleans documents.

**Arguments**:

**Splitter Parameters**:
- `split_by`: The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
- `split_length`: The maximum number of units (words, lines, pages, and so on) in each split.
- `split_overlap`: The number of overlapping units between consecutive splits.
- `split_threshold`: The minimum number of units per split. If a split is smaller than this, it's merged
with the previous split.
- `splitting_function`: A custom function for splitting if `split_by="function"`.
- `respect_sentence_boundary`: If `True`, splits by words but tries not to break inside a sentence.
- `language`: Language used by the sentence tokenizer if `split_by="sentence"` or
`respect_sentence_boundary=True`.
- `use_split_rules`: Whether to apply additional splitting heuristics for the sentence splitter.
- `extend_abbreviations`: Whether to extend the sentence splitter with curated abbreviations for certain
languages.

**Cleaner Parameters**:
- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings like headers/footers across pages.
- `keep_id`: If `True`, keeps the original document IDs.
- `remove_substrings`: A list of strings to remove from the document content.
- `remove_regex`: A regex pattern whose matches will be removed from the document content.
- `unicode_normalization`: Unicode normalization form to apply to the text, for example `"NFC"`.
- `ascii_only`: If `True`, converts text to ASCII only.

<a id="document_preprocessor.DocumentPreprocessor.to_dict"></a>

#### DocumentPreprocessor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the SuperComponent to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="document_preprocessor.DocumentPreprocessor.from_dict"></a>

#### DocumentPreprocessor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentPreprocessor"
```

Deserializes the SuperComponent from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized SuperComponent.

<a id="document_splitter"></a>

## Module document\_splitter

<a id="document_splitter.DocumentSplitter"></a>

### DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations
and prevents exceeding language model context limits.

The DocumentSplitter is compatible with the following DocumentStores:
- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) (limited support: overlapping information is
  not stored)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) (limited support: overlapping
  information is not stored)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
```

<a id="document_splitter.DocumentSplitter.__init__"></a>

#### DocumentSplitter.\_\_init\_\_

```python
def __init__(split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             *,
             skip_empty_documents: bool = True)
```

Initialize DocumentSplitter.

**Arguments**:

- `split_by`: The unit for splitting your documents. Choose from:
  - `word` for splitting by spaces (" ")
  - `period` for splitting by periods (".")
  - `page` for splitting by form feed ("\f")
  - `passage` for splitting by double line breaks ("\n\n")
  - `line` for splitting each line ("\n")
  - `sentence` for splitting by the NLTK sentence tokenizer
  - `function` for splitting with a custom `splitting_function`
- `split_length`: The maximum number of units in each split.
- `split_overlap`: The number of overlapping units for each split.
- `split_threshold`: The minimum number of units per split. If a split has fewer units
than the threshold, it's attached to the previous split.
- `splitting_function`: Necessary when `split_by` is set to "function".
This is a function which must accept a single `str` as input and return a `list` of `str` as output,
representing the chunks after splitting.
- `respect_sentence_boundary`: Choose whether to respect sentence boundaries when splitting by "word".
If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
- `language`: Choose the language for the NLTK tokenizer. The default is English ("en").
- `use_split_rules`: Choose whether to use additional split rules when splitting by `sentence`.
- `extend_abbreviations`: Choose whether to extend NLTK's PunktTokenizer abbreviations with a list
of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
- `skip_empty_documents`: Choose whether to skip documents with empty content. Default is True.
Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
from non-textual documents.
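
The interaction of `split_length`, `split_overlap`, and `split_threshold` for `split_by="word"` can be sketched as follows. This is a simplified illustration that joins words with single spaces and ignores metadata; the actual component preserves the original whitespace, and `split_words` is a hypothetical name.

```python
def split_words(text: str, split_length: int,
                split_overlap: int = 0, split_threshold: int = 0) -> list[str]:
    """Window the words with a step of split_length - split_overlap."""
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(words[i:i + split_length])
        if i + split_length >= len(words):  # last window reached the end
            break
    # A final chunk smaller than the threshold is merged into the previous one.
    if len(chunks) > 1 and len(chunks[-1]) < split_threshold:
        chunks[-2].extend(chunks.pop())
    return [" ".join(c) for c in chunks]
```

With an overlap of 1, each chunk repeats the last word of its predecessor; with a threshold of 2, a trailing one-word chunk is folded into the previous split.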

<a id="document_splitter.DocumentSplitter.warm_up"></a>

#### DocumentSplitter.warm\_up

```python
def warm_up()
```

Warm up the DocumentSplitter by loading the sentence tokenizer.

<a id="document_splitter.DocumentSplitter.run"></a>

#### DocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Split documents into smaller parts.

Splits documents by the unit expressed in `split_by`, with a length of `split_length`
and an overlap of `split_overlap`.

**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `TypeError`: If the input is not a list of Documents.
- `ValueError`: If the content of a document is None.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

<a id="document_splitter.DocumentSplitter.to_dict"></a>

#### DocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

<a id="document_splitter.DocumentSplitter.from_dict"></a>

#### DocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentSplitter"
```

Deserializes the component from a dictionary.

<a id="hierarchical_document_splitter"></a>

## Module hierarchical\_document\_splitter

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter"></a>

### HierarchicalDocumentSplitter

Splits a document into blocks of different sizes, building a hierarchical tree structure.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between
are connected such that smaller blocks are children of the larger parent blocks.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.__init__"></a>

#### HierarchicalDocumentSplitter.\_\_init\_\_

```python
def __init__(block_sizes: set[int],
             split_overlap: int = 0,
             split_by: Literal["word", "sentence", "page",
                               "passage"] = "word")
```

Initialize HierarchicalDocumentSplitter.

**Arguments**:

- `block_sizes`: Set of block sizes to split the document into. The blocks are split in descending order.
- `split_overlap`: The number of overlapping units for each split.
- `split_by`: The unit for splitting your documents.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.run"></a>

#### HierarchicalDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Builds a hierarchical document structure for each document in a list of documents.

**Arguments**:

- `documents`: List of Documents to split into hierarchical blocks.

**Returns**:

A dictionary with the key `documents`, containing the list of hierarchical Documents.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.build_hierarchy_from_doc"></a>

#### HierarchicalDocumentSplitter.build\_hierarchy\_from\_doc

```python
def build_hierarchy_from_doc(document: Document) -> list[Document]
```

Build a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes, represented
as HierarchicalDocument objects.

**Arguments**:

- `document`: Document to split into hierarchical blocks.

**Returns**:

List of HierarchicalDocument objects.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.to_dict"></a>

#### HierarchicalDocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Returns a dictionary representation of the component.

**Returns**:

Serialized dictionary representation of the component.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.from_dict"></a>

#### HierarchicalDocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HierarchicalDocumentSplitter"
```

Deserialize this component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize and create the component.

**Returns**:

The deserialized component.

<a id="recursive_splitter"></a>

## Module recursive\_splitter

<a id="recursive_splitter.RecursiveDocumentSplitter"></a>

### RecursiveDocumentSplitter

Recursively chunks text into smaller chunks.

This component splits text into smaller chunks by recursively applying a list of separators.

The separators are applied in the order they are provided, from the most general to the most
specific: the last separator in the list is the most specific one.

Each separator is applied to the text in turn. Chunks that fit within `split_length` are kept;
for chunks that are still larger than `split_length`, the next separator in the list is applied
to the remaining text.

This continues until all chunks are smaller than the `split_length` parameter.
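
The recursion described above can be sketched in a few lines. This is an illustrative simplification that measures `split_length` in characters, treats every separator as a regex, and omits the overlap and chunk-merging logic of the real component; `recursive_split` is a hypothetical name.

```python
import re


def recursive_split(text: str, separators: list[str], split_length: int) -> list[str]:
    """Apply separators in order until every chunk fits within split_length characters."""
    if len(text) <= split_length or not separators:
        return [text]
    # Split on the first separator, keeping it attached to the preceding chunk.
    pieces = re.split(f"({separators[0]})", text)
    parts = []
    for i in range(0, len(pieces), 2):
        chunk = pieces[i] + (pieces[i + 1] if i + 1 < len(pieces) else "")
        if chunk:
            parts.append(chunk)
    chunks = []
    for part in parts:
        if len(part) <= split_length:
            chunks.append(part)  # small enough: keep as-is
        else:
            chunks.extend(recursive_split(part, separators[1:], split_length))
    return chunks
```

A paragraph that fits within the limit is kept whole after the first separator; an oversized paragraph is split again with the next, more specific separator.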

**Example**:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

<a id="recursive_splitter.RecursiveDocumentSplitter.__init__"></a>

#### RecursiveDocumentSplitter.\_\_init\_\_

```python
def __init__(*,
             split_length: int = 200,
             split_overlap: int = 0,
             split_unit: Literal["word", "char", "token"] = "word",
             separators: list[str] | None = None,
             sentence_splitter_params: dict[str, Any] | None = None)
```

Initializes a RecursiveDocumentSplitter.

**Arguments**:

- `split_length`: The maximum length of each chunk, by default in words, but optionally in characters or tokens.
See the `split_unit` parameter.
- `split_overlap`: The number of characters to overlap between consecutive chunks.
- `split_unit`: The unit of the `split_length` parameter. It can be either "word", "char", or "token".
If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base).
- `separators`: An optional list of separator strings to use for splitting the text. The string
separators are treated as regular expressions, unless the separator is "sentence", in which case the
text is split into sentences using a custom sentence tokenizer based on NLTK.
See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter.
If no separators are provided, the default separators ["\n\n", "sentence", "\n", " "] are used.
- `sentence_splitter_params`: Optional parameters to pass to the sentence tokenizer.
See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.

**Raises**:

- `ValueError`: If the overlap is greater than or equal to the chunk size, if the overlap is negative, or
if any separator is not a string.

<a id="recursive_splitter.RecursiveDocumentSplitter.warm_up"></a>

#### RecursiveDocumentSplitter.warm\_up

```python
def warm_up() -> None
```

Warms up the sentence tokenizer and the tiktoken tokenizer if needed.

<a id="recursive_splitter.RecursiveDocumentSplitter.run"></a>

#### RecursiveDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Splits a list of documents into documents with smaller chunks of text.

**Arguments**:

- `documents`: List of Documents to split.

**Returns**:

A dictionary containing a key "documents" with a list of Documents with smaller chunks of text corresponding
to the input documents.

<a id="text_cleaner"></a>

## Module text\_cleaner

<a id="text_cleaner.TextCleaner"></a>

### TextCleaner

Cleans text strings.

It can remove substrings matching a list of regular expressions, convert text to lowercase,
remove punctuation, and remove numbers.
Use it to clean up text data before evaluation.

### Usage example

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
```

<a id="text_cleaner.TextCleaner.__init__"></a>

#### TextCleaner.\_\_init\_\_

```python
def __init__(remove_regexps: list[str] | None = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)
```

Initializes the TextCleaner component.

**Arguments**:

- `remove_regexps`: A list of regex patterns to remove matching substrings from the text.
- `convert_to_lowercase`: If `True`, converts all characters to lowercase.
- `remove_punctuation`: If `True`, removes punctuation from the text.
- `remove_numbers`: If `True`, removes numerical digits from the text.

<a id="text_cleaner.TextCleaner.run"></a>

#### TextCleaner.run

```python
@component.output_types(texts=list[str])
def run(texts: list[str]) -> dict[str, Any]
```

Cleans up the given list of strings.

**Arguments**:

- `texts`: List of strings to clean.

**Returns**:

A dictionary with the following key:
- `texts`: The cleaned list of strings.
797