---
title: "PreProcessors"
id: preprocessors-api
description: "Preprocess your Documents and texts. Clean, split, and more."
slug: "/preprocessors-api"
---

## csv_document_cleaner

### CSVDocumentCleaner

A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents. It can optionally ignore
a specified number of rows and columns before performing the cleaning operation,
and it provides options to keep document IDs and to control whether empty rows and
columns are removed.

#### __init__

```python
__init__(
    *,
    ignore_rows: int = 0,
    ignore_columns: int = 0,
    remove_empty_rows: bool = True,
    remove_empty_columns: bool = True,
    keep_id: bool = False
) -> None
```

Initializes the CSVDocumentCleaner component.

**Parameters:**

- **ignore_rows** (<code>int</code>) – Number of rows to ignore from the top of the CSV table before processing.
- **ignore_columns** (<code>int</code>) – Number of columns to ignore from the left of the CSV table before processing.
- **remove_empty_rows** (<code>bool</code>) – Whether to remove rows that are entirely empty.
- **remove_empty_columns** (<code>bool</code>) – Whether to remove columns that are entirely empty.
- **keep_id** (<code>bool</code>) – Whether to retain the original document ID in the output document.

Rows and columns ignored using these parameters are preserved in the final output, meaning
they are not considered when removing empty rows and columns.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans CSV documents by removing empty rows and columns while preserving specified ignored rows and columns.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents containing CSV-formatted content.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with a list of cleaned Documents under the key "documents".

Processing steps:

1. Reads each document's content as a CSV table.
1. Retains the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
1. Drops any rows and columns that are entirely empty (if enabled by `remove_empty_rows` and
   `remove_empty_columns`).
1. Reattaches the ignored rows and columns to maintain their original positions.
1. Returns the cleaned CSV content as a new `Document` object, with an option to retain the original
   document ID.

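The processing steps above can be sketched in plain Python. This is an illustrative simplification only: the actual component operates on pandas DataFrames inside Haystack `Document` objects, and the helper name `clean_table` and the list-of-lists table representation are assumptions made for the example.

```python
def clean_table(table, ignore_rows=0, ignore_columns=0):
    """Remove fully empty rows/columns, preserving an ignored top-left region.

    `table` is a list of rows; each row is a list of cell strings ("" = empty).
    """
    kept_top = table[:ignore_rows]
    body = table[ignore_rows:]

    # Drop rows that are entirely empty outside the ignored columns.
    body = [row for row in body if any(cell != "" for cell in row[ignore_columns:])]

    # Find columns (beyond the ignored ones) that are empty in every remaining row.
    width = max((len(row) for row in table), default=0)
    empty_cols = {
        j for j in range(ignore_columns, width)
        if all(j >= len(row) or row[j] == "" for row in body)
    }

    # Rebuild rows, keeping ignored columns and non-empty columns.
    def keep(row):
        return [cell for j, cell in enumerate(row) if j < ignore_columns or j not in empty_cols]

    return [keep(r) for r in kept_top] + [keep(r) for r in body]


table = [
    ["h1", "h2", "h3"],
    ["a", "", "c"],
    ["", "", ""],      # fully empty row -> removed
    ["d", "", "f"],
]
print(clean_table(table, ignore_rows=1))
# → [['h1', 'h3'], ['a', 'c'], ['d', 'f']]
```

Note how the ignored header row is reattached at the top and is not consulted when deciding which rows and columns count as empty.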
## csv_document_splitter

### CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:

- `threshold`: identifies consecutive empty rows or columns that exceed a given threshold
  and uses them as delimiters to segment the document into smaller tables.
- `row-wise`: splits each row into a separate sub-table, represented as a Document.

#### __init__

```python
__init__(
    row_split_threshold: int | None = 2,
    column_split_threshold: int | None = 2,
    read_csv_kwargs: dict[str, Any] | None = None,
    split_mode: SplitMode = "threshold",
) -> None
```

Initializes the CSVDocumentSplitter component.

**Parameters:**

- **row_split_threshold** (<code>int | None</code>) – The minimum number of consecutive empty rows required to trigger a split.
- **column_split_threshold** (<code>int | None</code>) – The minimum number of consecutive empty columns required to trigger a split.
- **read_csv_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to pass to `pandas.read_csv`.
  By default, the component reads the CSV with the following options:
  - `header=None`
  - `skip_blank_lines=False` to preserve blank lines
  - `dtype=object` to prevent type inference (e.g., converting numbers to floats).

  See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
- **split_mode** (<code>SplitMode</code>) – If `threshold`, the component splits the document based on the number of
  consecutive empty rows or columns that exceed `row_split_threshold` or `column_split_threshold`.
  If `row-wise`, the component splits each row into a separate sub-table.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Processes and splits a list of CSV documents into multiple sub-tables.

**Splitting Process:**

1. Applies a row-based split if `row_split_threshold` is provided.
1. Applies a column-based split if `column_split_threshold` is provided.
1. If both thresholds are specified, performs a recursive split by rows first, then columns, ensuring
   further fragmentation of any sub-tables that still contain empty sections.
1. Sorts the resulting sub-tables based on their original positions within the document.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of Documents containing CSV-formatted content.
  Each document is assumed to contain one or more tables separated by empty rows or columns.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with a key `"documents"`, mapping to a list of new `Document` objects,
  each representing an extracted sub-table from the original CSV.
  The metadata of each document includes:
  - A field `source_id` to track the original document.
  - A field `row_idx_start` to indicate the starting row index of the sub-table in the original table.
  - A field `col_idx_start` to indicate the starting column index of the sub-table in the original table.
  - A field `split_id` to indicate the order of the split in the original document.
  - All other metadata copied from the original document.

- If a document cannot be processed, it is returned unchanged.

- The `meta` field from the original document is preserved in the split documents.

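A minimal sketch of the threshold-based row split described above, in plain Python. This is illustrative only: the actual component works on pandas DataFrames, also splits by columns, and tracks sub-table positions; the helper name `split_rows` is an assumption for the example.

```python
def split_rows(table, threshold=2):
    """Split a table at runs of `threshold` or more fully empty rows."""
    blocks, current, pending_empties = [], [], []
    for row in table:
        if all(cell == "" for cell in row):
            pending_empties.append(row)
            continue
        if len(pending_empties) >= threshold and current:
            blocks.append(current)   # long empty run: start a new sub-table
            current = []
        elif pending_empties and current:
            current.extend(pending_empties)  # short gap: keep it inside the block
        pending_empties = []
        current.append(row)
    if current:
        blocks.append(current)
    return blocks


table = [["a"], ["b"], [""], [""], ["c"]]
print(split_rows(table, threshold=2))
# → [[['a'], ['b']], [['c']]]
```

A single empty row (below the threshold) is kept inside its surrounding block rather than triggering a split.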
## document_cleaner

### DocumentCleaner

Cleans the text in the documents.

It removes extra whitespace, empty lines, specified substrings, regex matches,
and page headers and footers (in this order).

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This   is  a  document  to  clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```

#### __init__

```python
__init__(
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_repeated_substrings: bool = False,
    keep_id: bool = False,
    remove_substrings: list[str] | None = None,
    remove_regex: str | None = None,
    unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
    ascii_only: bool = False,
    strip_whitespaces: bool = False,
    replace_regexes: dict[str, str] | None = None,
) -> None
```

Initialize DocumentCleaner.

**Parameters:**

- **remove_empty_lines** (<code>bool</code>) – If `True`, removes empty lines.
- **remove_extra_whitespaces** (<code>bool</code>) – If `True`, removes extra whitespaces.
- **remove_repeated_substrings** (<code>bool</code>) – If `True`, removes repeated substrings (headers and footers) from pages.
  Pages must be separated by a form feed character "\\f",
  which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- **remove_substrings** (<code>list\[str\] | None</code>) – List of substrings to remove from the text.
- **remove_regex** (<code>str | None</code>) – Regex pattern whose matches are replaced with "".
- **keep_id** (<code>bool</code>) – If `True`, keeps the IDs of the original documents.
- **unicode_normalization** (<code>Literal['NFC', 'NFKC', 'NFD', 'NFKD'] | None</code>) – Unicode normalization form to apply to the text.
  Note: This runs before any other steps.
- **ascii_only** (<code>bool</code>) – Whether to convert the text to ASCII only.
  Removes accents from characters and replaces them with ASCII characters.
  Other non-ASCII characters are removed.
  Note: This runs before any pattern matching or removal.
- **strip_whitespaces** (<code>bool</code>) – If `True`, removes leading and trailing whitespace from the document content
  using Python's `str.strip()`. Unlike `remove_extra_whitespaces`, this only affects the beginning
  and end of the text, preserving internal whitespace (useful for markdown formatting).
- **replace_regexes** (<code>dict\[str, str\] | None</code>) – A dictionary mapping regex patterns to their replacement strings.
  For example, `{r'\n\n+': '\n'}` replaces multiple consecutive newlines with a single newline.
  This is applied after `remove_regex` and allows custom replacements instead of just removal.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans up the documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to clean.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
  - `documents`: List of cleaned Documents.

**Raises:**

- <code>TypeError</code> – If `documents` is not a list of Documents.

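The ordering notes above (Unicode normalization first, then ASCII folding, then pattern removal and replacement) can be illustrated with a small stdlib sketch. This is not the component's actual implementation; the function name `clean_text` and its reduced parameter set are assumptions for the example.

```python
import re
import unicodedata


def clean_text(text, *, normalization=None, ascii_only=False,
               remove_substrings=(), replace_regexes=None):
    # 1. Unicode normalization runs before any other step.
    if normalization:
        text = unicodedata.normalize(normalization, text)
    # 2. ASCII folding runs before any pattern matching or removal.
    if ascii_only:
        text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # 3. Plain substring removal.
    for s in remove_substrings:
        text = text.replace(s, "")
    # 4. Regex-based replacement (e.g., collapse repeated newlines).
    for pattern, repl in (replace_regexes or {}).items():
        text = re.sub(pattern, repl, text)
    return text


print(clean_text("Café  déjà\n\n\nvu", ascii_only=True,
                 replace_regexes={r"\n\n+": "\n"}))
```

Because folding runs before pattern matching, patterns only ever need to match the ASCII form of the text.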
## document_preprocessor

### DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component combines a DocumentSplitter followed by a DocumentCleaner in a single pipeline.
It takes a list of documents as input and returns a processed list of documents.

Usage example:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```

#### __init__

```python
__init__(
    *,
    split_by: Literal[
        "function", "page", "passage", "period", "word", "line", "sentence"
    ] = "word",
    split_length: int = 250,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_repeated_substrings: bool = False,
    keep_id: bool = False,
    remove_substrings: list[str] | None = None,
    remove_regex: str | None = None,
    unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
    ascii_only: bool = False
) -> None
```

Initialize a DocumentPreprocessor that first splits and then cleans documents.

**Parameters:**

**Splitter parameters:**

- **split_by** (<code>Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence']</code>) – The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
- **split_length** (<code>int</code>) – The maximum number of units (words, lines, pages, and so on) in each split.
- **split_overlap** (<code>int</code>) – The number of overlapping units between consecutive splits.
- **split_threshold** (<code>int</code>) – The minimum number of units per split. If a split is smaller than this, it's merged
  with the previous split.
- **splitting_function** (<code>Callable\[\[str\], list\[str\]\] | None</code>) – A custom function for splitting if `split_by="function"`.
- **respect_sentence_boundary** (<code>bool</code>) – If `True`, splits by words but tries not to break inside a sentence.
- **language** (<code>Language</code>) – Language used by the sentence tokenizer if `split_by="sentence"` or
  `respect_sentence_boundary=True`.
- **use_split_rules** (<code>bool</code>) – Whether to apply additional splitting heuristics for the sentence splitter.
- **extend_abbreviations** (<code>bool</code>) – Whether to extend the sentence splitter with curated abbreviations for certain
  languages.

**Cleaner parameters:**

- **remove_empty_lines** (<code>bool</code>) – If `True`, removes empty lines.
- **remove_extra_whitespaces** (<code>bool</code>) – If `True`, removes extra whitespaces.
- **remove_repeated_substrings** (<code>bool</code>) – If `True`, removes repeated substrings like headers/footers across pages.
- **keep_id** (<code>bool</code>) – If `True`, keeps the original document IDs.
- **remove_substrings** (<code>list\[str\] | None</code>) – A list of strings to remove from the document content.
- **remove_regex** (<code>str | None</code>) – A regex pattern whose matches will be removed from the document content.
- **unicode_normalization** (<code>Literal['NFC', 'NFKC', 'NFD', 'NFKD'] | None</code>) – Unicode normalization form to apply to the text, for example `"NFC"`.
- **ascii_only** (<code>bool</code>) – If `True`, converts text to ASCII only.

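The split-then-clean flow can be sketched in plain Python. This is illustrative only: the real component wires a DocumentSplitter and a DocumentCleaner into a Haystack pipeline, and the function name `preprocess` is an assumption for the example.

```python
def preprocess(text, split_length=4, split_overlap=1):
    """Split text into overlapping word windows, then clean each chunk."""
    words = text.split()
    step = split_length - split_overlap
    # Split: greedy word windows with `split_overlap` words of overlap.
    chunks = [" ".join(words[i:i + split_length])
              for i in range(0, len(words), step)]
    # Clean: strip leading/trailing whitespace and drop empty chunks.
    return [c.strip() for c in chunks if c.strip()]


print(preprocess("one two three four five six seven",
                 split_length=4, split_overlap=1))
# → ['one two three four', 'four five six seven', 'seven']
```

Running the split first means the cleaner sees each chunk independently, which is the order the SuperComponent enforces.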
#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the SuperComponent to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DocumentPreprocessor
```

Deserializes the SuperComponent from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>DocumentPreprocessor</code> – Deserialized SuperComponent.

## document_splitter

### DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations
and prevents exceeding language model context limits.

The DocumentSplitter is compatible with the following DocumentStores:

- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) (limited support: overlapping information is
  not stored)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) (limited support: overlapping
  information is not stored)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
```

#### __init__

```python
__init__(
    split_by: Literal[
        "function", "page", "passage", "period", "word", "line", "sentence"
    ] = "word",
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    *,
    skip_empty_documents: bool = True
) -> None
```

Initialize DocumentSplitter.

**Parameters:**

- **split_by** (<code>Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence']</code>) – The unit for splitting your documents. Choose from:
  - `word` for splitting by spaces (" ")
  - `period` for splitting by periods (".")
  - `page` for splitting by form feed ("\\f")
  - `passage` for splitting by double line breaks ("\\n\\n")
  - `line` for splitting each line ("\\n")
  - `sentence` for splitting by the NLTK sentence tokenizer
  - `function` for splitting with a custom function passed in `splitting_function`
- **split_length** (<code>int</code>) – The maximum number of units in each split.
- **split_overlap** (<code>int</code>) – The number of overlapping units for each split.
- **split_threshold** (<code>int</code>) – The minimum number of units per split. If a split has fewer units
  than the threshold, it's attached to the previous split.
- **splitting_function** (<code>Callable\[\[str\], list\[str\]\] | None</code>) – Necessary when `split_by` is set to "function".
  This is a function which must accept a single `str` as input and return a `list` of `str` as output,
  representing the chunks after splitting.
- **respect_sentence_boundary** (<code>bool</code>) – Choose whether to respect sentence boundaries when splitting by "word".
  If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
- **language** (<code>Language</code>) – Choose the language for the NLTK tokenizer. The default is English ("en").
- **use_split_rules** (<code>bool</code>) – Choose whether to use additional split rules when splitting by `sentence`.
- **extend_abbreviations** (<code>bool</code>) – Choose whether to extend NLTK's PunktTokenizer abbreviations with a list
  of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
- **skip_empty_documents** (<code>bool</code>) – Choose whether to skip documents with empty content. Default is True.
  Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
  from non-textual documents.

#### warm_up

```python
warm_up() -> None
```

Warm up the DocumentSplitter by loading the sentence tokenizer.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Split documents into smaller parts.

Splits documents by the unit expressed in `split_by`, with a length of `split_length`
and an overlap of `split_overlap`.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
  - `documents`: List of documents with the split texts. Each document includes:
    - A metadata field `source_id` to track the original document.
    - A metadata field `page_number` to track the original page number.
    - All other metadata copied from the original document.

**Raises:**

- <code>TypeError</code> – If the input is not a list of Documents.
- <code>ValueError</code> – If the content of a document is None.

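The `split_threshold` behavior described above (a too-small final split is attached to the previous one) can be sketched in a few lines. This is an illustrative simplification, not the component's implementation; the helper name `split_with_threshold` is an assumption for the example.

```python
def split_with_threshold(units, split_length=3, split_threshold=2):
    """Chunk `units`, merging a too-small final chunk into the previous one."""
    chunks = [units[i:i + split_length] for i in range(0, len(units), split_length)]
    # If the last chunk has fewer units than the threshold, attach it
    # to the previous chunk instead of emitting a tiny split.
    if len(chunks) > 1 and len(chunks[-1]) < split_threshold:
        chunks[-2].extend(chunks.pop())
    return chunks


print(split_with_threshold(["a", "b", "c", "d"], split_length=3, split_threshold=2))
# → [['a', 'b', 'c', 'd']]
```

With `split_threshold=0` (the default) no merging happens and the trailing chunk is kept as-is.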
#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DocumentSplitter
```

Deserializes the component from a dictionary.

## embedding_based_document_splitter

### EmbeddingBasedDocumentSplitter

Splits documents based on embedding similarity using cosine distances between sequential sentence groups.

This component first splits text into sentences, optionally groups them, calculates embeddings for each group,
and then uses the cosine distance between sequential embeddings to determine split points. Any distance above
the specified percentile is treated as a break point. The component also tracks page numbers based on form feed
characters ("\\f") in the original document.

This component is inspired by [5 Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,      # Group 2 sentences before calculating embeddings
    percentile=0.95,            # Split when cosine distance exceeds 95th percentile
    min_length=50,              # Merge splits shorter than 50 characters
    max_length=1000             # Further split chunks longer than 1000 characters
)
result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk
# Each split document includes metadata: source_id, split_id, and page_number
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")
```

#### __init__

```python
__init__(
    *,
    document_embedder: DocumentEmbedder,
    sentences_per_group: int = 3,
    percentile: float = 0.95,
    min_length: int = 50,
    max_length: int = 1000,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True
) -> None
```

Initialize EmbeddingBasedDocumentSplitter.

**Parameters:**

- **document_embedder** (<code>DocumentEmbedder</code>) – The DocumentEmbedder to use for calculating embeddings.
- **sentences_per_group** (<code>int</code>) – Number of sentences to group together before embedding.
- **percentile** (<code>float</code>) – Percentile threshold for cosine distance. Distances above this percentile
  are treated as break points.
- **min_length** (<code>int</code>) – Minimum length of splits in characters. Splits below this length are merged.
- **max_length** (<code>int</code>) – Maximum length of splits in characters. Splits above this length are recursively split.
- **language** (<code>Language</code>) – Language for sentence tokenization.
- **use_split_rules** (<code>bool</code>) – Whether to use additional split rules for sentence tokenization. Applies additional
  split rules from SentenceSplitter to the sentence spans.
- **extend_abbreviations** (<code>bool</code>) – If True, the abbreviations used by NLTK's PunktTokenizer are extended by a list
  of curated abbreviations. Currently supported languages are: en, de.
  If False, the default abbreviations are used.

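The break-point rule described above can be sketched without an embedder: compute cosine distances between consecutive embeddings and mark any distance above the given percentile as a break point. This is an illustration of the rule, not the component's implementation; the helper names and the simple percentile computation are assumptions for the example.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)


def find_break_points(embeddings, percentile=0.95):
    """Indices i where the distance between group i and i+1 exceeds the percentile."""
    distances = [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]
    threshold = sorted(distances)[int(percentile * (len(distances) - 1))]
    return [i for i, d in enumerate(distances) if d > threshold]


# Two "topics": the first two vectors point one way, the last two another.
groups = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
print(find_break_points(groups, percentile=0.95))
# → [1]  (split between the second and third sentence group)
```

Each break point becomes a split boundary; the resulting chunks are then merged or re-split according to `min_length` and `max_length`.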
#### warm_up

```python
warm_up() -> None
```

Warm up the component by initializing the sentence splitter.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Split documents based on embedding similarity.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
  - `documents`: List of documents with the split texts. Each document includes:
    - A metadata field `source_id` to track the original document.
    - A metadata field `split_id` to track the split number.
    - A metadata field `page_number` to track the original page number.
    - All other metadata copied from the original document.

**Raises:**

- <code>RuntimeError</code> – If the component wasn't warmed up.
- <code>TypeError</code> – If the input is not a list of Documents.
- <code>ValueError</code> – If the document content is None or empty.

#### run_async

```python
run_async(documents: list[Document]) -> dict[str, list[Document]]
```

Asynchronously split documents based on embedding similarity.

This is the asynchronous version of the `run` method with the same parameters and return values.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
  - `documents`: List of documents with the split texts. Each document includes:
    - A metadata field `source_id` to track the original document.
    - A metadata field `split_id` to track the split number.
    - A metadata field `page_number` to track the original page number.
    - All other metadata copied from the original document.

**Raises:**

- <code>RuntimeError</code> – If the component wasn't warmed up.
- <code>TypeError</code> – If the input is not a list of Documents.
- <code>ValueError</code> – If the document content is None or empty.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Serialized dictionary representation of the component.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> EmbeddingBasedDocumentSplitter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize and create the component.

**Returns:**

- <code>EmbeddingBasedDocumentSplitter</code> – The deserialized component.

## hierarchical_document_splitter

### HierarchicalDocumentSplitter

Splits a document into blocks of different sizes, building a hierarchical tree structure.

The root node of the tree is the original document and the leaf nodes are the smallest blocks. The blocks in between
are connected such that smaller blocks are children of larger parent blocks.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
# >> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
# >> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
# >> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
# >> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
# >> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
# >> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
# >> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```

#### __init__

```python
__init__(
    block_sizes: set[int],
    split_overlap: int = 0,
    split_by: Literal["word", "sentence", "page", "passage"] = "word",
) -> None
```

Initialize HierarchicalDocumentSplitter.

**Parameters:**

- **block_sizes** (<code>set\[int\]</code>) – Set of block sizes to split the document into. The blocks are split in descending order.
- **split_overlap** (<code>int</code>) – The number of overlapping units for each split.
- **split_by** (<code>Literal['word', 'sentence', 'page', 'passage']</code>) – The unit for splitting your documents.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Builds a hierarchical document structure for each document in a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to split into hierarchical blocks.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the list of hierarchical Documents under the key "documents".

#### build_hierarchy_from_doc

```python
build_hierarchy_from_doc(document: Document) -> list[Document]
```

Build a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes represented
as HierarchicalDocument objects.

**Parameters:**

- **document** (<code>Document</code>) – Document to split into hierarchical blocks.

**Returns:**

- <code>list\[Document\]</code> – List of HierarchicalDocument objects.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Returns a dictionary representation of the component.

**Returns:**

- <code>dict\[str, Any\]</code> – Serialized dictionary representation of the component.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HierarchicalDocumentSplitter
```

Deserializes this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize and create the component.

**Returns:**

- <code>HierarchicalDocumentSplitter</code> – The deserialized component.

753  ## markdown_header_splitter
754  
755  ### MarkdownHeaderSplitter
756  
757  Split documents at ATX-style Markdown headers (#), with optional secondary splitting.
758  
759  This component processes text documents by:
760  
- Splitting them into chunks at Markdown headers ('#', '##', and so on), preserving header hierarchy as metadata.
- Optionally applying a secondary split (by word, passage, period, or line) to each chunk
  (using Haystack's DocumentSplitter).
764  - Preserving and propagating metadata such as parent headers, page numbers, and split IDs.
765  
766  #### __init__
767  
768  ```python
769  __init__(
770      *,
771      page_break_character: str = "\x0c",
772      keep_headers: bool = True,
773      header_split_levels: list[int] | None = None,
774      secondary_split: Literal["word", "passage", "period", "line"] | None = None,
775      split_length: int = 200,
776      split_overlap: int = 0,
777      split_threshold: int = 0,
778      skip_empty_documents: bool = True
779  ) -> None
780  ```
781  
782  Initialize the MarkdownHeaderSplitter.
783  
784  **Parameters:**
785  
- **page_break_character** (<code>str</code>) – Character used to identify page breaks. Defaults to the form feed character (`"\x0c"`).
787  - **keep_headers** (<code>bool</code>) – If True, headers are kept in the content. If False, headers are moved to metadata.
788    Defaults to True.
789  - **header_split_levels** (<code>list\[int\] | None</code>) – List of header levels (1–6) to split on. For example, `[1, 2]` splits only
790    on `#` and `##` headers, merging content under deeper headers into the preceding chunk. Defaults to
791    all levels `[1, 2, 3, 4, 5, 6]`.
792  - **secondary_split** (<code>Literal['word', 'passage', 'period', 'line'] | None</code>) – Optional secondary split condition after header splitting.
793    Options are None, "word", "passage", "period", "line". Defaults to None.
794  - **split_length** (<code>int</code>) – The maximum number of units in each split when using secondary splitting. Defaults to 200.
795  - **split_overlap** (<code>int</code>) – The number of overlapping units for each split when using secondary splitting.
796    Defaults to 0.
797  - **split_threshold** (<code>int</code>) – The minimum number of units per split when using secondary splitting. Defaults to 0.
798  - **skip_empty_documents** (<code>bool</code>) – Choose whether to skip documents with empty content. Default is True.
799    Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
800    from non-textual documents.
801  
802  #### warm_up
803  
804  ```python
805  warm_up() -> None
806  ```
807  
808  Warm up the MarkdownHeaderSplitter.
809  
810  #### run
811  
812  ```python
813  run(documents: list[Document]) -> dict[str, list[Document]]
814  ```
815  
816  Run the markdown header splitter with optional secondary splitting.
817  
818  **Parameters:**
819  
820  - **documents** (<code>list\[Document\]</code>) – List of documents to split
821  
822  **Returns:**
823  
- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
  - `documents`: List of documents with the split texts. Each document includes:
    - A metadata field `source_id` to track the original document.
    - A metadata field `page_number` to track the original page number.
    - A metadata field `split_id` to identify the split chunk index within its parent document.
    - All other metadata copied from the original document.
830  
831  **Raises:**
832  
833  - <code>ValueError</code> – If a document has `None` content.
834  - <code>TypeError</code> – If a document's content is not a string.
835  
836  ## recursive_splitter
837  
838  ### RecursiveDocumentSplitter
839  
840  Recursively chunk text into smaller chunks.
841  
This component splits text into smaller chunks by recursively applying a list of separators.

The separators are applied in the order they are provided, from the most general to the most specific, with the last separator being the most specific one.

Each separator is applied to the text in turn. Chunks that fit within the split_length are kept; chunks that are still larger than the split_length are split again using the next separator in the list.

This continues until all chunks are smaller than the split_length parameter.
853  
854  Example:
855  
856  ```python
857  from haystack import Document
858  from haystack.components.preprocessors import RecursiveDocumentSplitter
859  
860  chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
861  text = ('''Artificial intelligence (AI) - Introduction
862  
863  AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
864  AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
865  doc = Document(content=text)
866  doc_chunks = chunker.run([doc])
867  print(doc_chunks["documents"])
868  # [
869  # Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
870  # Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
871  # Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
872  # Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
873  # ]
874  ```
875  
876  #### __init__
877  
878  ```python
879  __init__(
880      *,
881      split_length: int = 200,
882      split_overlap: int = 0,
883      split_unit: Literal["word", "char", "token"] = "word",
884      separators: list[str] | None = None,
885      sentence_splitter_params: dict[str, Any] | None = None
886  ) -> None
887  ```
888  
889  Initializes a RecursiveDocumentSplitter.
890  
891  **Parameters:**
892  
- **split_length** (<code>int</code>) – The maximum length of each chunk, in words by default; characters or tokens can
  be used instead. See the `split_unit` parameter.
- **split_overlap** (<code>int</code>) – The number of units (as defined by `split_unit`) to overlap between consecutive chunks.
896  - **split_unit** (<code>Literal['word', 'char', 'token']</code>) – The unit of the split_length parameter. It can be either "word", "char", or "token".
897    If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base).
- **separators** (<code>list\[str\] | None</code>) – An optional list of separator strings to use for splitting the text. The
  separators are treated as regular expressions, except for the special separator "sentence", in which case the
  text is split into sentences using a custom sentence tokenizer based on NLTK.
  See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter.
  If no separators are provided, the default separators ["\\n\\n", "sentence", "\\n", " "] are used.
903  - **sentence_splitter_params** (<code>dict\[str, Any\] | None</code>) – Optional parameters to pass to the sentence tokenizer.
904    See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.
905  
906  **Raises:**
907  
- <code>ValueError</code> – If the overlap is negative or greater than or equal to the chunk size, or if any separator
  is not a string.
910  
911  #### warm_up
912  
913  ```python
914  warm_up() -> None
915  ```
916  
917  Warm up the sentence tokenizer and tiktoken tokenizer if needed.
918  
919  #### run
920  
921  ```python
922  run(documents: list[Document]) -> dict[str, list[Document]]
923  ```
924  
925  Split a list of documents into documents with smaller chunks of text.
926  
927  **Parameters:**
928  
929  - **documents** (<code>list\[Document\]</code>) – List of Documents to split.
930  
931  **Returns:**
932  
- <code>dict\[str, list\[Document\]\]</code> – A dictionary with a "documents" key containing the input documents split
  into smaller chunks of text.
935  
936  ## text_cleaner
937  
938  ### TextCleaner
939  
940  Cleans text strings.
941  
942  It can remove substrings matching a list of regular expressions, convert text to lowercase,
943  remove punctuation, and remove numbers.
944  Use it to clean up text data before evaluation.
945  
946  ### Usage example
947  
948  ```python
949  from haystack.components.preprocessors import TextCleaner
950  
951  text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."
952  
953  cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
954  result = cleaner.run(texts=[text_to_clean])
955  ```
956  
957  #### __init__
958  
959  ```python
960  __init__(
961      remove_regexps: list[str] | None = None,
962      convert_to_lowercase: bool = False,
963      remove_punctuation: bool = False,
964      remove_numbers: bool = False,
965  ) -> None
966  ```
967  
968  Initializes the TextCleaner component.
969  
970  **Parameters:**
971  
972  - **remove_regexps** (<code>list\[str\] | None</code>) – A list of regex patterns to remove matching substrings from the text.
973  - **convert_to_lowercase** (<code>bool</code>) – If `True`, converts all characters to lowercase.
974  - **remove_punctuation** (<code>bool</code>) – If `True`, removes punctuation from the text.
975  - **remove_numbers** (<code>bool</code>) – If `True`, removes numerical digits from the text.
976  
977  #### run
978  
979  ```python
980  run(texts: list[str]) -> dict[str, Any]
981  ```
982  
983  Cleans up the given list of strings.
984  
985  **Parameters:**
986  
987  - **texts** (<code>list\[str\]</code>) – List of strings to clean.
988  
989  **Returns:**
990  
- <code>dict\[str, Any\]</code> – A dictionary with the following key:
  - `texts`: the cleaned list of strings.