---
title: "PreProcessors"
id: preprocessors-api
description: "Preprocess your Documents and texts. Clean, split, and more."
slug: "/preprocessors-api"
---

## csv_document_cleaner

### CSVDocumentCleaner

A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents. You can optionally
ignore a specified number of leading rows and columns before cleaning, keep the
original document IDs, and control whether empty rows and columns are removed.
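
### Usage example

A minimal sketch based on the parameters documented below; the exact cleaned output depends on your CSV content:

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentCleaner

# A small CSV table with one fully empty column and one fully empty row
csv_content = "col1,,col3\na,,c\n,,\nd,,f\n"
doc = Document(content=csv_content)

cleaner = CSVDocumentCleaner(remove_empty_rows=True, remove_empty_columns=True)
result = cleaner.run(documents=[doc])
# result["documents"][0].content holds the cleaned CSV text
```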

#### __init__

```python
__init__(
    *,
    ignore_rows: int = 0,
    ignore_columns: int = 0,
    remove_empty_rows: bool = True,
    remove_empty_columns: bool = True,
    keep_id: bool = False
) -> None
```

Initializes the CSVDocumentCleaner component.

**Parameters:**

- **ignore_rows** (<code>int</code>) – Number of rows to ignore from the top of the CSV table before processing.
- **ignore_columns** (<code>int</code>) – Number of columns to ignore from the left of the CSV table before processing.
- **remove_empty_rows** (<code>bool</code>) – Whether to remove rows that are entirely empty.
- **remove_empty_columns** (<code>bool</code>) – Whether to remove columns that are entirely empty.
- **keep_id** (<code>bool</code>) – Whether to retain the original document ID in the output document.

Rows and columns ignored using these parameters are preserved in the final output, meaning
they are not considered when removing empty rows and columns.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans CSV documents by removing empty rows and columns while preserving specified ignored rows and columns.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents containing CSV-formatted content.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with a list of cleaned Documents under the key "documents".

Processing steps:

1. Reads each document's content as a CSV table.
2. Retains the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
3. Drops any rows and columns that are entirely empty (if enabled by `remove_empty_rows` and
   `remove_empty_columns`).
4. Reattaches the ignored rows and columns to maintain their original positions.
5. Returns the cleaned CSV content as a new `Document` object, with an option to retain the original
   document ID.

## csv_document_splitter

### CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:

- `threshold`: identifies consecutive empty rows or columns that exceed a given threshold
  and uses them as delimiters to segment the document into smaller tables.
- `row-wise`: splits each row into a separate sub-table, represented as a Document.

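### Usage example

A minimal sketch of threshold-based splitting; the sample data is illustrative, and the exact sub-tables produced depend on your content:

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentSplitter

# Two small tables separated by two empty rows (the default row_split_threshold)
csv_content = "a,b\nc,d\n,\n,\ne,f\ng,h\n"
doc = Document(content=csv_content)

splitter = CSVDocumentSplitter(row_split_threshold=2)
result = splitter.run(documents=[doc])
# result["documents"] should contain one Document per detected sub-table
```
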
#### __init__

```python
__init__(
    row_split_threshold: int | None = 2,
    column_split_threshold: int | None = 2,
    read_csv_kwargs: dict[str, Any] | None = None,
    split_mode: SplitMode = "threshold",
) -> None
```

Initializes the CSVDocumentSplitter component.

**Parameters:**

- **row_split_threshold** (<code>int | None</code>) – The minimum number of consecutive empty rows required to trigger a split.
- **column_split_threshold** (<code>int | None</code>) – The minimum number of consecutive empty columns required to trigger a split.
- **read_csv_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to pass to `pandas.read_csv`.
  By default, the component reads CSV content with `header=None`, `skip_blank_lines=False`
  (to preserve blank lines), and `dtype=object` (to prevent type inference, such as converting
  numbers to floats).
  See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
- **split_mode** (<code>SplitMode</code>) – If `threshold`, the component will split the document based on the number of
  consecutive empty rows or columns that exceed the `row_split_threshold` or `column_split_threshold`.
  If `row-wise`, the component will split each row into a separate sub-table.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Processes and splits a list of CSV documents into multiple sub-tables.

**Splitting Process:**

1. Applies a row-based split if `row_split_threshold` is provided.
2. Applies a column-based split if `column_split_threshold` is provided.
3. If both thresholds are specified, performs a recursive split by rows first, then columns, ensuring
   further fragmentation of any sub-tables that still contain empty sections.
4. Sorts the resulting sub-tables based on their original positions within the document.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of Documents containing CSV-formatted content.
  Each document is assumed to contain one or more tables separated by empty rows or columns.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with a key `"documents"`, mapping to a list of new `Document` objects,
  each representing an extracted sub-table from the original CSV.
  The metadata of each document includes:
  - A field `source_id` to track the original document.
  - A field `row_idx_start` to indicate the starting row index of the sub-table in the original table.
  - A field `col_idx_start` to indicate the starting column index of the sub-table in the original table.
  - A field `split_id` to indicate the order of the split in the original document.
  - All other metadata copied from the original document.

- If a document cannot be processed, it is returned unchanged.

- The `meta` field from the original document is preserved in the split documents.

## document_cleaner

### DocumentCleaner

Cleans the text in the documents.

It removes extra whitespaces, empty lines, specified substrings, regex matches,
and page headers and footers (in this order).

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This   is  a  document  to  clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```

#### __init__

```python
__init__(
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_repeated_substrings: bool = False,
    keep_id: bool = False,
    remove_substrings: list[str] | None = None,
    remove_regex: str | None = None,
    unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
    ascii_only: bool = False,
    strip_whitespaces: bool = False,
    replace_regexes: dict[str, str] | None = None,
)
```

Initialize DocumentCleaner.

**Parameters:**

- **remove_empty_lines** (<code>bool</code>) – If `True`, removes empty lines.
- **remove_extra_whitespaces** (<code>bool</code>) – If `True`, removes extra whitespaces.
- **remove_repeated_substrings** (<code>bool</code>) – If `True`, removes repeated substrings (headers and footers) from pages.
  Pages must be separated by a form feed character "\\f",
  which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- **remove_substrings** (<code>list\[str\] | None</code>) – List of substrings to remove from the text.
- **remove_regex** (<code>str | None</code>) – Regex pattern whose matches are replaced with an empty string.
- **keep_id** (<code>bool</code>) – If `True`, keeps the IDs of the original documents.
- **unicode_normalization** (<code>Literal['NFC', 'NFKC', 'NFD', 'NFKD'] | None</code>) – Unicode normalization form to apply to the text.
  Note: This will run before any other steps.
- **ascii_only** (<code>bool</code>) – Whether to convert the text to ASCII only.
  Will remove accents from characters and replace them with ASCII characters.
  Other non-ASCII characters will be removed.
  Note: This will run before any pattern matching or removal.
- **strip_whitespaces** (<code>bool</code>) – If `True`, removes leading and trailing whitespace from the document content
  using Python's `str.strip()`. Unlike `remove_extra_whitespaces`, this only affects the beginning
  and end of the text, preserving internal whitespace (useful for markdown formatting).
- **replace_regexes** (<code>dict\[str, str\] | None</code>) – A dictionary mapping regex patterns to their replacement strings.
  For example, `{r'\n\n+': '\n'}` replaces multiple consecutive newlines with a single newline.
  This is applied after `remove_regex` and allows custom replacements instead of just removal.
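
For instance, the normalization and replacement options can be combined; this is a configuration
sketch using only the parameters documented above:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

cleaner = DocumentCleaner(
    remove_empty_lines=False,          # keep the line structure...
    remove_extra_whitespaces=False,
    unicode_normalization="NFKC",      # normalize Unicode before any other step
    strip_whitespaces=True,            # trim only leading/trailing whitespace
    replace_regexes={r"\n\n+": "\n"},  # ...but collapse runs of blank lines to one newline
)
result = cleaner.run(documents=[Document(content="  Héllo\n\n\n\nworld  ")])
```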

#### run

```python
run(documents: list[Document])
```

Cleans up the documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to clean.

**Returns:**

- – A dictionary with the following key:
- `documents`: List of cleaned Documents.

**Raises:**

- <code>TypeError</code> – if documents is not a list of Documents.

## document_preprocessor

### DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component consists of a DocumentSplitter followed by a DocumentCleaner in a single pipeline.
It takes a list of documents as input and returns a processed list of documents.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```

#### __init__

```python
__init__(
    *,
    split_by: Literal[
        "function", "page", "passage", "period", "word", "line", "sentence"
    ] = "word",
    split_length: int = 250,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_repeated_substrings: bool = False,
    keep_id: bool = False,
    remove_substrings: list[str] | None = None,
    remove_regex: str | None = None,
    unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
    ascii_only: bool = False
) -> None
```

Initialize a DocumentPreprocessor that first splits and then cleans documents.

**Splitter Parameters:**

- **split_by** (<code>Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence']</code>) – The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
- **split_length** (<code>int</code>) – The maximum number of units (words, lines, pages, and so on) in each split.
- **split_overlap** (<code>int</code>) – The number of overlapping units between consecutive splits.
- **split_threshold** (<code>int</code>) – The minimum number of units per split. If a split is smaller than this, it's merged
  with the previous split.
- **splitting_function** (<code>Callable\[\[str\], list\[str\]\] | None</code>) – A custom function for splitting if `split_by="function"`.
- **respect_sentence_boundary** (<code>bool</code>) – If `True`, splits by words but tries not to break inside a sentence.
- **language** (<code>Language</code>) – Language used by the sentence tokenizer if `split_by="sentence"` or
  `respect_sentence_boundary=True`.
- **use_split_rules** (<code>bool</code>) – Whether to apply additional splitting heuristics for the sentence splitter.
- **extend_abbreviations** (<code>bool</code>) – Whether to extend the sentence splitter with curated abbreviations for certain
  languages.

**Cleaner Parameters:**

- **remove_empty_lines** (<code>bool</code>) – If `True`, removes empty lines.
- **remove_extra_whitespaces** (<code>bool</code>) – If `True`, removes extra whitespaces.
- **remove_repeated_substrings** (<code>bool</code>) – If `True`, removes repeated substrings like headers/footers across pages.
- **keep_id** (<code>bool</code>) – If `True`, keeps the original document IDs.
- **remove_substrings** (<code>list\[str\] | None</code>) – A list of strings to remove from the document content.
- **remove_regex** (<code>str | None</code>) – A regex pattern whose matches will be removed from the document content.
- **unicode_normalization** (<code>Literal['NFC', 'NFKC', 'NFD', 'NFKD'] | None</code>) – Unicode normalization form to apply to the text, for example `"NFC"`.
- **ascii_only** (<code>bool</code>) – If `True`, converts text to ASCII only.
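
A configuration sketch combining the splitter and cleaner options above (the values are illustrative):

```python
from haystack.components.preprocessors import DocumentPreprocessor

preprocessor = DocumentPreprocessor(
    split_by="sentence",         # splitter settings
    split_length=5,
    split_overlap=1,
    remove_empty_lines=True,     # cleaner settings
    remove_extra_whitespaces=True,
)
```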

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize SuperComponent to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DocumentPreprocessor
```

Deserializes the SuperComponent from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>DocumentPreprocessor</code> – Deserialized SuperComponent.
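
A serialization round-trip sketch using the two methods above:

```python
from haystack.components.preprocessors import DocumentPreprocessor

preprocessor = DocumentPreprocessor(split_by="word", split_length=100)
data = preprocessor.to_dict()              # serialize the configured component
restored = DocumentPreprocessor.from_dict(data)  # rebuild an equivalent instance
```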

## document_splitter

### DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations
and prevents exceeding language model context limits.

The DocumentSplitter is compatible with the following DocumentStores:

- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) (limited support: overlapping information is
  not stored)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) (limited support: overlapping
  information is not stored)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
```

#### __init__

```python
__init__(
    split_by: Literal[
        "function", "page", "passage", "period", "word", "line", "sentence"
    ] = "word",
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    *,
    skip_empty_documents: bool = True
)
```

Initialize DocumentSplitter.

**Parameters:**

- **split_by** (<code>Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence']</code>) – The unit for splitting your documents. Choose from:
  - `word` for splitting by spaces (" ")
  - `period` for splitting by periods (".")
  - `page` for splitting by form feed ("\\f")
  - `passage` for splitting by double line breaks ("\\n\\n")
  - `line` for splitting each line ("\\n")
  - `sentence` for splitting by NLTK sentence tokenizer
  - `function` for splitting with a custom function passed in `splitting_function`
- **split_length** (<code>int</code>) – The maximum number of units in each split.
- **split_overlap** (<code>int</code>) – The number of overlapping units for each split.
- **split_threshold** (<code>int</code>) – The minimum number of units per split. If a split has fewer units
  than the threshold, it's attached to the previous split.
- **splitting_function** (<code>Callable\[\[str\], list\[str\]\] | None</code>) – Necessary when `split_by` is set to "function".
  This is a function which must accept a single `str` as input and return a `list` of `str` as output,
  representing the chunks after splitting.
- **respect_sentence_boundary** (<code>bool</code>) – Choose whether to respect sentence boundaries when splitting by "word".
  If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
- **language** (<code>Language</code>) – Choose the language for the NLTK tokenizer. The default is English ("en").
- **use_split_rules** (<code>bool</code>) – Choose whether to use additional split rules when splitting by `sentence`.
- **extend_abbreviations** (<code>bool</code>) – Choose whether to extend NLTK's PunktTokenizer abbreviations with a list
  of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
- **skip_empty_documents** (<code>bool</code>) – Choose whether to skip documents with empty content. Default is True.
  Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
  from non-textual documents.
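
For `split_by="function"`, any callable with the documented signature works; `split_on_semicolons`
below is a hypothetical helper used only for illustration:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

# Hypothetical splitting function: one chunk per non-empty semicolon-separated segment
def split_on_semicolons(text: str) -> list[str]:
    return [part for part in text.split(";") if part]

splitter = DocumentSplitter(split_by="function", splitting_function=split_on_semicolons)
result = splitter.run(documents=[Document(content="first part;second part;third part")])
```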

#### warm_up

```python
warm_up()
```

Warm up the DocumentSplitter by loading the sentence tokenizer.

#### run

```python
run(documents: list[Document])
```

Split documents into smaller parts.

Splits documents by the unit expressed in `split_by`, with a length of `split_length`
and an overlap of `split_overlap`.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The documents to split.

**Returns:**

- – A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

**Raises:**

- <code>TypeError</code> – if the input is not a list of Documents.
- <code>ValueError</code> – if the content of a document is None.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DocumentSplitter
```

Deserializes the component from a dictionary.

## embedding_based_document_splitter

### EmbeddingBasedDocumentSplitter

Splits documents based on embedding similarity using cosine distances between sequential sentence groups.

This component first splits text into sentences, optionally groups them, calculates embeddings for each group,
and then uses cosine distance between sequential embeddings to determine split points. Any distance above
the specified percentile is treated as a break point. The component also tracks page numbers based on form feed
characters (`\f`) in the original document.

This component is inspired by [5 Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,      # Group 2 sentences before calculating embeddings
    percentile=0.95,            # Split when cosine distance exceeds 95th percentile
    min_length=50,              # Merge splits shorter than 50 characters
    max_length=1000             # Further split chunks longer than 1000 characters
)
result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk
# Each split document includes metadata: source_id, split_id, and page_number
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")
```

#### __init__

```python
__init__(
    *,
    document_embedder: DocumentEmbedder,
    sentences_per_group: int = 3,
    percentile: float = 0.95,
    min_length: int = 50,
    max_length: int = 1000,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True
) -> None
```

Initialize EmbeddingBasedDocumentSplitter.

**Parameters:**

- **document_embedder** (<code>DocumentEmbedder</code>) – The DocumentEmbedder to use for calculating embeddings.
- **sentences_per_group** (<code>int</code>) – Number of sentences to group together before embedding.
- **percentile** (<code>float</code>) – Percentile threshold for cosine distance. Distances above this percentile
  are treated as break points.
- **min_length** (<code>int</code>) – Minimum length of splits in characters. Splits below this length will be merged.
- **max_length** (<code>int</code>) – Maximum length of splits in characters. Splits above this length will be recursively split.
- **language** (<code>Language</code>) – Language for sentence tokenization.
- **use_split_rules** (<code>bool</code>) – Whether to use additional split rules for sentence tokenization. Applies additional
  split rules from SentenceSplitter to the sentence spans.
- **extend_abbreviations** (<code>bool</code>) – If True, the abbreviations used by NLTK's PunktTokenizer are extended by a list
  of curated abbreviations. Currently supported languages are: en, de.
  If False, the default abbreviations are used.

#### warm_up

```python
warm_up() -> None
```

Warm up the component by initializing the sentence splitter.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Split documents based on embedding similarity.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `split_id` to track the split number.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

**Raises:**

- <code>RuntimeError</code> – If the component wasn't warmed up.
- <code>TypeError</code> – If the input is not a list of Documents.
- <code>ValueError</code> – If the document content is None or empty.

#### run_async

```python
run_async(documents: list[Document]) -> dict[str, list[Document]]
```

Asynchronously split documents based on embedding similarity.

This is the asynchronous version of the `run` method with the same parameters and return values.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `split_id` to track the split number.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

**Raises:**

- <code>RuntimeError</code> – If the component wasn't warmed up.
- <code>TypeError</code> – If the input is not a list of Documents.
- <code>ValueError</code> – If the document content is None or empty.
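
A minimal sketch of calling it from async code; it assumes `splitter` and `doc` were created (and the splitter warmed up) as in the usage example above:

```python
import asyncio

async def main() -> None:
    # Await the async splitter instead of calling run() synchronously
    result = await splitter.run_async(documents=[doc])
    print(f"Split into {len(result['documents'])} chunks")

asyncio.run(main())
```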

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Serialized dictionary representation of the component.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> EmbeddingBasedDocumentSplitter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize and create the component.

**Returns:**

- <code>EmbeddingBasedDocumentSplitter</code> – The deserialized component.

## hierarchical_document_splitter

### HierarchicalDocumentSplitter

Splits documents into blocks of different sizes, building a hierarchical tree structure.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between
are connected such that smaller blocks are children of the larger parent blocks.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```

#### __init__

```python
__init__(
    block_sizes: set[int],
    split_overlap: int = 0,
    split_by: Literal["word", "sentence", "page", "passage"] = "word",
)
```

Initialize HierarchicalDocumentSplitter.

**Parameters:**

- **block_sizes** (<code>set\[int\]</code>) – Set of block sizes to split the document into. The blocks are split in descending order.
- **split_overlap** (<code>int</code>) – The number of overlapping units for each split.
- **split_by** (<code>Literal['word', 'sentence', 'page', 'passage']</code>) – The unit for splitting your documents.

#### run

```python
run(documents: list[Document])
```

Builds a hierarchical document structure for each document in a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to split into hierarchical blocks.

**Returns:**

- – List of HierarchicalDocument

#### build_hierarchy_from_doc

```python
build_hierarchy_from_doc(document: Document) -> list[Document]
```

Build a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes represented
as HierarchicalDocument objects.

**Parameters:**

- **document** (<code>Document</code>) – Document to split into hierarchical blocks.

**Returns:**

- <code>list\[Document\]</code> – List of HierarchicalDocument

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Returns a dictionary representation of the component.

**Returns:**

- <code>dict\[str, Any\]</code> – Serialized dictionary representation of the component.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HierarchicalDocumentSplitter
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize and create the component.

**Returns:**

- <code>HierarchicalDocumentSplitter</code> – The deserialized component.

## markdown_header_splitter

### MarkdownHeaderSplitter

Split documents at ATX-style Markdown headers (#), with optional secondary splitting.

This component processes text documents by:

- Splitting them into chunks at Markdown headers (e.g., '#', '##', etc.), preserving header hierarchy as metadata.
- Optionally applying a secondary split (by word, passage, period, or line) to each chunk
  (using Haystack's DocumentSplitter).
- Preserving and propagating metadata such as parent headers, page numbers, and split IDs.

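### Usage example

A minimal sketch; the exact chunk boundaries and metadata follow the rules described above:

```python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

doc = Document(
    content="# Title\nIntro text.\n\n## Section 1\nFirst section.\n\n## Section 2\nSecond section."
)

splitter = MarkdownHeaderSplitter(keep_headers=True)
result = splitter.run(documents=[doc])
# One Document per header-delimited chunk, with the header hierarchy in metadata
```
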
#### __init__

```python
__init__(
    *,
    page_break_character: str = "\x0c",
    keep_headers: bool = True,
    secondary_split: Literal["word", "passage", "period", "line"] | None = None,
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
    skip_empty_documents: bool = True
)
```

Initialize the MarkdownHeaderSplitter.

**Parameters:**

- **page_break_character** (<code>str</code>) – Character used to identify page breaks. Defaults to form feed ("\\x0c").
- **keep_headers** (<code>bool</code>) – If True, headers are kept in the content. If False, headers are moved to metadata.
  Defaults to True.
- **secondary_split** (<code>Literal['word', 'passage', 'period', 'line'] | None</code>) – Optional secondary split condition after header splitting.
  Options are None, "word", "passage", "period", "line". Defaults to None.
- **split_length** (<code>int</code>) – The maximum number of units in each split when using secondary splitting. Defaults to 200.
- **split_overlap** (<code>int</code>) – The number of overlapping units for each split when using secondary splitting.
  Defaults to 0.
- **split_threshold** (<code>int</code>) – The minimum number of units per split when using secondary splitting. Defaults to 0.
- **skip_empty_documents** (<code>bool</code>) – Choose whether to skip documents with empty content. Default is True.
  Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
  from non-textual documents.

#### warm_up

```python
warm_up()
```

Warm up the MarkdownHeaderSplitter.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Run the markdown header splitter with optional secondary splitting.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `page_number` to track the original page number.
  - A metadata field `split_id` to identify the split chunk index within its parent document.
  - All other metadata copied from the original document.

**Raises:**

- <code>ValueError</code> – If a document has `None` content.
- <code>TypeError</code> – If a document's content is not a string.

## recursive_splitter

### RecursiveDocumentSplitter

Recursively chunk text into smaller chunks.

This component splits text into smaller chunks by recursively applying a list of separators.
The separators are applied in the order they are provided, with the last separator typically
being the most specific one.

Each separator is applied to the text, and each resulting chunk is checked: chunks that fit
within the split_length are kept, while chunks that are still too large are split again using
the next separator in the list. This continues until all chunks are smaller than the
split_length parameter.

Example:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

#### __init__

```python
__init__(
    *,
    split_length: int = 200,
    split_overlap: int = 0,
    split_unit: Literal["word", "char", "token"] = "word",
    separators: list[str] | None = None,
    sentence_splitter_params: dict[str, Any] | None = None
)
```

Initializes a RecursiveDocumentSplitter.

**Parameters:**

- **split_length** (<code>int</code>) – The maximum length of each chunk, by default in words, but it can also be in
  characters or tokens. See the `split_unit` parameter.
- **split_overlap** (<code>int</code>) – The number of characters to overlap between consecutive chunks.
- **split_unit** (<code>Literal['word', 'char', 'token']</code>) – The unit of the split_length parameter. It can be either "word", "char", or "token".
  If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base).
- **separators** (<code>list\[str\] | None</code>) – An optional list of separator strings to use for splitting the text. The string
  separators will be treated as regular expressions unless the separator is "sentence", in which case the
  text will be split into sentences using a custom sentence tokenizer based on NLTK.
  See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter.
  If no separators are provided, the default separators ["\\n\\n", "sentence", "\\n", " "] are used.
- **sentence_splitter_params** (<code>dict\[str, Any\] | None</code>) – Optional parameters to pass to the sentence tokenizer.
  See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.

**Raises:**

- <code>ValueError</code> – If the overlap is greater than or equal to the chunk size, if the overlap is negative, or
  if any separator is not a string.
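
A configuration sketch for token-based chunking, using only the parameters documented above
(the "token" unit relies on tiktoken, as noted):

```python
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(
    split_length=512,
    split_overlap=32,
    split_unit="token",  # chunk sizes counted with tiktoken's o200k_base tokenizer
    separators=["\n\n", "sentence", "\n", " "],  # the documented defaults, made explicit
)
chunker.warm_up()  # loads the sentence and tiktoken tokenizers before the first run
```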

#### warm_up

```python
warm_up() -> None
```

Warm up the sentence tokenizer and tiktoken tokenizer if needed.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Split a list of documents into documents with smaller chunks of text.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary containing a key "documents" with a List of Documents with smaller chunks of text corresponding
  to the input documents.

## text_cleaner

### TextCleaner

Cleans text strings.

It can remove substrings matching a list of regular expressions, convert text to lowercase,
remove punctuation, and remove numbers.
Use it to clean up text data before evaluation.

### Usage example

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
```

#### __init__

```python
__init__(
    remove_regexps: list[str] | None = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
)
```

Initializes the TextCleaner component.

**Parameters:**

- **remove_regexps** (<code>list\[str\] | None</code>) – A list of regex patterns to remove matching substrings from the text.
- **convert_to_lowercase** (<code>bool</code>) – If `True`, converts all characters to lowercase.
- **remove_punctuation** (<code>bool</code>) – If `True`, removes punctuation from the text.
- **remove_numbers** (<code>bool</code>) – If `True`, removes numerical digits from the text.
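
For instance, regex-based removal can be combined with the other flags; a small sketch using the
parameters above (the exact output spacing depends on what the patterns match):

```python
from haystack.components.preprocessors import TextCleaner

# Remove bracketed references like "[1]" before lowercasing
cleaner = TextCleaner(remove_regexps=[r"\[\d+\]"], convert_to_lowercase=True)
result = cleaner.run(texts=["Haystack [1] is an open-source framework [2]."])
# result["texts"] contains the lowercased text with the references stripped
```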

#### run

```python
run(texts: list[str]) -> dict[str, Any]
```

Cleans up the given list of strings.

**Parameters:**

- **texts** (<code>list\[str\]</code>) – List of strings to clean.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following key:
- `texts`: the cleaned list of strings.