---
title: "PreProcessors"
id: preprocessors-api
description: "Preprocess your Documents and texts. Clean, split, and more."
slug: "/preprocessors-api"
---

## csv_document_cleaner

### CSVDocumentCleaner

A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents, allowing for the optional ignoring of a specified number of rows and columns before performing the cleaning operation. Additionally, it provides options to keep document IDs and control whether empty rows and columns should be removed.

#### __init__

```python
__init__(
    *,
    ignore_rows: int = 0,
    ignore_columns: int = 0,
    remove_empty_rows: bool = True,
    remove_empty_columns: bool = True,
    keep_id: bool = False
) -> None
```

Initializes the CSVDocumentCleaner component.

**Parameters:**

- **ignore_rows** (<code>int</code>) – Number of rows to ignore from the top of the CSV table before processing.
- **ignore_columns** (<code>int</code>) – Number of columns to ignore from the left of the CSV table before processing.
- **remove_empty_rows** (<code>bool</code>) – Whether to remove rows that are entirely empty.
- **remove_empty_columns** (<code>bool</code>) – Whether to remove columns that are entirely empty.
- **keep_id** (<code>bool</code>) – Whether to retain the original document ID in the output document.

Rows and columns ignored using these parameters are preserved in the final output, meaning they are not considered when removing empty rows and columns.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans CSV documents by removing empty rows and columns while preserving specified ignored rows and columns.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents containing CSV-formatted content.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with a list of cleaned Documents under the key "documents".

Processing steps:

1. Reads each document's content as a CSV table.
1. Retains the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
1. Drops any rows and columns that are entirely empty (if enabled by `remove_empty_rows` and `remove_empty_columns`).
1. Reattaches the ignored rows and columns to maintain their original positions.
1. Returns the cleaned CSV content as a new `Document` object, with an option to retain the original document ID.
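
A minimal usage sketch (illustrative only; the exact CSV string formatting of the returned content may vary):

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentCleaner

# A CSV table with an entirely empty middle column and an entirely empty row
csv_content = "name,,age\nAlice,,30\n,,\nBob,,25\n"
doc = Document(content=csv_content)

cleaner = CSVDocumentCleaner(remove_empty_rows=True, remove_empty_columns=True)
result = cleaner.run(documents=[doc])

# The empty middle column and the blank row are dropped from the returned document
print(result["documents"][0].content)
```
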
## csv_document_splitter

### CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:

- identify consecutive empty rows or columns that exceed a given threshold and use them as delimiters to segment the document into smaller tables.
- split each row into a separate sub-table, represented as a Document.

#### __init__

```python
__init__(
    row_split_threshold: int | None = 2,
    column_split_threshold: int | None = 2,
    read_csv_kwargs: dict[str, Any] | None = None,
    split_mode: SplitMode = "threshold",
) -> None
```

Initializes the CSVDocumentSplitter component.

**Parameters:**

- **row_split_threshold** (<code>int | None</code>) – The minimum number of consecutive empty rows required to trigger a split.
- **column_split_threshold** (<code>int | None</code>) – The minimum number of consecutive empty columns required to trigger a split.
- **read_csv_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to pass to `pandas.read_csv`.
  By default, the component reads the CSV with the following options:
  - `header=None`
  - `skip_blank_lines=False` to preserve blank lines
  - `dtype=object` to prevent type inference (e.g., converting numbers to floats).
  See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
- **split_mode** (<code>SplitMode</code>) – If `threshold`, the component will split the document based on the number of consecutive empty rows or columns that exceed the `row_split_threshold` or `column_split_threshold`. If `row-wise`, the component will split each row into a separate sub-table.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Processes and splits a list of CSV documents into multiple sub-tables.

**Splitting Process:**

1. Applies a row-based split if `row_split_threshold` is provided.
1. Applies a column-based split if `column_split_threshold` is provided.
1. If both thresholds are specified, performs a recursive split by rows first, then columns, ensuring further fragmentation of any sub-tables that still contain empty sections.
1. Sorts the resulting sub-tables based on their original positions within the document.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of Documents containing CSV-formatted content.
  Each document is assumed to contain one or more tables separated by empty rows or columns.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with a key `"documents"`, mapping to a list of new `Document` objects, each representing an extracted sub-table from the original CSV.
  The metadata of each document includes:
  - A field `source_id` to track the original document.
  - A field `row_idx_start` to indicate the starting row index of the sub-table in the original table.
  - A field `col_idx_start` to indicate the starting column index of the sub-table in the original table.
  - A field `split_id` to indicate the order of the split in the original document.
  - All other metadata copied from the original document.
- If a document cannot be processed, it is returned unchanged.
- The `meta` field from the original document is preserved in the split documents.
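
A short usage sketch for the threshold mode (illustrative only; how many sub-tables you get depends on the thresholds and on where the empty regions sit):

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentSplitter

# Two small tables separated by two entirely empty rows
csv_content = "a,b\n1,2\n,\n,\nx,y\n3,4\n"
doc = Document(content=csv_content)

splitter = CSVDocumentSplitter(row_split_threshold=2, column_split_threshold=None)
result = splitter.run(documents=[doc])

# Each sub-table records where it started in the original table
for sub_table in result["documents"]:
    print(sub_table.meta.get("row_idx_start"), repr(sub_table.content))
```
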
## document_cleaner

### DocumentCleaner

Cleans the text in the documents.

It removes extra whitespaces, empty lines, specified substrings, regexes, page headers and footers (in this order).

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This is a document to clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings = ["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```

#### __init__

```python
__init__(
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_repeated_substrings: bool = False,
    keep_id: bool = False,
    remove_substrings: list[str] | None = None,
    remove_regex: str | None = None,
    unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
    ascii_only: bool = False,
    strip_whitespaces: bool = False,
    replace_regexes: dict[str, str] | None = None,
)
```

Initialize DocumentCleaner.

**Parameters:**

- **remove_empty_lines** (<code>bool</code>) – If `True`, removes empty lines.
- **remove_extra_whitespaces** (<code>bool</code>) – If `True`, removes extra whitespaces.
- **remove_repeated_substrings** (<code>bool</code>) – If `True`, removes repeated substrings (headers and footers) from pages.
  Pages must be separated by a form feed character "\\f", which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- **remove_substrings** (<code>list\[str\] | None</code>) – List of substrings to remove from the text.
- **remove_regex** (<code>str | None</code>) – Regex to match and replace substrings by "".
- **keep_id** (<code>bool</code>) – If `True`, keeps the IDs of the original documents.
- **unicode_normalization** (<code>Literal['NFC', 'NFKC', 'NFD', 'NFKD'] | None</code>) – Unicode normalization form to apply to the text.
  Note: This will run before any other steps.
- **ascii_only** (<code>bool</code>) – Whether to convert the text to ASCII only.
  Will remove accents from characters and replace them with ASCII characters.
  Other non-ASCII characters will be removed.
  Note: This will run before any pattern matching or removal.
- **strip_whitespaces** (<code>bool</code>) – If `True`, removes leading and trailing whitespace from the document content using Python's `str.strip()`. Unlike `remove_extra_whitespaces`, this only affects the beginning and end of the text, preserving internal whitespace (useful for markdown formatting).
- **replace_regexes** (<code>dict\[str, str\] | None</code>) – A dictionary mapping regex patterns to their replacement strings.
  For example, `{r'\n\n+': '\n'}` replaces multiple consecutive newlines with a single newline.
  This is applied after `remove_regex` and allows custom replacements instead of just removal.

#### run

```python
run(documents: list[Document])
```

Cleans up the documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to clean.

**Returns:**

- – A dictionary with the following key:
  - `documents`: List of cleaned Documents.

**Raises:**

- <code>TypeError</code> – if documents is not a list of Documents.
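
A further sketch showing the regex-based replacement and whitespace-stripping options (illustrative only; the replacement mapping follows the `replace_regexes` example given above, and the printed result depends on which cleaning steps are enabled):

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="  Heading\n\n\n\nBody text.  ")

# Collapse runs of newlines and strip leading/trailing whitespace,
# leaving internal formatting otherwise untouched.
cleaner = DocumentCleaner(
    remove_empty_lines=False,
    remove_extra_whitespaces=False,
    strip_whitespaces=True,
    replace_regexes={r"\n\n+": "\n"},
)
result = cleaner.run(documents=[doc])
print(repr(result["documents"][0].content))
```
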
## document_preprocessor

### DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component consists of a DocumentSplitter followed by a DocumentCleaner in a single pipeline.
It takes a list of documents as input and returns a processed list of documents.

Usage example:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```

#### __init__

```python
__init__(
    *,
    split_by: Literal[
        "function", "page", "passage", "period", "word", "line", "sentence"
    ] = "word",
    split_length: int = 250,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_repeated_substrings: bool = False,
    keep_id: bool = False,
    remove_substrings: list[str] | None = None,
    remove_regex: str | None = None,
    unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
    ascii_only: bool = False
) -> None
```

Initialize a DocumentPreprocessor that first splits and then cleans documents.

**Parameters:**

**Splitter Parameters**:

- **split_by** (<code>Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence']</code>) – The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
- **split_length** (<code>int</code>) – The maximum number of units (words, lines, pages, and so on) in each split.
- **split_overlap** (<code>int</code>) – The number of overlapping units between consecutive splits.
- **split_threshold** (<code>int</code>) – The minimum number of units per split. If a split is smaller than this, it's merged with the previous split.
- **splitting_function** (<code>Callable\[\[str\], list\[str\]\] | None</code>) – A custom function for splitting if `split_by="function"`.
- **respect_sentence_boundary** (<code>bool</code>) – If `True`, splits by words but tries not to break inside a sentence.
- **language** (<code>Language</code>) – Language used by the sentence tokenizer if `split_by="sentence"` or `respect_sentence_boundary=True`.
- **use_split_rules** (<code>bool</code>) – Whether to apply additional splitting heuristics for the sentence splitter.
- **extend_abbreviations** (<code>bool</code>) – Whether to extend the sentence splitter with curated abbreviations for certain languages.

**Cleaner Parameters**:

- **remove_empty_lines** (<code>bool</code>) – If `True`, removes empty lines.
- **remove_extra_whitespaces** (<code>bool</code>) – If `True`, removes extra whitespaces.
- **remove_repeated_substrings** (<code>bool</code>) – If `True`, removes repeated substrings like headers/footers across pages.
- **keep_id** (<code>bool</code>) – If `True`, keeps the original document IDs.
- **remove_substrings** (<code>list\[str\] | None</code>) – A list of strings to remove from the document content.
- **remove_regex** (<code>str | None</code>) – A regex pattern whose matches will be removed from the document content.
- **unicode_normalization** (<code>Literal['NFC', 'NFKC', 'NFD', 'NFKD'] | None</code>) – Unicode normalization form to apply to the text, for example `"NFC"`.
- **ascii_only** (<code>bool</code>) – If `True`, converts text to ASCII only.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize SuperComponent to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DocumentPreprocessor
```

Deserializes the SuperComponent from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>DocumentPreprocessor</code> – Deserialized SuperComponent.

## document_splitter

### DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations and prevents exceeding language model context limits.

The DocumentSplitter is compatible with the following DocumentStores:

- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) (limited support: overlapping information is not stored)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) (limited support: overlapping information is not stored)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
```

#### __init__

```python
__init__(
    split_by: Literal[
        "function", "page", "passage", "period", "word", "line", "sentence"
    ] = "word",
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    *,
    skip_empty_documents: bool = True
)
```

Initialize DocumentSplitter.

**Parameters:**

- **split_by** (<code>Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence']</code>) – The unit for splitting your documents. Choose from:
  - `word` for splitting by spaces (" ")
  - `period` for splitting by periods (".")
  - `page` for splitting by form feed ("\\f")
  - `passage` for splitting by double line breaks ("\\n\\n")
  - `line` for splitting each line ("\\n")
  - `sentence` for splitting by NLTK sentence tokenizer
- **split_length** (<code>int</code>) – The maximum number of units in each split.
- **split_overlap** (<code>int</code>) – The number of overlapping units for each split.
- **split_threshold** (<code>int</code>) – The minimum number of units per split. If a split has fewer units than the threshold, it's attached to the previous split.
- **splitting_function** (<code>Callable\[\[str\], list\[str\]\] | None</code>) – Necessary when `split_by` is set to "function".
  This is a function which must accept a single `str` as input and return a `list` of `str` as output, representing the chunks after splitting.
- **respect_sentence_boundary** (<code>bool</code>) – Choose whether to respect sentence boundaries when splitting by "word".
  If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
- **language** (<code>Language</code>) – Choose the language for the NLTK tokenizer. The default is English ("en").
- **use_split_rules** (<code>bool</code>) – Choose whether to use additional split rules when splitting by `sentence`.
- **extend_abbreviations** (<code>bool</code>) – Choose whether to extend NLTK's PunktTokenizer abbreviations with a list of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
- **skip_empty_documents** (<code>bool</code>) – Choose whether to skip documents with empty content. Default is True.
  Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents.

#### warm_up

```python
warm_up()
```

Warm up the DocumentSplitter by loading the sentence tokenizer.

#### run

```python
run(documents: list[Document])
```

Split documents into smaller parts.

Splits documents by the unit expressed in `split_by`, with a length of `split_length` and an overlap of `split_overlap`.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The documents to split.

**Returns:**

- – A dictionary with the following key:
  - `documents`: List of documents with the split texts. Each document includes:
    - A metadata field `source_id` to track the original document.
    - A metadata field `page_number` to track the original page number.
    - All other metadata copied from the original document.

**Raises:**

- <code>TypeError</code> – if the input is not a list of Documents.
- <code>ValueError</code> – if the content of a document is None.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DocumentSplitter
```

Deserializes the component from a dictionary.

## embedding_based_document_splitter

### EmbeddingBasedDocumentSplitter

Splits documents based on embedding similarity using cosine distances between sequential sentence groups.

This component first splits text into sentences, optionally groups them, calculates embeddings for each group, and then uses cosine distance between sequential embeddings to determine split points. Any distance above the specified percentile is treated as a break point. The component also tracks page numbers based on form feed characters ("\\f") in the original document.

This component is inspired by [5 Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt.
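
The break-point rule can be sketched in a few lines (illustrative only, not the component's implementation; embeddings are shown as plain NumPy vectors, and `percentile` follows the fraction-style default of the `percentile` parameter below):

```python
import numpy as np

def break_points(group_embeddings: np.ndarray, percentile: float = 0.95) -> list[int]:
    """Return indices of sentence groups after which a split would occur (sketch)."""
    # Cosine similarity between each pair of consecutive group embeddings
    normed = group_embeddings / np.linalg.norm(group_embeddings, axis=1, keepdims=True)
    similarities = np.sum(normed[:-1] * normed[1:], axis=1)
    distances = 1.0 - similarities
    # Any distance above the given percentile of all distances is a break point
    threshold = np.quantile(distances, percentile)
    return [i for i, d in enumerate(distances) if d > threshold]
```
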
### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,  # Group 2 sentences before calculating embeddings
    percentile=0.95,        # Split when cosine distance exceeds the 95th percentile
    min_length=50,          # Merge splits shorter than 50 characters
    max_length=1000,        # Further split chunks longer than 1000 characters
)
result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk
# Each split document includes metadata: source_id, split_id, and page_number
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")
```

#### __init__

```python
__init__(
    *,
    document_embedder: DocumentEmbedder,
    sentences_per_group: int = 3,
    percentile: float = 0.95,
    min_length: int = 50,
    max_length: int = 1000,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True
)
```

Initialize EmbeddingBasedDocumentSplitter.

**Parameters:**

- **document_embedder** (<code>DocumentEmbedder</code>) – The DocumentEmbedder to use for calculating embeddings.
- **sentences_per_group** (<code>int</code>) – Number of sentences to group together before embedding.
- **percentile** (<code>float</code>) – Percentile threshold for cosine distance. Distances above this percentile are treated as break points.
- **min_length** (<code>int</code>) – Minimum length of splits in characters. Splits below this length will be merged.
- **max_length** (<code>int</code>) – Maximum length of splits in characters. Splits above this length will be recursively split.
- **language** (<code>Language</code>) – Language for sentence tokenization.
- **use_split_rules** (<code>bool</code>) – Whether to use additional split rules for sentence tokenization. Applies additional split rules from SentenceSplitter to the sentence spans.
- **extend_abbreviations** (<code>bool</code>) – If True, the abbreviations used by NLTK's PunktTokenizer are extended by a list of curated abbreviations. Currently supported languages are: en, de.
  If False, the default abbreviations are used.

#### warm_up

```python
warm_up() -> None
```

Warm up the component by initializing the sentence splitter.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Split documents based on embedding similarity.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
  - `documents`: List of documents with the split texts. Each document includes:
    - A metadata field `source_id` to track the original document.
    - A metadata field `split_id` to track the split number.
    - A metadata field `page_number` to track the original page number.
    - All other metadata copied from the original document.

**Raises:**

- <code>RuntimeError</code> – If the component wasn't warmed up.
- <code>TypeError</code> – If the input is not a list of Documents.
- <code>ValueError</code> – If the document content is None or empty.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Serialized dictionary representation of the component.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> EmbeddingBasedDocumentSplitter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize and create the component.

**Returns:**

- <code>EmbeddingBasedDocumentSplitter</code> – The deserialized component.

## hierarchical_document_splitter

### HierarchicalDocumentSplitter

Splits documents into blocks of different sizes, building a hierarchical tree structure of blocks.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between are connected such that the smaller blocks are children of the larger parent blocks.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```

#### __init__

```python
__init__(
    block_sizes: set[int],
    split_overlap: int = 0,
    split_by: Literal["word", "sentence", "page", "passage"] = "word",
)
```

Initialize HierarchicalDocumentSplitter.

**Parameters:**

- **block_sizes** (<code>set\[int\]</code>) – Set of block sizes to split the document into. The blocks are split in descending order.
- **split_overlap** (<code>int</code>) – The number of overlapping units for each split.
- **split_by** (<code>Literal['word', 'sentence', 'page', 'passage']</code>) – The unit for splitting your documents.

#### run

```python
run(documents: list[Document])
```

Builds a hierarchical document structure for each document in a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to split into hierarchical blocks.

**Returns:**

- – List of HierarchicalDocument

#### build_hierarchy_from_doc

```python
build_hierarchy_from_doc(document: Document) -> list[Document]
```

Build a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes represented as HierarchicalDocument objects.

**Parameters:**

- **document** (<code>Document</code>) – Document to split into hierarchical blocks.

**Returns:**

- <code>list\[Document\]</code> – List of HierarchicalDocument

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Returns a dictionary representation of the component.

**Returns:**

- <code>dict\[str, Any\]</code> – Serialized dictionary representation of the component.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HierarchicalDocumentSplitter
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize and create the component.

**Returns:**

- <code>HierarchicalDocumentSplitter</code> – The deserialized component.

## markdown_header_splitter

### MarkdownHeaderSplitter

Split documents at ATX-style Markdown headers (#), with optional secondary splitting.

This component processes text documents by:

- Splitting them into chunks at Markdown headers (e.g., '#', '##', etc.), preserving header hierarchy as metadata.
- Optionally applying a secondary split (by word, passage, period, or line) to each chunk (using Haystack's DocumentSplitter).
- Preserving and propagating metadata such as parent headers, page numbers, and split IDs.
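
A minimal usage sketch (illustrative only; the exact chunk boundaries and metadata values depend on the header structure of the input):

```python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

doc = Document(
    content="# Title\nIntro paragraph.\n\n## Section 1\nFirst section text.\n\n## Section 2\nSecond section text."
)

splitter = MarkdownHeaderSplitter(keep_headers=True)
splitter.warm_up()
result = splitter.run(documents=[doc])

# Each chunk carries source_id, page_number, and split_id metadata
for chunk in result["documents"]:
    print(chunk.meta.get("split_id"), repr(chunk.content))
```
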
#### __init__

```python
__init__(
    *,
    page_break_character: str = "\x0c",
    keep_headers: bool = True,
    secondary_split: Literal["word", "passage", "period", "line"] | None = None,
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
    skip_empty_documents: bool = True
)
```

Initialize the MarkdownHeaderSplitter.

**Parameters:**

- **page_break_character** (<code>str</code>) – Character used to identify page breaks. Defaults to form feed ("\\x0c").
- **keep_headers** (<code>bool</code>) – If True, headers are kept in the content. If False, headers are moved to metadata.
  Defaults to True.
- **secondary_split** (<code>Literal['word', 'passage', 'period', 'line'] | None</code>) – Optional secondary split condition after header splitting.
  Options are None, "word", "passage", "period", "line". Defaults to None.
- **split_length** (<code>int</code>) – The maximum number of units in each split when using secondary splitting. Defaults to 200.
- **split_overlap** (<code>int</code>) – The number of overlapping units for each split when using secondary splitting. Defaults to 0.
- **split_threshold** (<code>int</code>) – The minimum number of units per split when using secondary splitting. Defaults to 0.
- **skip_empty_documents** (<code>bool</code>) – Choose whether to skip documents with empty content. Default is True.
  Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents.

#### warm_up

```python
warm_up()
```

Warm up the MarkdownHeaderSplitter.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Run the markdown header splitter with optional secondary splitting.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
  - `documents`: List of documents with the split texts. Each document includes:
    - A metadata field `source_id` to track the original document.
    - A metadata field `page_number` to track the original page number.
    - A metadata field `split_id` to identify the split chunk index within its parent document.
    - All other metadata copied from the original document.

**Raises:**

- <code>ValueError</code> – If a document has `None` content.
- <code>TypeError</code> – If a document's content is not a string.

## recursive_splitter

### RecursiveDocumentSplitter

Recursively chunk text into smaller chunks.

This component splits text into smaller chunks by recursively applying a list of separators to the text. The separators are applied in the order they are provided, with the last separator expected to be the most specific one.

Each separator is applied to the text; chunks that fit within the `split_length` are kept, and the next separator in the list is applied to any chunks that are still larger than `split_length`. This continues until all chunks are smaller than the `split_length` parameter.

Example:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science.
Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

#### __init__

```python
__init__(
    *,
    split_length: int = 200,
    split_overlap: int = 0,
    split_unit: Literal["word", "char", "token"] = "word",
    separators: list[str] | None = None,
    sentence_splitter_params: dict[str, Any] | None = None
)
```

Initializes a RecursiveDocumentSplitter.

**Parameters:**

- **split_length** (<code>int</code>) – The maximum length of each chunk, by default in words, but it can also be in characters or tokens.
  See the `split_unit` parameter.
- **split_overlap** (<code>int</code>) – The number of characters to overlap between consecutive chunks.
- **split_unit** (<code>Literal['word', 'char', 'token']</code>) – The unit of the `split_length` parameter. It can be either "word", "char", or "token".
  If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base).
- **separators** (<code>list\[str\] | None</code>) – An optional list of separator strings to use for splitting the text. The string separators will be treated as regular expressions unless the separator is "sentence", in which case the text will be split into sentences using a custom sentence tokenizer based on NLTK.
  See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter.
  If no separators are provided, the default separators ["\\n\\n", "sentence", "\\n", " "] are used.
- **sentence_splitter_params** (<code>dict\[str, Any\] | None</code>) – Optional parameters to pass to the sentence tokenizer.
  See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.

**Raises:**

- <code>ValueError</code> – If the overlap is greater than or equal to the chunk size, if the overlap is negative, or if any separator is not a string.

#### warm_up

```python
warm_up() -> None
```

Warm up the sentence tokenizer and tiktoken tokenizer if needed.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Split a list of documents into documents with smaller chunks of text.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary containing the key "documents" with a list of Documents holding the smaller text chunks produced from the input documents.
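
A further sketch showing the "sentence" separator, which relies on the NLTK-based sentence tokenizer, so `warm_up()` should be called before `run()` (illustrative only; the exact chunk boundaries depend on the tokenizer):

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

splitter = RecursiveDocumentSplitter(
    split_length=30,
    split_unit="word",
    separators=["\n\n", "sentence", " "],
)
splitter.warm_up()  # loads the sentence tokenizer used by the "sentence" separator

doc = Document(
    content="First sentence about one topic. Second sentence about the same topic. "
    "A third, longer sentence that moves on to something else entirely."
)
result = splitter.run(documents=[doc])
print([d.content for d in result["documents"]])
```
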
## text_cleaner

### TextCleaner

Cleans text strings.

It can remove substrings matching a list of regular expressions, convert text to lowercase, remove punctuation, and remove numbers.
Use it to clean up text data before evaluation.

### Usage example

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
```

#### __init__

```python
__init__(
    remove_regexps: list[str] | None = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
)
```

Initializes the TextCleaner component.

**Parameters:**

- **remove_regexps** (<code>list\[str\] | None</code>) – A list of regex patterns to remove matching substrings from the text.
- **convert_to_lowercase** (<code>bool</code>) – If `True`, converts all characters to lowercase.
- **remove_punctuation** (<code>bool</code>) – If `True`, removes punctuation from the text.
- **remove_numbers** (<code>bool</code>) – If `True`, removes numerical digits from the text.

#### run

```python
run(texts: list[str]) -> dict[str, Any]
```

Cleans up the given list of strings.

**Parameters:**

- **texts** (<code>list\[str\]</code>) – List of strings to clean.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following key:
  - `texts`: the cleaned list of strings.