---
title: "PreProcessors"
id: preprocessors-api
description: "Preprocess your Documents and texts. Clean, split, and more."
slug: "/preprocessors-api"
---

<a id="csv_document_cleaner"></a>

## Module csv\_document\_cleaner

<a id="csv_document_cleaner.CSVDocumentCleaner"></a>

### CSVDocumentCleaner

A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents, allowing
for the optional ignoring of a specified number of rows and columns before performing
the cleaning operation. Additionally, it provides options to keep document IDs and
control whether empty rows and columns should be removed.

<a id="csv_document_cleaner.CSVDocumentCleaner.__init__"></a>

#### CSVDocumentCleaner.\_\_init\_\_

```python
def __init__(*,
             ignore_rows: int = 0,
             ignore_columns: int = 0,
             remove_empty_rows: bool = True,
             remove_empty_columns: bool = True,
             keep_id: bool = False) -> None
```

Initializes the CSVDocumentCleaner component.

**Arguments**:

- `ignore_rows`: Number of rows to ignore from the top of the CSV table before processing.
- `ignore_columns`: Number of columns to ignore from the left of the CSV table before processing.
- `remove_empty_rows`: Whether to remove rows that are entirely empty.
- `remove_empty_columns`: Whether to remove columns that are entirely empty.
- `keep_id`: Whether to retain the original document ID in the output document.

Rows and columns ignored using these parameters are preserved in the final output, meaning
they are not considered when removing empty rows and columns.

<a id="csv_document_cleaner.CSVDocumentCleaner.run"></a>

#### CSVDocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans CSV documents by removing empty rows and columns while preserving specified ignored rows and columns.

Processing steps:

1. Reads each document's content as a CSV table.
2. Retains the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
3. Drops any rows and columns that are entirely empty (if enabled by `remove_empty_rows` and
   `remove_empty_columns`).
4. Reattaches the ignored rows and columns to maintain their original positions.
5. Returns the cleaned CSV content as a new `Document` object, with an option to retain the original
   document ID.

**Arguments**:

- `documents`: List of Documents containing CSV-formatted content.

**Returns**:

A dictionary with a list of cleaned Documents under the key `"documents"`.
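### Usage example

A minimal sketch of cleaning a CSV document; the sample table below (with an entirely empty third row and an entirely empty last column) is illustrative:

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentCleaner

# The third row and the last column are entirely empty (illustrative content)
csv_content = "col1,col2,\nvalue1,value2,\n,,\nvalue4,value5,\n"
doc = Document(content=csv_content)

cleaner = CSVDocumentCleaner(remove_empty_rows=True, remove_empty_columns=True)
result = cleaner.run(documents=[doc])
print(result["documents"][0].content)  # cleaned CSV without the empty row and column
```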
<a id="csv_document_splitter"></a>

## Module csv\_document\_splitter

<a id="csv_document_splitter.CSVDocumentSplitter"></a>

### CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:
- identify consecutive empty rows or columns that exceed a given threshold
  and use them as delimiters to segment the document into smaller tables.
- split each row into a separate sub-table, represented as a Document.

<a id="csv_document_splitter.CSVDocumentSplitter.__init__"></a>

#### CSVDocumentSplitter.\_\_init\_\_

```python
def __init__(row_split_threshold: int | None = 2,
             column_split_threshold: int | None = 2,
             read_csv_kwargs: dict[str, Any] | None = None,
             split_mode: SplitMode = "threshold") -> None
```

Initializes the CSVDocumentSplitter component.

**Arguments**:

- `row_split_threshold`: The minimum number of consecutive empty rows required to trigger a split.
- `column_split_threshold`: The minimum number of consecutive empty columns required to trigger a split.
- `read_csv_kwargs`: Additional keyword arguments to pass to `pandas.read_csv`.
By default, the component calls `pandas.read_csv` with these options:
  - `header=None`
  - `skip_blank_lines=False` to preserve blank lines
  - `dtype=object` to prevent type inference (for example, converting numbers to floats).
See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
- `split_mode`: If `threshold`, the component splits the document based on the number of
consecutive empty rows or columns that exceed the `row_split_threshold` or `column_split_threshold`.
If `row-wise`, the component splits each row into a separate sub-table.

<a id="csv_document_splitter.CSVDocumentSplitter.run"></a>

#### CSVDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Processes and splits a list of CSV documents into multiple sub-tables.

**Splitting Process:**

1. Applies a row-based split if `row_split_threshold` is provided.
2. Applies a column-based split if `column_split_threshold` is provided.
3. If both thresholds are specified, performs a recursive split by rows first, then columns, ensuring
   further fragmentation of any sub-tables that still contain empty sections.
4. Sorts the resulting sub-tables based on their original positions within the document.

**Arguments**:

- `documents`: A list of Documents containing CSV-formatted content.
Each document is assumed to contain one or more tables separated by empty rows or columns.

**Returns**:

A dictionary with a key `"documents"`, mapping to a list of new `Document` objects,
each representing an extracted sub-table from the original CSV.
The metadata of each document includes:
- A field `source_id` to track the original document.
- A field `row_idx_start` to indicate the starting row index of the sub-table in the original table.
- A field `col_idx_start` to indicate the starting column index of the sub-table in the original table.
- A field `split_id` to indicate the order of the split in the original document.
- All other metadata copied from the original document.

If a document cannot be processed, it is returned unchanged.
The `meta` field from the original document is preserved in the split documents.
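### Usage example

A minimal sketch of threshold-based splitting; the two sub-tables below, separated by two consecutive empty rows, are illustrative:

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentSplitter

# Two tables separated by two consecutive empty rows (illustrative content)
csv_content = "A,B\n1,2\n,\n,\nC,D\n3,4\n"
doc = Document(content=csv_content)

# Only split on empty rows; disable column-based splitting
splitter = CSVDocumentSplitter(row_split_threshold=2, column_split_threshold=None)
result = splitter.run(documents=[doc])
for sub_table in result["documents"]:
    print(sub_table.meta.get("row_idx_start"), repr(sub_table.content))
```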
<a id="document_cleaner"></a>

## Module document\_cleaner

<a id="document_cleaner.DocumentCleaner"></a>

### DocumentCleaner

Cleans the text in the documents.

It removes extra whitespaces, empty lines, specified substrings, substrings matching regexes,
and page headers and footers (in this order).

### Usage example:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This is a document to clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```

<a id="document_cleaner.DocumentCleaner.__init__"></a>

#### DocumentCleaner.\_\_init\_\_

```python
def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"]
             | None = None,
             ascii_only: bool = False)
```

Initialize DocumentCleaner.

**Arguments**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings (headers and footers) from pages.
Pages must be separated by a form feed character "\f",
which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- `remove_substrings`: List of substrings to remove from the text.
- `remove_regex`: Regex to match and replace substrings with "".
- `keep_id`: If `True`, keeps the IDs of the original documents.
- `unicode_normalization`: Unicode normalization form to apply to the text.
Note: This runs before any other steps.
- `ascii_only`: Whether to convert the text to ASCII only.
Removes accents from characters and replaces them with ASCII characters.
Other non-ASCII characters are removed.
Note: This runs before any pattern matching or removal.

<a id="document_cleaner.DocumentCleaner.run"></a>

#### DocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Cleans up the documents.

**Arguments**:

- `documents`: List of Documents to clean.

**Raises**:

- `TypeError`: if documents is not a list of Documents.

**Returns**:

A dictionary with the following key:
- `documents`: List of cleaned Documents.

<a id="document_preprocessor"></a>

## Module document\_preprocessor

<a id="document_preprocessor.DocumentPreprocessor"></a>

### DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component consists of a DocumentSplitter followed by a DocumentCleaner in a single pipeline.
It takes a list of documents as input and returns a processed list of documents.
Usage example:
```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```

<a id="document_preprocessor.DocumentPreprocessor.__init__"></a>

#### DocumentPreprocessor.\_\_init\_\_

```python
def __init__(*,
             split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 250,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"]
             | None = None,
             ascii_only: bool = False) -> None
```

Initialize a DocumentPreprocessor that first splits and then cleans documents.

**Arguments**:

**Splitter Parameters**:

- `split_by`: The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
- `split_length`: The maximum number of units (words, lines, pages, and so on) in each split.
- `split_overlap`: The number of overlapping units between consecutive splits.
- `split_threshold`: The minimum number of units per split. If a split is smaller than this, it's merged
with the previous split.
- `splitting_function`: A custom function for splitting if `split_by="function"`.
- `respect_sentence_boundary`: If `True`, splits by words but tries not to break inside a sentence.
- `language`: Language used by the sentence tokenizer if `split_by="sentence"` or
`respect_sentence_boundary=True`.
- `use_split_rules`: Whether to apply additional splitting heuristics for the sentence splitter.
- `extend_abbreviations`: Whether to extend the sentence splitter with curated abbreviations for certain
languages.

**Cleaner Parameters**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings like headers/footers across pages.
- `keep_id`: If `True`, keeps the original document IDs.
- `remove_substrings`: A list of strings to remove from the document content.
- `remove_regex`: A regex pattern whose matches will be removed from the document content.
- `unicode_normalization`: Unicode normalization form to apply to the text, for example `"NFC"`.
- `ascii_only`: If `True`, converts text to ASCII only.
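A sketch of a non-default configuration that combines splitter and cleaner parameters; the values and the `"CONFIDENTIAL"` marker are illustrative:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

preprocessor = DocumentPreprocessor(
    split_by="sentence",            # split on sentence boundaries
    split_length=5,                 # at most 5 sentences per chunk
    split_overlap=1,                # 1 sentence of overlap between consecutive chunks
    remove_extra_whitespaces=True,
    remove_substrings=["CONFIDENTIAL"],  # strip a boilerplate marker (illustrative)
)
doc = Document(content="First sentence.  Second sentence. CONFIDENTIAL Third sentence.")
result = preprocessor.run(documents=[doc])
print([d.content for d in result["documents"]])
```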
<a id="document_preprocessor.DocumentPreprocessor.to_dict"></a>

#### DocumentPreprocessor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serialize SuperComponent to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="document_preprocessor.DocumentPreprocessor.from_dict"></a>

#### DocumentPreprocessor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentPreprocessor"
```

Deserializes the SuperComponent from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized SuperComponent.

<a id="document_splitter"></a>

## Module document\_splitter

<a id="document_splitter.DocumentSplitter"></a>

### DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations
and prevents exceeding language model context limits.

The DocumentSplitter is compatible with the following DocumentStores:
- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) (limited support: overlapping information is
not stored)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) (limited support: overlapping
information is not stored)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
```

<a id="document_splitter.DocumentSplitter.__init__"></a>

#### DocumentSplitter.\_\_init\_\_

```python
def __init__(split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             *,
             skip_empty_documents: bool = True)
```

Initialize DocumentSplitter.

**Arguments**:

- `split_by`: The unit for splitting your documents. Choose from:
  - `word` for splitting by spaces (" ")
  - `period` for splitting by periods (".")
  - `page` for splitting by form feed ("\f")
  - `passage` for splitting by double line breaks ("\n\n")
  - `line` for splitting each line ("\n")
  - `sentence` for splitting by the NLTK sentence tokenizer
- `split_length`: The maximum number of units in each split.
- `split_overlap`: The number of overlapping units for each split.
- `split_threshold`: The minimum number of units per split. If a split has fewer units
than the threshold, it's attached to the previous split.
- `splitting_function`: Necessary when `split_by` is set to "function".
This is a function which must accept a single `str` as input and return a `list` of `str` as output,
representing the chunks after splitting.
- `respect_sentence_boundary`: Choose whether to respect sentence boundaries when splitting by "word".
If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
- `language`: Choose the language for the NLTK tokenizer. The default is English ("en").
- `use_split_rules`: Choose whether to use additional split rules when splitting by `sentence`.
- `extend_abbreviations`: Choose whether to extend NLTK's PunktTokenizer abbreviations with a list
of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
- `skip_empty_documents`: Choose whether to skip documents with empty content. Default is True.
Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
from non-textual documents.

<a id="document_splitter.DocumentSplitter.warm_up"></a>

#### DocumentSplitter.warm\_up

```python
def warm_up()
```

Warm up the DocumentSplitter by loading the sentence tokenizer.

<a id="document_splitter.DocumentSplitter.run"></a>

#### DocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Split documents into smaller parts.

Splits documents by the unit expressed in `split_by`, with a length of `split_length`
and an overlap of `split_overlap`.

**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `TypeError`: if the input is not a list of Documents.
- `ValueError`: if the content of a document is None.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

<a id="document_splitter.DocumentSplitter.to_dict"></a>

#### DocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

<a id="document_splitter.DocumentSplitter.from_dict"></a>

#### DocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentSplitter"
```

Deserializes the component from a dictionary.
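A sketch of word-based splitting that keeps sentences intact; the parameter values are illustrative:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter(
    split_by="word",
    split_length=20,
    split_overlap=4,
    respect_sentence_boundary=True,  # avoid breaking a chunk mid-sentence
)
splitter.warm_up()  # loads the NLTK sentence tokenizer

doc = Document(content="Moonlight shimmered softly. Wolves howled nearby. Night enveloped everything.")
result = splitter.run(documents=[doc])
for chunk in result["documents"]:
    print(chunk.meta["source_id"], repr(chunk.content))
```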
<a id="embedding_based_document_splitter"></a>

## Module embedding\_based\_document\_splitter

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter"></a>

### EmbeddingBasedDocumentSplitter

Splits documents based on embedding similarity using cosine distances between sequential sentence groups.

This component first splits text into sentences, optionally groups them, calculates embeddings for each group,
and then uses cosine distance between sequential embeddings to determine split points. Any distance above
the specified percentile is treated as a break point. The component also tracks page numbers based on form feed
characters ("\f") in the original document.

This component is inspired by [5 Levels of Text Splitting](
https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb
) by Greg Kamradt.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,  # Group 2 sentences before calculating embeddings
    percentile=0.95,        # Split when cosine distance exceeds the 95th percentile
    min_length=50,          # Merge splits shorter than 50 characters
    max_length=1000,        # Further split chunks longer than 1000 characters
)
splitter.warm_up()
result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk
# Each split document includes metadata: source_id, split_id, and page_number
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")
```

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.__init__"></a>

#### EmbeddingBasedDocumentSplitter.\_\_init\_\_

```python
def __init__(*,
             document_embedder: DocumentEmbedder,
             sentences_per_group: int = 3,
             percentile: float = 0.95,
             min_length: int = 50,
             max_length: int = 1000,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True)
```

Initialize EmbeddingBasedDocumentSplitter.

**Arguments**:

- `document_embedder`: The DocumentEmbedder to use for calculating embeddings.
- `sentences_per_group`: Number of sentences to group together before embedding.
- `percentile`: Percentile threshold for cosine distance. Distances above this percentile
are treated as break points.
- `min_length`: Minimum length of splits in characters. Splits below this length will be merged.
- `max_length`: Maximum length of splits in characters. Splits above this length will be recursively split.
- `language`: Language for sentence tokenization.
- `use_split_rules`: Whether to use additional split rules for sentence tokenization. Applies additional
split rules from SentenceSplitter to the sentence spans.
- `extend_abbreviations`: If True, the abbreviations used by NLTK's PunktTokenizer are extended by a list
of curated abbreviations. Currently supported languages are: en, de.
If False, the default abbreviations are used.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.warm_up"></a>

#### EmbeddingBasedDocumentSplitter.warm\_up

```python
def warm_up() -> None
```

Warm up the component by initializing the sentence splitter.
<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.run"></a>

#### EmbeddingBasedDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Split documents based on embedding similarity.

**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `RuntimeError`: If the component wasn't warmed up.
- `TypeError`: If the input is not a list of Documents.
- `ValueError`: If the document content is None or empty.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `split_id` to track the split number.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.to_dict"></a>

#### EmbeddingBasedDocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Serialized dictionary representation of the component.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.from_dict"></a>

#### EmbeddingBasedDocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "EmbeddingBasedDocumentSplitter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize and create the component.

**Returns**:

The deserialized component.

<a id="hierarchical_document_splitter"></a>

## Module hierarchical\_document\_splitter

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter"></a>

### HierarchicalDocumentSplitter

Splits documents into blocks of different sizes, building a hierarchical tree structure.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in
between are connected such that the smaller blocks are children of the larger parent blocks.
### Usage example
```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.__init__"></a>

#### HierarchicalDocumentSplitter.\_\_init\_\_

```python
def __init__(block_sizes: set[int],
             split_overlap: int = 0,
             split_by: Literal["word", "sentence", "page",
                               "passage"] = "word")
```

Initialize HierarchicalDocumentSplitter.

**Arguments**:

- `block_sizes`: Set of block sizes to split the document into. The blocks are split in descending order.
- `split_overlap`: The number of overlapping units for each split.
- `split_by`: The unit for splitting your documents.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.run"></a>

#### HierarchicalDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Builds a hierarchical document structure for each document in a list of documents.

**Arguments**:

- `documents`: List of Documents to split into hierarchical blocks.

**Returns**:

A dictionary with a key `"documents"`, containing the list of hierarchical Documents.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.build_hierarchy_from_doc"></a>

#### HierarchicalDocumentSplitter.build\_hierarchy\_from\_doc

```python
def build_hierarchy_from_doc(document: Document) -> list[Document]
```

Build a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes represented
as HierarchicalDocument objects.

**Arguments**:

- `document`: Document to split into hierarchical blocks.
**Returns**:

List of HierarchicalDocument

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.to_dict"></a>

#### HierarchicalDocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Returns a dictionary representation of the component.

**Returns**:

Serialized dictionary representation of the component.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.from_dict"></a>

#### HierarchicalDocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HierarchicalDocumentSplitter"
```

Deserialize this component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize and create the component.

**Returns**:

The deserialized component.

<a id="recursive_splitter"></a>

## Module recursive\_splitter

<a id="recursive_splitter.RecursiveDocumentSplitter"></a>

### RecursiveDocumentSplitter

Recursively chunks text into smaller chunks.

This component splits text by recursively applying a list of separators. The separators are applied in the
order they are provided, with the last separator being the most specific one.

Each separator is applied to the text in turn. Chunks that fit within `split_length` are kept; chunks that are
still larger than `split_length` are split again with the next separator in the list. This continues until all
chunks are smaller than the `split_length` parameter.

**Example**:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science.
Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

<a id="recursive_splitter.RecursiveDocumentSplitter.__init__"></a>

#### RecursiveDocumentSplitter.\_\_init\_\_

```python
def __init__(*,
             split_length: int = 200,
             split_overlap: int = 0,
             split_unit: Literal["word", "char", "token"] = "word",
             separators: list[str] | None = None,
             sentence_splitter_params: dict[str, Any] | None = None)
```

Initializes a RecursiveDocumentSplitter.

**Arguments**:

- `split_length`: The maximum length of each chunk, by default in words, but can be in characters or tokens.
See the `split_unit` parameter.
- `split_overlap`: The number of characters to overlap between consecutive chunks.
- `split_unit`: The unit of the `split_length` parameter. It can be either "word", "char", or "token".
If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base).
- `separators`: An optional list of separator strings to use for splitting the text. The string
separators will be treated as regular expressions unless the separator is "sentence", in which case the
text will be split into sentences using a custom sentence tokenizer based on NLTK.
See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter.
If no separators are provided, the default separators ["\n\n", "sentence", "\n", " "] are used.
- `sentence_splitter_params`: Optional parameters to pass to the sentence tokenizer.
See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.

**Raises**:

- `ValueError`: If the overlap is greater than or equal to the chunk size, if the overlap is negative, or
if any separator is not a string.
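A sketch of character-based chunking with the default separators; the length values are illustrative:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

# Measure chunk lengths in characters instead of words
chunker = RecursiveDocumentSplitter(split_length=80, split_overlap=10, split_unit="char")
chunker.warm_up()  # required because the default separators include "sentence"

doc = Document(content="AI is intelligence exhibited by machines. It is widely used in industry, government, and science.")
result = chunker.run([doc])
for chunk in result["documents"]:
    print(repr(chunk.content))
```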
<a id="recursive_splitter.RecursiveDocumentSplitter.warm_up"></a>

#### RecursiveDocumentSplitter.warm\_up

```python
def warm_up() -> None
```

Warm up the sentence tokenizer and tiktoken tokenizer if needed.

<a id="recursive_splitter.RecursiveDocumentSplitter.run"></a>

#### RecursiveDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Split a list of documents into documents with smaller chunks of text.

**Arguments**:

- `documents`: List of Documents to split.

**Returns**:

A dictionary containing a key "documents" with a list of Documents with smaller chunks of text corresponding
to the input documents.

<a id="text_cleaner"></a>

## Module text\_cleaner

<a id="text_cleaner.TextCleaner"></a>

### TextCleaner

Cleans text strings.

It can remove substrings matching a list of regular expressions, convert text to lowercase,
remove punctuation, and remove numbers.
Use it to clean up text data before evaluation.

### Usage example

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
```

<a id="text_cleaner.TextCleaner.__init__"></a>

#### TextCleaner.\_\_init\_\_

```python
def __init__(remove_regexps: list[str] | None = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)
```

Initializes the TextCleaner component.

**Arguments**:

- `remove_regexps`: A list of regex patterns to remove matching substrings from the text.
- `convert_to_lowercase`: If `True`, converts all characters to lowercase.
- `remove_punctuation`: If `True`, removes punctuation from the text.
- `remove_numbers`: If `True`, removes numerical digits from the text.

<a id="text_cleaner.TextCleaner.run"></a>

#### TextCleaner.run

```python
@component.output_types(texts=list[str])
def run(texts: list[str]) -> dict[str, Any]
```

Cleans up the given list of strings.

**Arguments**:

- `texts`: List of strings to clean.

**Returns**:

A dictionary with the following key:
- `texts`: the cleaned list of strings.
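A sketch of regex-based cleaning; the citation-marker pattern below is illustrative:

```python
from haystack.components.preprocessors import TextCleaner

# Strip bracketed citation markers such as "[1]" before evaluation
cleaner = TextCleaner(remove_regexps=[r"\[\d+\]"], convert_to_lowercase=True)
result = cleaner.run(texts=["Wolves howled nearby [1]. Night enveloped everything [2]."])
print(result["texts"])
```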