---
title: "PreProcessors"
id: preprocessors-api
description: "Preprocess your Documents and texts. Clean, split, and more."
slug: "/preprocessors-api"
---

<a id="csv_document_cleaner"></a>

## Module csv\_document\_cleaner

<a id="csv_document_cleaner.CSVDocumentCleaner"></a>

### CSVDocumentCleaner

A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents, allowing
for the optional ignoring of a specified number of rows and columns before performing
the cleaning operation. Additionally, it provides options to keep document IDs and
control whether empty rows and columns should be removed.

<a id="csv_document_cleaner.CSVDocumentCleaner.__init__"></a>

#### CSVDocumentCleaner.\_\_init\_\_

```python
def __init__(*,
             ignore_rows: int = 0,
             ignore_columns: int = 0,
             remove_empty_rows: bool = True,
             remove_empty_columns: bool = True,
             keep_id: bool = False) -> None
```

Initializes the CSVDocumentCleaner component.

**Arguments**:

- `ignore_rows`: Number of rows to ignore from the top of the CSV table before processing.
- `ignore_columns`: Number of columns to ignore from the left of the CSV table before processing.
- `remove_empty_rows`: Whether to remove rows that are entirely empty.
- `remove_empty_columns`: Whether to remove columns that are entirely empty.
- `keep_id`: Whether to retain the original document ID in the output document.

Rows and columns ignored using `ignore_rows` and `ignore_columns` are preserved in the final output,
meaning they are not considered when removing empty rows and columns.

<a id="csv_document_cleaner.CSVDocumentCleaner.run"></a>

#### CSVDocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans CSV documents by removing empty rows and columns while preserving specified ignored rows and columns.

**Arguments**:

- `documents`: List of Documents containing CSV-formatted content.

**Returns**:

A dictionary with a list of cleaned Documents under the key "documents".

Processing steps:
1. Reads each document's content as a CSV table.
2. Retains the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
3. Drops any rows and columns that are entirely empty (if enabled by `remove_empty_rows` and
`remove_empty_columns`).
4. Reattaches the ignored rows and columns to maintain their original positions.
5. Returns the cleaned CSV content as a new `Document` object, with an option to retain the original
document ID.

<a id="csv_document_splitter"></a>

## Module csv\_document\_splitter

<a id="csv_document_splitter.CSVDocumentSplitter"></a>

### CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:
- identify consecutive empty rows or columns that exceed a given threshold
and use them as delimiters to segment the document into smaller tables.
- split each row into a separate sub-table, represented as a Document.
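
### Usage example

A minimal sketch of the threshold mode. The CSV content and threshold values below are illustrative assumptions; the component name and `run` API come from this page.

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentSplitter

# Two tables stacked in one CSV, separated by two fully empty rows
csv_content = "A,B\n1,2\n,\n,\nC,D\n3,4\n"
doc = Document(content=csv_content)

# In the default "threshold" mode, two consecutive empty rows trigger a split;
# passing None for column_split_threshold disables column-based splitting
splitter = CSVDocumentSplitter(row_split_threshold=2, column_split_threshold=None)
result = splitter.run(documents=[doc])
# result["documents"] should hold one Document per detected sub-table
```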

<a id="csv_document_splitter.CSVDocumentSplitter.__init__"></a>

#### CSVDocumentSplitter.\_\_init\_\_

```python
def __init__(row_split_threshold: int | None = 2,
             column_split_threshold: int | None = 2,
             read_csv_kwargs: dict[str, Any] | None = None,
             split_mode: SplitMode = "threshold") -> None
```

Initializes the CSVDocumentSplitter component.

**Arguments**:

- `row_split_threshold`: The minimum number of consecutive empty rows required to trigger a split.
- `column_split_threshold`: The minimum number of consecutive empty columns required to trigger a split.
- `read_csv_kwargs`: Additional keyword arguments to pass to `pandas.read_csv`.
By default, the component reads CSV content with the options:
  - `header=None`
  - `skip_blank_lines=False` to preserve blank lines
  - `dtype=object` to prevent type inference (for example, converting numbers to floats).

  See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
- `split_mode`: If `threshold`, the component splits the document based on the number of
consecutive empty rows or columns that exceed the `row_split_threshold` or `column_split_threshold`.
If `row-wise`, the component splits each row into a separate sub-table.

<a id="csv_document_splitter.CSVDocumentSplitter.run"></a>

#### CSVDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Processes and splits a list of CSV documents into multiple sub-tables.

**Splitting Process:**
1. Applies a row-based split if `row_split_threshold` is provided.
2. Applies a column-based split if `column_split_threshold` is provided.
3. If both thresholds are specified, performs a recursive split by rows first, then columns, ensuring
further fragmentation of any sub-tables that still contain empty sections.
4. Sorts the resulting sub-tables based on their original positions within the document.

**Arguments**:

- `documents`: A list of Documents containing CSV-formatted content.
Each document is assumed to contain one or more tables separated by empty rows or columns.

**Returns**:

A dictionary with a key `"documents"`, mapping to a list of new `Document` objects,
each representing an extracted sub-table from the original CSV.
The metadata of each document includes:
- A field `source_id` to track the original document.
- A field `row_idx_start` to indicate the starting row index of the sub-table in the original table.
- A field `col_idx_start` to indicate the starting column index of the sub-table in the original table.
- A field `split_id` to indicate the order of the split in the original document.
- All other metadata copied from the original document.

Notes:
- If a document cannot be processed, it is returned unchanged.
- The `meta` field from the original document is preserved in the split documents.

<a id="document_cleaner"></a>

## Module document\_cleaner

<a id="document_cleaner.DocumentCleaner"></a>

### DocumentCleaner

Cleans the text in the documents.

It removes extra whitespaces,
empty lines, specified substrings, regexes,
and page headers and footers (in this order).

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This is a document to clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```

<a id="document_cleaner.DocumentCleaner.__init__"></a>

#### DocumentCleaner.\_\_init\_\_

```python
def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"]
             | None = None,
             ascii_only: bool = False,
             strip_whitespaces: bool = False,
             replace_regexes: dict[str, str] | None = None)
```

Initialize DocumentCleaner.

**Arguments**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings (headers and footers) from pages.
Pages must be separated by a form feed character "\f",
which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- `remove_substrings`: List of substrings to remove from the text.
- `remove_regex`: Regex to match and replace substrings with "".
- `keep_id`: If `True`, keeps the IDs of the original documents.
- `unicode_normalization`: Unicode normalization form to apply to the text.
Note: This will run before any other steps.
- `ascii_only`: Whether to convert the text to ASCII only.
Will remove accents from characters and replace them with ASCII characters.
Other non-ASCII characters will be removed.
Note: This will run before any pattern matching or removal.
- `strip_whitespaces`: If `True`, removes leading and trailing whitespace from the document content
using Python's `str.strip()`. Unlike `remove_extra_whitespaces`, this only affects the beginning
and end of the text, preserving internal whitespace (useful for Markdown formatting).
- `replace_regexes`: A dictionary mapping regex patterns to their replacement strings.
For example, `{r'\n\n+': '\n'}` replaces multiple consecutive newlines with a single newline.
This is applied after `remove_regex` and allows custom replacements instead of just removal.

<a id="document_cleaner.DocumentCleaner.run"></a>

#### DocumentCleaner.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Cleans up the documents.

**Arguments**:

- `documents`: List of Documents to clean.

**Raises**:

- `TypeError`: if documents is not a list of Documents.

**Returns**:

A dictionary with the following key:
- `documents`: List of cleaned Documents.

<a id="document_preprocessor"></a>

## Module document\_preprocessor

<a id="document_preprocessor.DocumentPreprocessor"></a>

### DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component consists of a DocumentSplitter followed by a DocumentCleaner in a single pipeline.
It takes a list of documents as input and returns a processed list of documents.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```

<a id="document_preprocessor.DocumentPreprocessor.__init__"></a>

#### DocumentPreprocessor.\_\_init\_\_

```python
def __init__(*,
             split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 250,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: list[str] | None = None,
             remove_regex: str | None = None,
             unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"]
             | None = None,
             ascii_only: bool = False) -> None
```

Initialize a DocumentPreprocessor that first splits and then cleans documents.

**Splitter Parameters**:

**Arguments**:

- `split_by`: The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
- `split_length`: The maximum number of units (words, lines, pages, and so on) in each split.
- `split_overlap`: The number of overlapping units between consecutive splits.
- `split_threshold`: The minimum number of units per split. If a split is smaller than this, it's merged
with the previous split.
- `splitting_function`: A custom function for splitting if `split_by="function"`.
- `respect_sentence_boundary`: If `True`, splits by words but tries not to break inside a sentence.
- `language`: Language used by the sentence tokenizer if `split_by="sentence"` or
`respect_sentence_boundary=True`.
- `use_split_rules`: Whether to apply additional splitting heuristics for the sentence splitter.
- `extend_abbreviations`: Whether to extend the sentence splitter with curated abbreviations for certain
languages.

**Cleaner Parameters**:

- `remove_empty_lines`: If `True`, removes empty lines.
- `remove_extra_whitespaces`: If `True`, removes extra whitespaces.
- `remove_repeated_substrings`: If `True`, removes repeated substrings like headers/footers across pages.
- `keep_id`: If `True`, keeps the original document IDs.
- `remove_substrings`: A list of strings to remove from the document content.
- `remove_regex`: A regex pattern whose matches will be removed from the document content.
- `unicode_normalization`: Unicode normalization form to apply to the text, for example `"NFC"`.
- `ascii_only`: If `True`, converts text to ASCII only.

<a id="document_preprocessor.DocumentPreprocessor.to_dict"></a>

#### DocumentPreprocessor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serialize SuperComponent to a dictionary.

**Returns**:

Dictionary with serialized data.
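
A minimal serialization round-trip sketch, using `to_dict` together with `from_dict` documented below (the parameter values are arbitrary examples):

```python
from haystack.components.preprocessors import DocumentPreprocessor

preprocessor = DocumentPreprocessor(split_by="word", split_length=100)
data = preprocessor.to_dict()  # a plain dict, suitable for JSON/YAML storage
restored = DocumentPreprocessor.from_dict(data)  # rebuilds an equivalent component
```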

<a id="document_preprocessor.DocumentPreprocessor.from_dict"></a>

#### DocumentPreprocessor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentPreprocessor"
```

Deserializes the SuperComponent from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized SuperComponent.

<a id="document_splitter"></a>

## Module document\_splitter

<a id="document_splitter.DocumentSplitter"></a>

### DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations
and prevents exceeding language model context limits.

The DocumentSplitter is compatible with the following DocumentStores:
- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) limited support, overlapping information is
not stored
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) limited support, overlapping
information is not stored
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
```

<a id="document_splitter.DocumentSplitter.__init__"></a>

#### DocumentSplitter.\_\_init\_\_

```python
def __init__(split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Callable[[str], list[str]] | None = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             *,
             skip_empty_documents: bool = True)
```

Initialize DocumentSplitter.

**Arguments**:

- `split_by`: The unit for splitting your documents. Choose from:
  - `word` for splitting by spaces (" ")
  - `period` for splitting by periods (".")
  - `page` for splitting by form feed ("\f")
  - `passage` for splitting by double line breaks ("\n\n")
  - `line` for splitting each line ("\n")
  - `sentence` for splitting by the NLTK sentence tokenizer
- `split_length`: The maximum number of units in each split.
- `split_overlap`: The number of overlapping units for each split.
- `split_threshold`: The minimum number of units per split. If a split has fewer units
than the threshold, it's attached to the previous split.
- `splitting_function`: Necessary when `split_by` is set to "function".
This is a function which must accept a single `str` as input and return a `list` of `str` as output,
representing the chunks after splitting.
- `respect_sentence_boundary`: Choose whether to respect sentence boundaries when splitting by "word".
If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
- `language`: Choose the language for the NLTK tokenizer. The default is English ("en").
- `use_split_rules`: Choose whether to use additional split rules when splitting by `sentence`.
- `extend_abbreviations`: Choose whether to extend NLTK's PunktTokenizer abbreviations with a list
of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
- `skip_empty_documents`: Choose whether to skip documents with empty content. Default is True.
Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
from non-textual documents.

<a id="document_splitter.DocumentSplitter.warm_up"></a>

#### DocumentSplitter.warm\_up

```python
def warm_up()
```

Warm up the DocumentSplitter by loading the sentence tokenizer.

<a id="document_splitter.DocumentSplitter.run"></a>

#### DocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Split documents into smaller parts.

Splits documents by the unit expressed in `split_by`, with a length of `split_length`
and an overlap of `split_overlap`.

**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `TypeError`: if the input is not a list of Documents.
- `ValueError`: if the content of a document is None.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

<a id="document_splitter.DocumentSplitter.to_dict"></a>

#### DocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

<a id="document_splitter.DocumentSplitter.from_dict"></a>

#### DocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentSplitter"
```

Deserializes the component from a dictionary.

<a id="embedding_based_document_splitter"></a>

## Module embedding\_based\_document\_splitter

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter"></a>

### EmbeddingBasedDocumentSplitter

Splits documents based on embedding similarity using cosine distances between sequential sentence groups.

This component first splits text into sentences, optionally groups them, calculates embeddings for each group,
and then uses cosine distance between sequential embeddings to determine split points. Any distance above
the specified percentile is treated as a break point. The component also tracks page numbers based on form feed
characters ("\f") in the original document.

This component is inspired by [5 Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)
by Greg Kamradt.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,  # Group 2 sentences before calculating embeddings
    percentile=0.95,        # Split when cosine distance exceeds the 95th percentile
    min_length=50,          # Merge splits shorter than 50 characters
    max_length=1000         # Further split chunks longer than 1000 characters
)
splitter.warm_up()
result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk
# Each split document includes metadata: source_id, split_id, and page_number
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")
```

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.__init__"></a>

#### EmbeddingBasedDocumentSplitter.\_\_init\_\_

```python
def __init__(*,
             document_embedder: DocumentEmbedder,
             sentences_per_group: int = 3,
             percentile: float = 0.95,
             min_length: int = 50,
             max_length: int = 1000,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True)
```

Initialize EmbeddingBasedDocumentSplitter.

**Arguments**:

- `document_embedder`: The DocumentEmbedder to use for calculating embeddings.
- `sentences_per_group`: Number of sentences to group together before embedding.
- `percentile`: Percentile threshold for cosine distance. Distances above this percentile
are treated as break points.
- `min_length`: Minimum length of splits in characters. Splits below this length will be merged.
- `max_length`: Maximum length of splits in characters. Splits above this length will be recursively split.
- `language`: Language for sentence tokenization.
- `use_split_rules`: Whether to use additional split rules for sentence tokenization. Applies additional
split rules from SentenceSplitter to the sentence spans.
- `extend_abbreviations`: If True, the abbreviations used by NLTK's PunktTokenizer are extended by a list
of curated abbreviations. Currently supported languages are: en, de.
If False, the default abbreviations are used.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.warm_up"></a>

#### EmbeddingBasedDocumentSplitter.warm\_up

```python
def warm_up() -> None
```

Warm up the component by initializing the sentence splitter.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.run"></a>

#### EmbeddingBasedDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Split documents based on embedding similarity.

**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `RuntimeError`: If the component wasn't warmed up.
- `TypeError`: If the input is not a list of Documents.
- `ValueError`: If the document content is None or empty.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `split_id` to track the split number.
  - A metadata field `page_number` to track the original page number.
  - All other metadata copied from the original document.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.to_dict"></a>

#### EmbeddingBasedDocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Serialized dictionary representation of the component.

<a id="embedding_based_document_splitter.EmbeddingBasedDocumentSplitter.from_dict"></a>

#### EmbeddingBasedDocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "EmbeddingBasedDocumentSplitter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize and create the component.

**Returns**:

The deserialized component.

<a id="hierarchical_document_splitter"></a>

## Module hierarchical\_document\_splitter

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter"></a>

### HierarchicalDocumentSplitter

Splits documents into blocks of different sizes, building a hierarchical tree structure.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between
are connected such that the smaller blocks are children of the larger parent blocks.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.__init__"></a>

#### HierarchicalDocumentSplitter.\_\_init\_\_

```python
def __init__(block_sizes: set[int],
             split_overlap: int = 0,
             split_by: Literal["word", "sentence", "page",
                               "passage"] = "word")
```

Initialize HierarchicalDocumentSplitter.

**Arguments**:

- `block_sizes`: Set of block sizes to split the document into. The blocks are split in descending order.
- `split_overlap`: The number of overlapping units for each split.
- `split_by`: The unit for splitting your documents.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.run"></a>

#### HierarchicalDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Builds a hierarchical document structure for each document in a list of documents.

**Arguments**:

- `documents`: List of Documents to split into hierarchical blocks.

**Returns**:

List of HierarchicalDocument

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.build_hierarchy_from_doc"></a>

#### HierarchicalDocumentSplitter.build\_hierarchy\_from\_doc

```python
def build_hierarchy_from_doc(document: Document) -> list[Document]
```

Build a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes, represented
as HierarchicalDocument objects.

**Arguments**:

- `document`: Document to split into hierarchical blocks.

**Returns**:

List of HierarchicalDocument

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.to_dict"></a>

#### HierarchicalDocumentSplitter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Returns a dictionary representation of the component.

**Returns**:

Serialized dictionary representation of the component.

<a id="hierarchical_document_splitter.HierarchicalDocumentSplitter.from_dict"></a>

#### HierarchicalDocumentSplitter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HierarchicalDocumentSplitter"
```

Deserialize this component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize and create the component.

**Returns**:

The deserialized component.

<a id="markdown_header_splitter"></a>

## Module markdown\_header\_splitter

<a id="markdown_header_splitter.MarkdownHeaderSplitter"></a>

### MarkdownHeaderSplitter

Split documents at ATX-style Markdown headers (#), with optional secondary splitting.

This component processes text documents by:
- Splitting them into chunks at Markdown headers (e.g., '#', '##', etc.), preserving header hierarchy as metadata.
- Optionally applying a secondary split (by word, passage, period, or line) to each chunk
(using Haystack's DocumentSplitter).
- Preserving and propagating metadata such as parent headers, page numbers, and split IDs.

<a id="markdown_header_splitter.MarkdownHeaderSplitter.__init__"></a>

#### MarkdownHeaderSplitter.\_\_init\_\_

```python
def __init__(*,
             page_break_character: str = "\f",
             keep_headers: bool = True,
             secondary_split: Literal["word", "passage", "period", "line"]
             | None = None,
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0,
             skip_empty_documents: bool = True)
```

Initialize the MarkdownHeaderSplitter.

**Arguments**:

- `page_break_character`: Character used to identify page breaks. Defaults to form feed ("\f").
- `keep_headers`: If True, headers are kept in the content. If False, headers are moved to metadata.
Defaults to True.
- `secondary_split`: Optional secondary split condition after header splitting.
Options are None, "word", "passage", "period", "line". Defaults to None.
- `split_length`: The maximum number of units in each split when using secondary splitting. Defaults to 200.
- `split_overlap`: The number of overlapping units for each split when using secondary splitting.
Defaults to 0.
- `split_threshold`: The minimum number of units per split when using secondary splitting. Defaults to 0.
- `skip_empty_documents`: Choose whether to skip documents with empty content. Default is True.
Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
from non-textual documents.

<a id="markdown_header_splitter.MarkdownHeaderSplitter.warm_up"></a>

#### MarkdownHeaderSplitter.warm\_up

```python
def warm_up()
```

Warm up the MarkdownHeaderSplitter.
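
Before calling `run` (documented next), warm the component up. A minimal end-to-end sketch; the document content is invented for illustration, while the class name, parameters, and methods come from this page:

```python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

doc = Document(content="# Intro\nWelcome text.\n\n## Details\nMore text here.")

# keep_headers=True leaves the headers in each chunk's content
splitter = MarkdownHeaderSplitter(keep_headers=True)
splitter.warm_up()
result = splitter.run(documents=[doc])
# Expect one Document per header section, with header hierarchy tracked in meta
```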

<a id="markdown_header_splitter.MarkdownHeaderSplitter.run"></a>

#### MarkdownHeaderSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Run the Markdown header splitter with optional secondary splitting.

**Arguments**:

- `documents`: List of documents to split.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. Each document includes:
  - A metadata field `source_id` to track the original document.
  - A metadata field `page_number` to track the original page number.
  - A metadata field `split_id` to identify the split chunk index within its parent document.
  - All other metadata copied from the original document.

<a id="recursive_splitter"></a>

## Module recursive\_splitter

<a id="recursive_splitter.RecursiveDocumentSplitter"></a>

### RecursiveDocumentSplitter

Recursively chunk text into smaller chunks.

This component splits text by recursively applying a list of separators to it.
The separators are applied in the order they are provided, with the last separator
typically being the most specific one.

Each separator is applied to the text, and the component then checks each of the resulting chunks.
Chunks that fit within the `split_length` are kept; for chunks larger than the `split_length`,
the next separator in the list is applied to the remaining text.

This is repeated until all chunks are smaller than the `split_length` parameter.

**Example**:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science.
Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

<a id="recursive_splitter.RecursiveDocumentSplitter.__init__"></a>

#### RecursiveDocumentSplitter.\_\_init\_\_

```python
def __init__(*,
             split_length: int = 200,
             split_overlap: int = 0,
             split_unit: Literal["word", "char", "token"] = "word",
             separators: list[str] | None = None,
             sentence_splitter_params: dict[str, Any] | None = None)
```

Initializes a RecursiveDocumentSplitter.

**Arguments**:

- `split_length`: The maximum length of each chunk, by default in words, but it can be in characters or tokens.
See the `split_unit` parameter.
- `split_overlap`: The number of characters to overlap between consecutive chunks.
- `split_unit`: The unit of the `split_length` parameter. It can be either "word", "char", or "token".
If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base).
- `separators`: An optional list of separator strings to use for splitting the text. The string
separators will be treated as regular expressions unless the separator is "sentence", in which case the
text will be split into sentences using a custom sentence tokenizer based on NLTK.
See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter.
If no separators are provided, the default separators ["\n\n", "sentence", "\n", " "] are used.
- `sentence_splitter_params`: Optional parameters to pass to the sentence tokenizer.
See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.

**Raises**:

- `ValueError`: If the overlap is greater than or equal to the chunk size, if the overlap is negative, or
if any separator is not a string.

<a id="recursive_splitter.RecursiveDocumentSplitter.warm_up"></a>

#### RecursiveDocumentSplitter.warm\_up

```python
def warm_up() -> None
```

Warm up the sentence tokenizer and the tiktoken tokenizer if needed.

<a id="recursive_splitter.RecursiveDocumentSplitter.run"></a>

#### RecursiveDocumentSplitter.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Split a list of documents into documents with smaller chunks of text.

**Arguments**:

- `documents`: List of Documents to split.

**Returns**:

A dictionary containing a key "documents" with a list of Documents with smaller chunks of text corresponding
to the input documents.

<a id="text_cleaner"></a>

## Module text\_cleaner

<a id="text_cleaner.TextCleaner"></a>

### TextCleaner

Cleans text strings.

It can remove substrings matching a list of regular expressions, convert text to lowercase,
remove punctuation, and remove numbers.
Use it to clean up text data before evaluation.

### Usage example

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
```

<a id="text_cleaner.TextCleaner.__init__"></a>

#### TextCleaner.\_\_init\_\_

```python
def __init__(remove_regexps: list[str] | None = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)
```

Initializes the TextCleaner component.

**Arguments**:

- `remove_regexps`: A list of regex patterns to remove matching substrings from the text.
- `convert_to_lowercase`: If `True`, converts all characters to lowercase.
- `remove_punctuation`: If `True`, removes punctuation from the text.
- `remove_numbers`: If `True`, removes numerical digits from the text.

<a id="text_cleaner.TextCleaner.run"></a>

#### TextCleaner.run

```python
@component.output_types(texts=list[str])
def run(texts: list[str]) -> dict[str, Any]
```

Cleans up the given list of strings.

**Arguments**:

- `texts`: List of strings to clean.

**Returns**:

A dictionary with the following key:
- `texts`: the cleaned list of strings.
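
For instance, a short sketch combining `remove_regexps` with lowercasing; the input string and pattern are invented for illustration:

```python
from haystack.components.preprocessors import TextCleaner

# Remove runs of digits via regex, and normalize case
cleaner = TextCleaner(remove_regexps=[r"\d+"], convert_to_lowercase=True)
result = cleaner.run(texts=["Chapter 12 THE END"])
print(result["texts"])  # expected: ['chapter  the end']
```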