---
title: "PreProcessors"
id: preprocessors-api
description: "Preprocess your Documents and texts. Clean, split, and more."
slug: "/preprocessors-api"
---

## csv_document_cleaner

### CSVDocumentCleaner

A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents, allowing for the optional ignoring of a specified number of rows and columns before performing the cleaning operation. Additionally, it provides options to keep document IDs and control whether empty rows and columns should be removed.

#### __init__

```python
__init__(
    *,
    ignore_rows: int = 0,
    ignore_columns: int = 0,
    remove_empty_rows: bool = True,
    remove_empty_columns: bool = True,
    keep_id: bool = False
) -> None
```

Initializes the CSVDocumentCleaner component.

**Parameters:**

- **ignore_rows** (<code>int</code>) – Number of rows to ignore from the top of the CSV table before processing.
- **ignore_columns** (<code>int</code>) – Number of columns to ignore from the left of the CSV table before processing.
- **remove_empty_rows** (<code>bool</code>) – Whether to remove rows that are entirely empty.
- **remove_empty_columns** (<code>bool</code>) – Whether to remove columns that are entirely empty.
- **keep_id** (<code>bool</code>) – Whether to retain the original document ID in the output document.

Rows and columns ignored using these parameters are preserved in the final output, meaning they are not considered when removing empty rows and columns.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Cleans CSV documents by removing empty rows and columns while preserving specified ignored rows and columns.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents containing CSV-formatted content.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with a list of cleaned Documents under the key "documents".

Processing steps:

1. Reads each document's content as a CSV table.
1. Retains the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
1. Drops any rows and columns that are entirely empty (if enabled by `remove_empty_rows` and `remove_empty_columns`).
1. Reattaches the ignored rows and columns to maintain their original positions.
1. Returns the cleaned CSV content as a new `Document` object, with an option to retain the original document ID.
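
A minimal usage sketch (illustrative only; the exact CSV string formatting of the returned content may vary):

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentCleaner

# A CSV table with an entirely empty middle column and an entirely empty row
csv_content = "name,,age\nAlice,,30\n,,\nBob,,25\n"
doc = Document(content=csv_content)

cleaner = CSVDocumentCleaner(remove_empty_rows=True, remove_empty_columns=True)
result = cleaner.run(documents=[doc])

# The empty middle column and the blank row are dropped from the returned document
print(result["documents"][0].content)
```
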
## csv_document_splitter

### CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:

- identify consecutive empty rows or columns that exceed a given threshold and use them as delimiters to segment the document into smaller tables.
- split each row into a separate sub-table, represented as a Document.

#### __init__

```python
__init__(
    row_split_threshold: int | None = 2,
    column_split_threshold: int | None = 2,
    read_csv_kwargs: dict[str, Any] | None = None,
    split_mode: SplitMode = "threshold",
) -> None
```

Initializes the CSVDocumentSplitter component.

**Parameters:**

- **row_split_threshold** (<code>int | None</code>) – The minimum number of consecutive empty rows required to trigger a split.
- **column_split_threshold** (<code>int | None</code>) – The minimum number of consecutive empty columns required to trigger a split.
- **read_csv_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments to pass to `pandas.read_csv`.
  By default, the component reads the CSV with the following options:
  - `header=None`
  - `skip_blank_lines=False` to preserve blank lines
  - `dtype=object` to prevent type inference (e.g., converting numbers to floats).
  See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more information.
- **split_mode** (<code>SplitMode</code>) – If `threshold`, the component will split the document based on the number of consecutive empty rows or columns that exceed the `row_split_threshold` or `column_split_threshold`. If `row-wise`, the component will split each row into a separate sub-table.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Processes and splits a list of CSV documents into multiple sub-tables.

**Splitting Process:**

1. Applies a row-based split if `row_split_threshold` is provided.
1. Applies a column-based split if `column_split_threshold` is provided.
1. If both thresholds are specified, performs a recursive split by rows first, then columns, ensuring further fragmentation of any sub-tables that still contain empty sections.
1. Sorts the resulting sub-tables based on their original positions within the document.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of Documents containing CSV-formatted content.
  Each document is assumed to contain one or more tables separated by empty rows or columns.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with a key `"documents"`, mapping to a list of new `Document` objects, each representing an extracted sub-table from the original CSV.
  The metadata of each document includes:
  - A field `source_id` to track the original document.
  - A field `row_idx_start` to indicate the starting row index of the sub-table in the original table.
  - A field `col_idx_start` to indicate the starting column index of the sub-table in the original table.
  - A field `split_id` to indicate the order of the split in the original document.
  - All other metadata copied from the original document.
- If a document cannot be processed, it is returned unchanged.
- The `meta` field from the original document is preserved in the split documents.
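
A short usage sketch for the threshold mode (illustrative only; how many sub-tables you get depends on the thresholds and on where the empty regions sit):

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentSplitter

# Two small tables separated by two entirely empty rows
csv_content = "a,b\n1,2\n,\n,\nx,y\n3,4\n"
doc = Document(content=csv_content)

splitter = CSVDocumentSplitter(row_split_threshold=2, column_split_threshold=None)
result = splitter.run(documents=[doc])

# Each sub-table records where it started in the original table
for sub_table in result["documents"]:
    print(sub_table.meta.get("row_idx_start"), repr(sub_table.content))
```
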
## document_cleaner

### DocumentCleaner

Cleans the text in the documents.

It removes extra whitespaces, empty lines, specified substrings, regexes, page headers and footers (in this order).

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This is a document to clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings = ["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```

#### __init__

```python
__init__(
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_repeated_substrings: bool = False,
    keep_id: bool = False,
    remove_substrings: list[str] | None = None,
    remove_regex: str | None = None,
    unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
    ascii_only: bool = False,
    strip_whitespaces: bool = False,
    replace_regexes: dict[str, str] | None = None,
)
```

Initialize DocumentCleaner.

**Parameters:**

- **remove_empty_lines** (<code>bool</code>) – If `True`, removes empty lines.
- **remove_extra_whitespaces** (<code>bool</code>) – If `True`, removes extra whitespaces.
- **remove_repeated_substrings** (<code>bool</code>) – If `True`, removes repeated substrings (headers and footers) from pages.
  Pages must be separated by a form feed character "\\f", which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- **remove_substrings** (<code>list\[str\] | None</code>) – List of substrings to remove from the text.
- **remove_regex** (<code>str | None</code>) – Regex to match and replace substrings by "".
- **keep_id** (<code>bool</code>) – If `True`, keeps the IDs of the original documents.
- **unicode_normalization** (<code>Literal['NFC', 'NFKC', 'NFD', 'NFKD'] | None</code>) – Unicode normalization form to apply to the text.
  Note: This will run before any other steps.
- **ascii_only** (<code>bool</code>) – Whether to convert the text to ASCII only.
  Will remove accents from characters and replace them with ASCII characters.
  Other non-ASCII characters will be removed.
  Note: This will run before any pattern matching or removal.
- **strip_whitespaces** (<code>bool</code>) – If `True`, removes leading and trailing whitespace from the document content using Python's `str.strip()`. Unlike `remove_extra_whitespaces`, this only affects the beginning and end of the text, preserving internal whitespace (useful for markdown formatting).
- **replace_regexes** (<code>dict\[str, str\] | None</code>) – A dictionary mapping regex patterns to their replacement strings.
  For example, `{r'\n\n+': '\n'}` replaces multiple consecutive newlines with a single newline.
  This is applied after `remove_regex` and allows custom replacements instead of just removal.

#### run

```python
run(documents: list[Document])
```

Cleans up the documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to clean.

**Returns:**

- – A dictionary with the following key:
  - `documents`: List of cleaned Documents.

**Raises:**

- <code>TypeError</code> – if documents is not a list of Documents.
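
A further sketch showing the regex-based replacement and whitespace-stripping options (illustrative only; the replacement mapping follows the `replace_regexes` example given above, and the printed result depends on which cleaning steps are enabled):

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="  Heading\n\n\n\nBody text.  ")

# Collapse runs of newlines and strip leading/trailing whitespace,
# leaving internal formatting otherwise untouched.
cleaner = DocumentCleaner(
    remove_empty_lines=False,
    remove_extra_whitespaces=False,
    strip_whitespaces=True,
    replace_regexes={r"\n\n+": "\n"},
)
result = cleaner.run(documents=[doc])
print(repr(result["documents"][0].content))
```
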
## document_preprocessor

### DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component consists of a DocumentSplitter followed by a DocumentCleaner in a single pipeline.
It takes a list of documents as input and returns a processed list of documents.

Usage example:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```

#### __init__

```python
__init__(
    *,
    split_by: Literal[
        "function", "page", "passage", "period", "word", "line", "sentence"
    ] = "word",
    split_length: int = 250,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_repeated_substrings: bool = False,
    keep_id: bool = False,
    remove_substrings: list[str] | None = None,
    remove_regex: str | None = None,
    unicode_normalization: Literal["NFC", "NFKC", "NFD", "NFKD"] | None = None,
    ascii_only: bool = False
) -> None
```

Initialize a DocumentPreprocessor that first splits and then cleans documents.

**Parameters:**

**Splitter Parameters**:

- **split_by** (<code>Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence']</code>) – The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
- **split_length** (<code>int</code>) – The maximum number of units (words, lines, pages, and so on) in each split.
- **split_overlap** (<code>int</code>) – The number of overlapping units between consecutive splits.
- **split_threshold** (<code>int</code>) – The minimum number of units per split. If a split is smaller than this, it's merged with the previous split.
- **splitting_function** (<code>Callable\[\[str\], list\[str\]\] | None</code>) – A custom function for splitting if `split_by="function"`.
- **respect_sentence_boundary** (<code>bool</code>) – If `True`, splits by words but tries not to break inside a sentence.
- **language** (<code>Language</code>) – Language used by the sentence tokenizer if `split_by="sentence"` or `respect_sentence_boundary=True`.
- **use_split_rules** (<code>bool</code>) – Whether to apply additional splitting heuristics for the sentence splitter.
- **extend_abbreviations** (<code>bool</code>) – Whether to extend the sentence splitter with curated abbreviations for certain languages.

**Cleaner Parameters**:

- **remove_empty_lines** (<code>bool</code>) – If `True`, removes empty lines.
- **remove_extra_whitespaces** (<code>bool</code>) – If `True`, removes extra whitespaces.
- **remove_repeated_substrings** (<code>bool</code>) – If `True`, removes repeated substrings like headers/footers across pages.
- **keep_id** (<code>bool</code>) – If `True`, keeps the original document IDs.
- **remove_substrings** (<code>list\[str\] | None</code>) – A list of strings to remove from the document content.
- **remove_regex** (<code>str | None</code>) – A regex pattern whose matches will be removed from the document content.
- **unicode_normalization** (<code>Literal['NFC', 'NFKC', 'NFD', 'NFKD'] | None</code>) – Unicode normalization form to apply to the text, for example `"NFC"`.
- **ascii_only** (<code>bool</code>) – If `True`, converts text to ASCII only.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize SuperComponent to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DocumentPreprocessor
```

Deserializes the SuperComponent from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>DocumentPreprocessor</code> – Deserialized SuperComponent.

## document_splitter

### DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations and prevents exceeding language model context limits.

The DocumentSplitter is compatible with the following DocumentStores:

- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) (limited support: overlapping information is not stored)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) (limited support: overlapping information is not stored)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
```

#### __init__

```python
__init__(
    split_by: Literal[
        "function", "page", "passage", "period", "word", "line", "sentence"
    ] = "word",
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
    splitting_function: Callable[[str], list[str]] | None = None,
    respect_sentence_boundary: bool = False,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True,
    *,
    skip_empty_documents: bool = True
)
```

Initialize DocumentSplitter.

**Parameters:**

- **split_by** (<code>Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence']</code>) – The unit for splitting your documents. Choose from:
  - `word` for splitting by spaces (" ")
  - `period` for splitting by periods (".")
  - `page` for splitting by form feed ("\\f")
  - `passage` for splitting by double line breaks ("\\n\\n")
  - `line` for splitting each line ("\\n")
  - `sentence` for splitting by NLTK sentence tokenizer
- **split_length** (<code>int</code>) – The maximum number of units in each split.
- **split_overlap** (<code>int</code>) – The number of overlapping units for each split.
- **split_threshold** (<code>int</code>) – The minimum number of units per split. If a split has fewer units than the threshold, it's attached to the previous split.
- **splitting_function** (<code>Callable\[\[str\], list\[str\]\] | None</code>) – Necessary when `split_by` is set to "function".
  This is a function which must accept a single `str` as input and return a `list` of `str` as output, representing the chunks after splitting.
- **respect_sentence_boundary** (<code>bool</code>) – Choose whether to respect sentence boundaries when splitting by "word".
  If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
- **language** (<code>Language</code>) – Choose the language for the NLTK tokenizer. The default is English ("en").
- **use_split_rules** (<code>bool</code>) – Choose whether to use additional split rules when splitting by `sentence`.
- **extend_abbreviations** (<code>bool</code>) – Choose whether to extend NLTK's PunktTokenizer abbreviations with a list of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
- **skip_empty_documents** (<code>bool</code>) – Choose whether to skip documents with empty content. Default is True.
  Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents.

#### warm_up

```python
warm_up()
```

Warm up the DocumentSplitter by loading the sentence tokenizer.

#### run

```python
run(documents: list[Document])
```

Split documents into smaller parts.

Splits documents by the unit expressed in `split_by`, with a length of `split_length` and an overlap of `split_overlap`.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The documents to split.

**Returns:**

- – A dictionary with the following key:
  - `documents`: List of documents with the split texts. Each document includes:
    - A metadata field `source_id` to track the original document.
    - A metadata field `page_number` to track the original page number.
    - All other metadata copied from the original document.

**Raises:**

- <code>TypeError</code> – if the input is not a list of Documents.
- <code>ValueError</code> – if the content of a document is None.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> DocumentSplitter
```

Deserializes the component from a dictionary.

## embedding_based_document_splitter

### EmbeddingBasedDocumentSplitter

Splits documents based on embedding similarity using cosine distances between sequential sentence groups.

This component first splits text into sentences, optionally groups them, calculates embeddings for each group, and then uses cosine distance between sequential embeddings to determine split points. Any distance above the specified percentile is treated as a break point. The component also tracks page numbers based on form feed characters ("\\f") in the original document.

This component is inspired by [5 Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt.
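
The break-point rule can be sketched in a few lines (illustrative only, not the component's implementation; embeddings are shown as plain NumPy vectors, and `percentile` follows the fraction-style default of the `percentile` parameter below):

```python
import numpy as np

def break_points(group_embeddings: np.ndarray, percentile: float = 0.95) -> list[int]:
    """Return indices of sentence groups after which a split would occur (sketch)."""
    # Cosine similarity between each pair of consecutive group embeddings
    normed = group_embeddings / np.linalg.norm(group_embeddings, axis=1, keepdims=True)
    similarities = np.sum(normed[:-1] * normed[1:], axis=1)
    distances = 1.0 - similarities
    # Any distance above the given percentile of all distances is a break point
    threshold = np.quantile(distances, percentile)
    return [i for i, d in enumerate(distances) if d > threshold]
```
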
### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,  # Group 2 sentences before calculating embeddings
    percentile=0.95,        # Split when cosine distance exceeds the 95th percentile
    min_length=50,          # Merge splits shorter than 50 characters
    max_length=1000,        # Further split chunks longer than 1000 characters
)
result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk
# Each split document includes metadata: source_id, split_id, and page_number
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")
```

#### __init__

```python
__init__(
    *,
    document_embedder: DocumentEmbedder,
    sentences_per_group: int = 3,
    percentile: float = 0.95,
    min_length: int = 50,
    max_length: int = 1000,
    language: Language = "en",
    use_split_rules: bool = True,
    extend_abbreviations: bool = True
)
```

Initialize EmbeddingBasedDocumentSplitter.

**Parameters:**

- **document_embedder** (<code>DocumentEmbedder</code>) – The DocumentEmbedder to use for calculating embeddings.
- **sentences_per_group** (<code>int</code>) – Number of sentences to group together before embedding.
- **percentile** (<code>float</code>) – Percentile threshold for cosine distance. Distances above this percentile are treated as break points.
- **min_length** (<code>int</code>) – Minimum length of splits in characters. Splits below this length will be merged.
- **max_length** (<code>int</code>) – Maximum length of splits in characters. Splits above this length will be recursively split.
- **language** (<code>Language</code>) – Language for sentence tokenization.
- **use_split_rules** (<code>bool</code>) – Whether to use additional split rules for sentence tokenization. Applies additional split rules from SentenceSplitter to the sentence spans.
- **extend_abbreviations** (<code>bool</code>) – If True, the abbreviations used by NLTK's PunktTokenizer are extended by a list of curated abbreviations. Currently supported languages are: en, de.
  If False, the default abbreviations are used.

#### warm_up

```python
warm_up() -> None
```

Warm up the component by initializing the sentence splitter.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Split documents based on embedding similarity.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
  - `documents`: List of documents with the split texts. Each document includes:
    - A metadata field `source_id` to track the original document.
    - A metadata field `split_id` to track the split number.
    - A metadata field `page_number` to track the original page number.
    - All other metadata copied from the original document.

**Raises:**

- <code>RuntimeError</code> – If the component wasn't warmed up.
- <code>TypeError</code> – If the input is not a list of Documents.
- <code>ValueError</code> – If the document content is None or empty.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Serialized dictionary representation of the component.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> EmbeddingBasedDocumentSplitter
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize and create the component.

**Returns:**

- <code>EmbeddingBasedDocumentSplitter</code> – The deserialized component.

## hierarchical_document_splitter

### HierarchicalDocumentSplitter

Splits documents into blocks of different sizes, building a hierarchical tree structure of blocks.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between are connected such that the smaller blocks are children of the larger parent blocks.

### Usage example

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```

#### __init__

```python
__init__(
    block_sizes: set[int],
    split_overlap: int = 0,
    split_by: Literal["word", "sentence", "page", "passage"] = "word",
)
```

Initialize HierarchicalDocumentSplitter.

**Parameters:**

- **block_sizes** (<code>set\[int\]</code>) – Set of block sizes to split the document into. The blocks are split in descending order.
- **split_overlap** (<code>int</code>) – The number of overlapping units for each split.
- **split_by** (<code>Literal['word', 'sentence', 'page', 'passage']</code>) – The unit for splitting your documents.

#### run

```python
run(documents: list[Document])
```

Builds a hierarchical document structure for each document in a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to split into hierarchical blocks.

**Returns:**

- – List of HierarchicalDocument

#### build_hierarchy_from_doc

```python
build_hierarchy_from_doc(document: Document) -> list[Document]
```

Build a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes represented as HierarchicalDocument objects.

**Parameters:**

- **document** (<code>Document</code>) – Document to split into hierarchical blocks.

**Returns:**

- <code>list\[Document\]</code> – List of HierarchicalDocument

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Returns a dictionary representation of the component.

**Returns:**

- <code>dict\[str, Any\]</code> – Serialized dictionary representation of the component.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HierarchicalDocumentSplitter
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize and create the component.

**Returns:**

- <code>HierarchicalDocumentSplitter</code> – The deserialized component.

## markdown_header_splitter

### MarkdownHeaderSplitter

Split documents at ATX-style Markdown headers (#), with optional secondary splitting.

This component processes text documents by:

- Splitting them into chunks at Markdown headers (e.g., '#', '##', etc.), preserving header hierarchy as metadata.
- Optionally applying a secondary split (by word, passage, period, or line) to each chunk (using Haystack's DocumentSplitter).
- Preserving and propagating metadata such as parent headers, page numbers, and split IDs.
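
A minimal usage sketch (illustrative only; the exact chunk boundaries and metadata values depend on the header structure of the input):

```python
from haystack import Document
from haystack.components.preprocessors import MarkdownHeaderSplitter

doc = Document(
    content="# Title\nIntro paragraph.\n\n## Section 1\nFirst section text.\n\n## Section 2\nSecond section text."
)

splitter = MarkdownHeaderSplitter(keep_headers=True)
splitter.warm_up()
result = splitter.run(documents=[doc])

# Each chunk carries source_id, page_number, and split_id metadata
for chunk in result["documents"]:
    print(chunk.meta.get("split_id"), repr(chunk.content))
```
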
#### __init__

```python
__init__(
    *,
    page_break_character: str = "\x0c",
    keep_headers: bool = True,
    secondary_split: Literal["word", "passage", "period", "line"] | None = None,
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
    skip_empty_documents: bool = True
)
```

Initialize the MarkdownHeaderSplitter.

**Parameters:**

- **page_break_character** (<code>str</code>) – Character used to identify page breaks. Defaults to form feed ("\\x0c").
- **keep_headers** (<code>bool</code>) – If True, headers are kept in the content. If False, headers are moved to metadata.
  Defaults to True.
- **secondary_split** (<code>Literal['word', 'passage', 'period', 'line'] | None</code>) – Optional secondary split condition after header splitting.
  Options are None, "word", "passage", "period", "line". Defaults to None.
- **split_length** (<code>int</code>) – The maximum number of units in each split when using secondary splitting. Defaults to 200.
- **split_overlap** (<code>int</code>) – The number of overlapping units for each split when using secondary splitting. Defaults to 0.
- **split_threshold** (<code>int</code>) – The minimum number of units per split when using secondary splitting. Defaults to 0.
- **skip_empty_documents** (<code>bool</code>) – Choose whether to skip documents with empty content. Default is True.
  Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents.

#### warm_up

```python
warm_up()
```

Warm up the MarkdownHeaderSplitter.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Run the markdown header splitter with optional secondary splitting.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
  - `documents`: List of documents with the split texts. Each document includes:
    - A metadata field `source_id` to track the original document.
    - A metadata field `page_number` to track the original page number.
    - A metadata field `split_id` to identify the split chunk index within its parent document.
    - All other metadata copied from the original document.

**Raises:**

- <code>ValueError</code> – If a document has `None` content.
- <code>TypeError</code> – If a document's content is not a string.

## recursive_splitter

### RecursiveDocumentSplitter

Recursively chunk text into smaller chunks.

This component splits text into smaller chunks by recursively applying a list of separators to the text. The separators are applied in the order they are provided, with the last separator expected to be the most specific one.

Each separator is applied to the text; chunks that fit within the `split_length` are kept, and the next separator in the list is applied to any chunks that are still larger than `split_length`. This continues until all chunks are smaller than the `split_length` parameter.

Example:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science.
Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```

#### __init__

```python
__init__(
    *,
    split_length: int = 200,
    split_overlap: int = 0,
    split_unit: Literal["word", "char", "token"] = "word",
    separators: list[str] | None = None,
    sentence_splitter_params: dict[str, Any] | None = None
)
```

Initializes a RecursiveDocumentSplitter.

**Parameters:**

- **split_length** (<code>int</code>) – The maximum length of each chunk, by default in words, but it can also be in characters or tokens.
  See the `split_unit` parameter.
- **split_overlap** (<code>int</code>) – The number of characters to overlap between consecutive chunks.
- **split_unit** (<code>Literal['word', 'char', 'token']</code>) – The unit of the `split_length` parameter. It can be either "word", "char", or "token".
  If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base).
- **separators** (<code>list\[str\] | None</code>) – An optional list of separator strings to use for splitting the text. The string separators will be treated as regular expressions unless the separator is "sentence", in which case the text will be split into sentences using a custom sentence tokenizer based on NLTK.
  See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter.
  If no separators are provided, the default separators ["\\n\\n", "sentence", "\\n", " "] are used.
- **sentence_splitter_params** (<code>dict\[str, Any\] | None</code>) – Optional parameters to pass to the sentence tokenizer.
  See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.

**Raises:**

- <code>ValueError</code> – If the overlap is greater than or equal to the chunk size, if the overlap is negative, or if any separator is not a string.

#### warm_up

```python
warm_up() -> None
```

Warm up the sentence tokenizer and tiktoken tokenizer if needed.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Split a list of documents into documents with smaller chunks of text.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to split.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary containing the key "documents" with a list of Documents holding the smaller text chunks produced from the input documents.
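
A further sketch showing the "sentence" separator, which relies on the NLTK-based sentence tokenizer, so `warm_up()` should be called before `run()` (illustrative only; the exact chunk boundaries depend on the tokenizer):

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

splitter = RecursiveDocumentSplitter(
    split_length=30,
    split_unit="word",
    separators=["\n\n", "sentence", " "],
)
splitter.warm_up()  # loads the sentence tokenizer used by the "sentence" separator

doc = Document(
    content="First sentence about one topic. Second sentence about the same topic. "
    "A third, longer sentence that moves on to something else entirely."
)
result = splitter.run(documents=[doc])
print([d.content for d in result["documents"]])
```
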
## text_cleaner

### TextCleaner

Cleans text strings.

It can remove substrings matching a list of regular expressions, convert text to lowercase, remove punctuation, and remove numbers.
Use it to clean up text data before evaluation.

### Usage example

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
```

#### __init__

```python
__init__(
    remove_regexps: list[str] | None = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
)
```

Initializes the TextCleaner component.

**Parameters:**

- **remove_regexps** (<code>list\[str\] | None</code>) – A list of regex patterns to remove matching substrings from the text.
- **convert_to_lowercase** (<code>bool</code>) – If `True`, converts all characters to lowercase.
- **remove_punctuation** (<code>bool</code>) – If `True`, removes punctuation from the text.
- **remove_numbers** (<code>bool</code>) – If `True`, removes numerical digits from the text.

#### run

```python
run(texts: list[str]) -> dict[str, Any]
```

Cleans up the given list of strings.

**Parameters:**

- **texts** (<code>list\[str\]</code>) – List of strings to clean.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following key:
  - `texts`: the cleaned list of strings.