retrievers_api.md
---
title: "Retrievers"
id: retrievers-api
description: "Sweeps through a Document Store and returns a set of candidate Documents that are relevant to the query."
slug: "/retrievers-api"
---

<a id="auto_merging_retriever"></a>

## Module auto\_merging\_retriever

<a id="auto_merging_retriever.AutoMergingRetriever"></a>

### AutoMergingRetriever

A retriever that returns the parent documents of matched leaf documents, based on a threshold setting.

The AutoMergingRetriever assumes you have a hierarchical tree structure of documents, where the leaf nodes
are indexed in a document store. See the HierarchicalDocumentSplitter for more information on how to create
such a structure. During retrieval, if the number of matched leaf documents below the same parent is
higher than a defined threshold, the retriever will return the parent document instead of the individual leaf
documents.

The rationale is that when a paragraph is split into multiple chunks represented as leaf documents and a query
matches several of those chunks, the whole paragraph might be more informative than the individual
chunks alone.
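
The merging rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Haystack's implementation: the helper `merge_matches` and the `parent_of` / `children_of` lookup tables are hypothetical stand-ins for the hierarchy metadata, and the comparison is strict because the text says "higher than" the threshold.

```python
from collections import defaultdict


def merge_matches(matched_ids, parent_of, children_of, threshold=0.5):
    """Replace matched nodes by their parent when enough siblings matched.

    matched_ids: ids of the matched leaf nodes (hypothetical input format)
    parent_of:   maps a node id to its parent id (the root has no entry)
    children_of: maps a parent id to the list of all its child ids
    """
    current = list(dict.fromkeys(matched_ids))  # de-duplicate, keep order
    while True:
        # Group the current nodes by parent; nodes without a parent are kept.
        by_parent = defaultdict(list)
        kept = []
        for node in current:
            parent = parent_of.get(node)
            if parent is None:
                kept.append(node)
            else:
                by_parent[parent].append(node)

        merged_any = False
        for parent, matched_children in by_parent.items():
            ratio = len(matched_children) / len(children_of[parent])
            if ratio > threshold:
                kept.append(parent)  # enough siblings matched: use the parent
                merged_any = True
            else:
                kept.extend(matched_children)  # too few: keep the nodes as they are
        current = list(dict.fromkeys(kept))
        if not merged_any:
            return current


# Toy hierarchy: root -> p1, p2; each of p1 and p2 has three leaves.
parent_of = {"a": "p1", "b": "p1", "c": "p1", "d": "p2", "e": "p2", "f": "p2", "p1": "root", "p2": "root"}
children_of = {"p1": ["a", "b", "c"], "p2": ["d", "e", "f"], "root": ["p1", "p2"]}
print(merge_matches(["a", "b", "d"], parent_of, children_of))  # ['p1', 'd']
```

In the toy run, two of three leaves under `p1` matched (2/3 > 0.5), so `p1` replaces them, while the single match under `p2` stays a leaf. The result can therefore mix hierarchy levels.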

Currently the AutoMergingRetriever can only be used by the following DocumentStores:
- [AstraDB](https://haystack.deepset.ai/integrations/astradb)
- [ElasticSearch](https://haystack.deepset.ai/docs/latest/documentstore/elasticsearch)
- [OpenSearch](https://haystack.deepset.ai/docs/latest/documentstore/opensearch)
- [PGVector](https://haystack.deepset.ai/docs/latest/documentstore/pgvector)
- [Qdrant](https://haystack.deepset.ai/docs/latest/documentstore/qdrant)

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# create a hierarchical document structure with 3 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]

# store level-1 parent documents and initialize the retriever
doc_store_parents = InMemoryDocumentStore()
for doc in docs:
    if doc.meta["__children_ids"] and doc.meta["__level"] in [0, 1]:  # store the root document and level 1 documents
        doc_store_parents.write_documents([doc])

retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)

# assume we retrieved 2 leaf docs from the same parent, the parent document should be returned,
# since it has 3 children and the threshold=0.5, and we retrieved 2 children (2/3 ≈ 0.67 > 0.5)
leaf_docs = [doc for doc in docs if not doc.meta["__children_ids"]]
retrieved_docs = retriever.run(leaf_docs[4:6])
print(retrieved_docs["documents"])
# [Document(id=538..),
# content: 'warm glow over the trees. Birds began to sing.',
# meta: {'block_size': 10, 'parent_id': '835..', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
# 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]}
```

<a id="auto_merging_retriever.AutoMergingRetriever.__init__"></a>

#### AutoMergingRetriever.\_\_init\_\_

```python
def __init__(document_store: DocumentStore, threshold: float = 0.5)
```

Initialize the AutoMergingRetriever.

**Arguments**:

- `document_store`: DocumentStore from which to retrieve the parent documents
- `threshold`: Threshold to decide whether the parent instead of the individual documents is returned

<a id="auto_merging_retriever.AutoMergingRetriever.to_dict"></a>

#### AutoMergingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="auto_merging_retriever.AutoMergingRetriever.from_dict"></a>

#### AutoMergingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AutoMergingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="auto_merging_retriever.AutoMergingRetriever.run"></a>

#### AutoMergingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Run the AutoMergingRetriever.

Recursively groups documents by their parents and merges them if they meet the threshold,
continuing up the hierarchy until no more merges are possible.

**Arguments**:

- `documents`: List of leaf documents that were matched by a retriever

**Returns**:

List of documents (could be a mix of different hierarchy levels)

<a id="auto_merging_retriever.AutoMergingRetriever.run_async"></a>

#### AutoMergingRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(documents: list[Document])
```

Asynchronously run the AutoMergingRetriever.

Recursively groups documents by their parents and merges them if they meet the threshold,
continuing up the hierarchy until no more merges are possible.

**Arguments**:

- `documents`: List of leaf documents that were matched by a retriever

**Returns**:

List of documents (could be a mix of different hierarchy levels)

<a id="filter_retriever"></a>

## Module filter\_retriever

<a id="filter_retriever.FilterRetriever"></a>

### FilterRetriever

Retrieves documents that match the provided filters.

### Usage example

```python
from haystack import Document
from haystack.components.retrievers import FilterRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language", meta={"lang": "en"}),
    Document(content="python ist eine beliebte Programmiersprache", meta={"lang": "de"}),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = FilterRetriever(doc_store, filters={"field": "lang", "operator": "==", "value": "en"})

# if passed in the run method, filters override those provided at initialization
result = retriever.run(filters={"field": "lang", "operator": "==", "value": "de"})

print(result["documents"])
```

<a id="filter_retriever.FilterRetriever.__init__"></a>

#### FilterRetriever.\_\_init\_\_

```python
def __init__(document_store: DocumentStore,
             filters: dict[str, Any] | None = None)
```

Create the FilterRetriever component.

**Arguments**:

- `document_store`: An instance of a Document Store to use with the Retriever.
- `filters`: A dictionary with filters to narrow down the search space.

<a id="filter_retriever.FilterRetriever.to_dict"></a>

#### FilterRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="filter_retriever.FilterRetriever.from_dict"></a>

#### FilterRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "FilterRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="filter_retriever.FilterRetriever.run"></a>

#### FilterRetriever.run

```python
@component.output_types(documents=list[Document])
def run(filters: dict[str, Any] | None = None)
```

Run the FilterRetriever on the given input data.

**Arguments**:

- `filters`: A dictionary with filters to narrow down the search space.
If not specified, the FilterRetriever uses the values provided at initialization.

**Returns**:

A list of retrieved documents.

<a id="filter_retriever.FilterRetriever.run_async"></a>

#### FilterRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(filters: dict[str, Any] | None = None)
```

Asynchronously run the FilterRetriever on the given input data.

**Arguments**:

- `filters`: A dictionary with filters to narrow down the search space.
If not specified, the FilterRetriever uses the values provided at initialization.

**Returns**:

A list of retrieved documents.

<a id="in_memory/bm25_retriever"></a>

## Module in\_memory/bm25\_retriever

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever"></a>

### InMemoryBM25Retriever

Retrieves documents that are most similar to the query using a keyword-based algorithm.

Use this retriever with the InMemoryDocumentStore.
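
To give a feel for what keyword-based scoring means here, the following is a minimal sketch of Okapi BM25 in plain Python. It is not the InMemoryDocumentStore's implementation: the function `bm25_scores`, the lowercase whitespace tokenization, and the `k1` and `b` values are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Python is a popular programming language",
    "python ist eine beliebte Programmiersprache",
]
scores = bm25_scores("Programmiersprache", docs)
print(scores)
```

Only the second document contains the query term, so it is the only one with a non-zero score. Matching is purely lexical: a query for "programming language" would not find the German document.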

### Usage example

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store)

result = retriever.run(query="Programmiersprache")

print(result["documents"])
```

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.__init__"></a>

#### InMemoryBM25Retriever.\_\_init\_\_

```python
def __init__(document_store: InMemoryDocumentStore,
             filters: dict[str, Any] | None = None,
             top_k: int = 10,
             scale_score: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)
```

Create the InMemoryBM25Retriever component.

**Arguments**:

- `document_store`: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
- `filters`: A dictionary with filters to narrow down the retriever's search space in the document store.
- `top_k`: The maximum number of documents to retrieve.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `filter_policy`: The filter policy to apply during retrieval.
Filter policy determines how filters are applied when retrieving documents. You can choose:
- `REPLACE` (default): Overrides the initialization filters with the filters specified at runtime.
Use this policy to dynamically change filtering for specific queries.
- `MERGE`: Combines runtime filters with initialization filters to narrow down the search.
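
The two filter policies listed above can be sketched as a small resolution function. This is an illustration only: `Policy` is a hypothetical stand-in for `FilterPolicy`, `resolve_filters` is not a Haystack function, and combining the two filter sets with a logical `AND` is an assumption based on the phrase "narrow down the search".

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical stand-in for FilterPolicy."""

    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(policy, init_filters=None, runtime_filters=None):
    """Pick the filters a retriever would apply for one run."""
    if runtime_filters is None:
        return init_filters  # nothing passed at runtime: use the init filters
    if policy is Policy.REPLACE or init_filters is None:
        return runtime_filters  # runtime filters override the init filters
    # MERGE (assumed): a document must satisfy both filter sets.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init_filters = {"field": "meta.lang", "operator": "==", "value": "en"}
runtime_filters = {"field": "meta.year", "operator": ">=", "value": 2020}

replaced = resolve_filters(Policy.REPLACE, init_filters, runtime_filters)
merged = resolve_filters(Policy.MERGE, init_filters, runtime_filters)
print(replaced)
print(merged)
```

With `REPLACE`, the language filter from initialization is dropped for this run. With `MERGE`, both conditions apply, so the result set can only shrink.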

**Raises**:

- `ValueError`: If the specified `top_k` is not > 0.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.to_dict"></a>

#### InMemoryBM25Retriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.from_dict"></a>

#### InMemoryBM25Retriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "InMemoryBM25Retriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.run"></a>

#### InMemoryBM25Retriever.run

```python
@component.output_types(documents=list[Document])
def run(query: str,
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None) -> dict[str, list[Document]]
```

Run the InMemoryBM25Retriever on the given input data.

**Arguments**:

- `query`: The query string for the Retriever.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.run_async"></a>

#### InMemoryBM25Retriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(
        query: str,
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None) -> dict[str, list[Document]]
```

Run the InMemoryBM25Retriever on the given input data.

**Arguments**:

- `query`: The query string for the Retriever.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="in_memory/embedding_retriever"></a>

## Module in\_memory/embedding\_retriever

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever"></a>

### InMemoryEmbeddingRetriever

Retrieves documents that are most semantically similar to the query.

Use this retriever with the InMemoryDocumentStore.

When using this retriever, make sure it has query and document embeddings available.
In indexing pipelines, use a DocumentEmbedder to embed documents.
In query pipelines, use a TextEmbedder to embed queries and send them to the retriever.
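
Conceptually, embedding retrieval compares the query vector with each stored document vector and keeps the closest ones. The sketch below shows this with cosine similarity in plain Python. It is an illustration under assumptions: the helpers `cosine_similarity` and `top_k_by_embedding` are hypothetical, the three-dimensional vectors are hand-written rather than produced by an embedder, and the similarity function the document store actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_embedding(query_embedding, doc_embeddings, top_k=10):
    """Return (doc_id, score) pairs for the top_k most similar documents."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hand-written toy vectors standing in for document and query embeddings.
doc_embeddings = {"doc_en": [1.0, 0.0, 0.0], "doc_de": [0.0, 1.0, 0.0]}
query_embedding = [0.1, 0.9, 0.0]

hits = top_k_by_embedding(query_embedding, doc_embeddings, top_k=1)
print(hits)
```

The query and the documents must be embedded with compatible models, otherwise the vectors live in different spaces and the similarity scores are meaningless.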

### Usage example
```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]
doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(docs)["documents"]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs_with_embeddings)
retriever = InMemoryEmbeddingRetriever(doc_store)

query = "Programmiersprache"
text_embedder = SentenceTransformersTextEmbedder()
text_embedder.warm_up()
query_embedding = text_embedder.run(query)["embedding"]

result = retriever.run(query_embedding=query_embedding)

print(result["documents"])
```

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.__init__"></a>

#### InMemoryEmbeddingRetriever.\_\_init\_\_

```python
def __init__(document_store: InMemoryDocumentStore,
             filters: dict[str, Any] | None = None,
             top_k: int = 10,
             scale_score: bool = False,
             return_embedding: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)
```

Create the InMemoryEmbeddingRetriever component.

**Arguments**:

- `document_store`: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
- `filters`: A dictionary with filters to narrow down the retriever's search space in the document store.
- `top_k`: The maximum number of documents to retrieve.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.
- `filter_policy`: The filter policy to apply during retrieval.
Filter policy determines how filters are applied when retrieving documents. You can choose:
- `REPLACE` (default): Overrides the initialization filters with the filters specified at runtime.
Use this policy to dynamically change filtering for specific queries.
- `MERGE`: Combines runtime filters with initialization filters to narrow down the search.

**Raises**:

- `ValueError`: If the specified `top_k` is not > 0.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.to_dict"></a>

#### InMemoryEmbeddingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.from_dict"></a>

#### InMemoryEmbeddingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "InMemoryEmbeddingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.run"></a>

#### InMemoryEmbeddingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(query_embedding: list[float],
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None,
        return_embedding: bool | None = None) -> dict[str, list[Document]]
```

Run the InMemoryEmbeddingRetriever on the given input data.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.run_async"></a>

#### InMemoryEmbeddingRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(
        query_embedding: list[float],
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None,
        return_embedding: bool | None = None) -> dict[str, list[Document]]
```

Run the InMemoryEmbeddingRetriever on the given input data.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="multi_query_embedding_retriever"></a>

## Module multi\_query\_embedding\_retriever

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever"></a>

### MultiQueryEmbeddingRetriever

A component that retrieves documents using multiple queries in parallel with an embedding-based retriever.

This component takes a list of text queries, converts them to embeddings using a query embedder,
and then uses an embedding-based retriever to find relevant documents for each query in parallel.
The results are combined and sorted by relevance score.
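
The flow described above (embed each query, retrieve per query in parallel, combine, sort) can be sketched with the standard library's thread pool. This is an illustration, not the component's code: `multi_query_retrieve`, `embed`, and `retrieve` are hypothetical stand-ins, and keeping only the best score per document when several queries hit the same document is an assumption about how results are combined.

```python
from concurrent.futures import ThreadPoolExecutor


def multi_query_retrieve(queries, embed, retrieve, max_workers=3):
    """Run embed + retrieve per query in a thread pool, then merge the hits."""

    def run_one(query):
        return retrieve(embed(query))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_query_hits = list(pool.map(run_one, queries))

    # Assumed combination rule: keep the best score per document id.
    best = {}
    for hits in per_query_hits:
        for doc_id, score in hits:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Sort the combined hits by score, highest first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins: the "embedding" is just the lowercased query, and the
# "retriever" looks it up in a fixed table of (doc_id, score) hits.
index = {
    "geothermal": [("doc_geo", 0.85)],
    "turbines": [("doc_wind", 0.31), ("doc_geo", 0.10)],
}


def embed(query):
    return query.lower()


def retrieve(key):
    return index.get(key, [])


results = multi_query_retrieve(["Geothermal", "turbines"], embed, retrieve)
print(results)  # [('doc_geo', 0.85), ('doc_wind', 0.31)]
```

Threads suit this workload when each retrieval mostly waits on I/O or on native code that releases the interpreter lock; `max_workers` caps how many queries run at once.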

### Usage example

```python
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers import MultiQueryEmbeddingRetriever

documents = [
    Document(content="Renewable energy is energy that is collected from renewable resources."),
    Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
    Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
    Document(content="Geothermal energy is heat that comes from the sub-surface of the earth."),
    Document(content="Biomass energy is produced from organic materials, such as plant and animal waste."),
    Document(content="Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources."),
]

# Populate the document store
doc_store = InMemoryDocumentStore()
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()
doc_writer = DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.SKIP)
documents = doc_embedder.run(documents)["documents"]
doc_writer.run(documents=documents)

# Run the multi-query retriever
in_memory_retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=1)
query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

multi_query_retriever = MultiQueryEmbeddingRetriever(
    retriever=in_memory_retriever,
    query_embedder=query_embedder,
    max_workers=3
)

queries = ["Geothermal energy", "natural gas", "turbines"]
result = multi_query_retriever.run(queries=queries)
for doc in result["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
# >> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 0.8509603046266574
# >> Content: Renewable energy is energy that is collected from renewable resources., Score: 0.42763211298893034
# >> Content: Solar energy is a type of green energy that is harnessed from the sun., Score: 0.40077417016494354
# >> Content: Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources., Score: 0.3774863680
# >> Content: Wind energy is another type of green energy that is generated by wind turbines., Score: 0.30914239725622
# >> Content: Biomass energy is produced from organic materials, such as plant and animal waste., Score: 0.25173074243
```

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.__init__"></a>

#### MultiQueryEmbeddingRetriever.\_\_init\_\_

```python
def __init__(*,
             retriever: EmbeddingRetriever,
             query_embedder: TextEmbedder,
             max_workers: int = 3) -> None
```

Initialize MultiQueryEmbeddingRetriever.

**Arguments**:

- `retriever`: The embedding-based retriever to use for document retrieval.
- `query_embedder`: The query embedder to convert text queries to embeddings.
- `max_workers`: Maximum number of worker threads for parallel processing.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.warm_up"></a>

#### MultiQueryEmbeddingRetriever.warm\_up

```python
def warm_up() -> None
```

Warm up the query embedder and the retriever if either has a `warm_up` method.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.run"></a>

#### MultiQueryEmbeddingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(
    queries: list[str],
    retriever_kwargs: dict[str, Any] | None = None
) -> dict[str, list[Document]]
```

Retrieve documents using multiple queries in parallel.

**Arguments**:

- `queries`: List of text queries to process.
- `retriever_kwargs`: Optional dictionary of arguments to pass to the retriever's run method.

**Returns**:

A dictionary containing:
- `documents`: List of retrieved documents sorted by relevance score.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.to_dict"></a>

#### MultiQueryEmbeddingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

A dictionary representing the serialized component.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.from_dict"></a>

#### MultiQueryEmbeddingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MultiQueryEmbeddingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="multi_query_text_retriever"></a>

## Module multi\_query\_text\_retriever

<a id="multi_query_text_retriever.MultiQueryTextRetriever"></a>

### MultiQueryTextRetriever

A component that retrieves documents using multiple queries in parallel with a text-based retriever.

This component takes a list of text queries and uses a text-based retriever to find relevant documents for each
query in parallel, using a thread pool to manage concurrent execution. The results are combined and sorted by
relevance score.

You can use this component in combination with the QueryExpander component to enhance the retrieval process.

### Usage example
```python
from haystack import Document
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.components.query import QueryExpander
from haystack.components.retrievers.multi_query_text_retriever import MultiQueryTextRetriever

documents = [
    Document(content="Renewable energy is energy that is collected from renewable resources."),
    Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
    Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
    Document(content="Hydropower is a form of renewable energy using the flow of water to generate electricity."),
    Document(content="Geothermal energy is heat that comes from the sub-surface of the earth.")
]

document_store = InMemoryDocumentStore()
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
doc_writer.run(documents=documents)

in_memory_retriever = InMemoryBM25Retriever(document_store=document_store, top_k=1)
multiquery_retriever = MultiQueryTextRetriever(retriever=in_memory_retriever)
results = multiquery_retriever.run(queries=["renewable energy?", "Geothermal", "Hydropower"])
for doc in results["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
# >>
# >> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 1.6474448833731097
# >> Content: Hydropower is a form of renewable energy using the flow of water to generate electricity., Score: 1.615
# >> Content: Renewable energy is energy that is collected from renewable resources., Score: 1.5255309812344944
```

<a id="multi_query_text_retriever.MultiQueryTextRetriever.__init__"></a>

#### MultiQueryTextRetriever.\_\_init\_\_

```python
def __init__(*, retriever: TextRetriever, max_workers: int = 3) -> None
```

Initialize MultiQueryTextRetriever.

**Arguments**:

- `retriever`: The text-based retriever to use for document retrieval.
- `max_workers`: Maximum number of worker threads for parallel processing. Default is 3.

<a id="multi_query_text_retriever.MultiQueryTextRetriever.warm_up"></a>

#### MultiQueryTextRetriever.warm\_up

```python
def warm_up() -> None
```

Warm up the retriever if it has a `warm_up` method.

<a id="multi_query_text_retriever.MultiQueryTextRetriever.run"></a>

#### MultiQueryTextRetriever.run

```python
@component.output_types(documents=list[Document])
def run(
    queries: list[str],
    retriever_kwargs: dict[str, Any] | None = None
) -> dict[str, list[Document]]
```

Retrieve documents using multiple queries in parallel.

**Arguments**:

- `queries`: List of text queries to process.
- `retriever_kwargs`: Optional dictionary of arguments to pass to the retriever's run method.

**Returns**:

A dictionary containing:
- `documents`: List of retrieved documents sorted by relevance score.

<a id="multi_query_text_retriever.MultiQueryTextRetriever.to_dict"></a>

#### MultiQueryTextRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

The serialized component as a dictionary.

<a id="multi_query_text_retriever.MultiQueryTextRetriever.from_dict"></a>

#### MultiQueryTextRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MultiQueryTextRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="sentence_window_retriever"></a>

## Module sentence\_window\_retriever

<a id="sentence_window_retriever.SentenceWindowRetriever"></a>

### SentenceWindowRetriever

Retrieves neighboring documents from a DocumentStore to provide context for query results.

This component is intended to be used after a Retriever (e.g., BM25Retriever, EmbeddingRetriever).
It enhances retrieved results by fetching adjacent document chunks to give
additional context for the user.

The documents must include metadata indicating their origin and position:
- `source_id` is used to group sentence chunks belonging to the same original document.
- `split_id` represents the position/order of the chunk within the document.

The number of adjacent documents to include on each side of the retrieved document can be configured using the
`window_size` parameter. You can also specify which metadata fields to use for source and split ID
via `source_id_meta_field` and `split_id_meta_field`.
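
The windowing described above can be sketched in plain Python. This is a minimal illustration, not the component's implementation: `context_window` is a hypothetical helper, and chunks are modelled as plain dictionaries carrying `source_id` and `split_id` instead of Haystack `Document` objects.

```python
def context_window(hit, chunks, window_size):
    """Return the chunks around `hit` from the same source, ordered by position."""
    low = hit["split_id"] - window_size
    high = hit["split_id"] + window_size
    neighbors = [
        chunk
        for chunk in chunks
        # Same original document, and within window_size positions of the hit.
        if chunk["source_id"] == hit["source_id"] and low <= chunk["split_id"] <= high
    ]
    return sorted(neighbors, key=lambda chunk: chunk["split_id"])


# Toy chunks: six from source "a", two from source "b".
chunks = [{"source_id": "a", "split_id": i, "content": f"a{i}"} for i in range(6)]
chunks += [{"source_id": "b", "split_id": i, "content": f"b{i}"} for i in range(2)]

hit = chunks[3]  # the chunk a previous retriever matched: source "a", position 3
window = context_window(hit, chunks, window_size=1)
print([chunk["content"] for chunk in window])  # ['a2', 'a3', 'a4']
```

Grouping by `source_id` keeps chunks from other documents out of the window even when their `split_id` values overlap, which is why both metadata fields are required.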

The SentenceWindowRetriever is compatible with the following DocumentStores:
- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)

### Usage example

```python
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import SentenceWindowRetriever
from haystack.components.preprocessors import DocumentSplitter
from haystack.document_stores.in_memory import InMemoryDocumentStore

splitter = DocumentSplitter(split_length=10, split_overlap=5, split_by="word")
text = (
    "This is a text with some words. There is a second sentence. And there is also a third sentence. "
    "It also contains a fourth sentence. And a fifth sentence. And a sixth sentence. And a seventh sentence"
)
doc = Document(content=text)
docs = splitter.run([doc])
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs["documents"])

rag = Pipeline()
rag.add_component("bm25_retriever", InMemoryBM25Retriever(doc_store, top_k=1))
rag.add_component("sentence_window_retriever", SentenceWindowRetriever(document_store=doc_store, window_size=2))
rag.connect("bm25_retriever", "sentence_window_retriever")

rag.run({'bm25_retriever': {"query":"third"}})

# >> {'sentence_window_retriever': {'context_windows': ['some words. There is a second sentence.
# >> And there is also a third sentence. It also contains a fourth sentence. And a fifth sentence. And a sixth
# >> sentence. And a'], 'context_documents': [[Document(id=..., content: 'some words. There is a second sentence.
# >> And there is ', meta: {'source_id': '...', 'page_number': 1, 'split_id': 1, 'split_idx_start': 20,
# >> '_split_overlap': [{'doc_id': '...', 'range': (20, 43)}, {'doc_id': '...', 'range': (0, 30)}]}),
# >> Document(id=..., content: 'second sentence. And there is also a third sentence. It ',
# >> meta: {'source_id': '74ea87deb38012873cf8c07e...f19d01a26a098447113e1d7b83efd30c02987114', 'page_number': 1,
# >> 'split_id': 2, 'split_idx_start': 43, '_split_overlap': [{'doc_id': '...', 'range': (23, 53)}, {'doc_id': '.',
# >> 'range': (0, 26)}]}), Document(id=..., content: 'also a third sentence. It also contains a fourth sentence. ',
# >> meta: {'source_id': '...', 'page_number': 1, 'split_id': 3, 'split_idx_start': 73, '_split_overlap':
# >> [{'doc_id': '...', 'range': (30, 56)}, {'doc_id': '...', 'range': (0, 33)}]}), Document(id=..., content:
# >> 'also contains a fourth sentence. And a fifth sentence. And ', meta: {'source_id': '...', 'page_number': 1,
# >> 'split_id': 4, 'split_idx_start': 99, '_split_overlap': [{'doc_id': '...', 'range': (26, 59)},
# >> {'doc_id': '...', 'range': (0, 26)}]}), Document(id=..., content: 'And a fifth sentence. And a sixth sentence.
# >> And a ', meta: {'source_id': '...', 'page_number': 1, 'split_id': 5, 'split_idx_start': 132,
# >> '_split_overlap': [{'doc_id': '...', 'range': (33, 59)}, {'doc_id': '...', 'range': (0, 24)}]})]]}}
```

<a id="sentence_window_retriever.SentenceWindowRetriever.__init__"></a>

#### SentenceWindowRetriever.\_\_init\_\_

```python
def __init__(document_store: DocumentStore,
             window_size: int = 3,
             *,
             source_id_meta_field: str | list[str] = "source_id",
             split_id_meta_field: str = "split_id",
             raise_on_missing_meta_fields: bool = True)
```

Creates a new SentenceWindowRetriever component.

**Arguments**:

- `document_store`: The Document Store to retrieve the surrounding documents from.
- `window_size`: The number of documents to retrieve before and after the relevant one.
For example, `window_size: 2` fetches 2 preceding and 2 following documents.
- `source_id_meta_field`: The metadata field that contains the source ID of the document.
This can be a single field or a list of fields. If multiple fields are provided, the retriever
treats documents as part of the same source only if all the fields match.
- `split_id_meta_field`: The metadata field that contains the split ID of the document.
- `raise_on_missing_meta_fields`: If True, raises an error if the documents do not contain the required
metadata fields. If False, skips retrieving the context for documents that are missing
the required metadata fields, but still includes the original document in the results.

<a id="sentence_window_retriever.SentenceWindowRetriever.merge_documents_text"></a>

#### SentenceWindowRetriever.merge\_documents\_text

```python
@staticmethod
def merge_documents_text(documents: list[Document]) -> str
```

Merges the text of a list of documents into a single string.

This function concatenates the textual content of a list of documents into a single string, eliminating any
overlapping content.

**Arguments**:

- `documents`: List of Documents to merge.

<a id="sentence_window_retriever.SentenceWindowRetriever.to_dict"></a>

#### SentenceWindowRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.
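
The overlap elimination described for `merge_documents_text` can be sketched as follows. This is an illustrative approximation, not the library's code: it assumes each chunk records the character offset where it starts in the original text (a `split_idx_start` value, as in the usage example output), and the `merge_chunks` helper and plain-dict chunks are hypothetical.

```python
def merge_chunks(chunks: list[dict]) -> str:
    """Concatenate chunk texts in source order, skipping characters already emitted."""
    merged = ""
    covered_until = 0  # offset in the original text up to which text was emitted
    for chunk in sorted(chunks, key=lambda c: c["split_idx_start"]):
        start = chunk["split_idx_start"]
        # Number of leading characters that overlap with earlier chunks.
        skip = max(covered_until - start, 0)
        merged += chunk["content"][skip:]
        covered_until = max(covered_until, start + len(chunk["content"]))
    return merged


text = "The sun rose early. Birds began to sing."
chunks = [
    {"content": text[10:30], "split_idx_start": 10},
    {"content": text[0:20], "split_idx_start": 0},
    {"content": text[25:40], "split_idx_start": 25},
]
merged = merge_chunks(chunks)
print(merged)
```

Sorting by start offset first means the chunks can arrive in any order, which matters when they come back from a store query rather than from the splitter directly.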

<a id="sentence_window_retriever.SentenceWindowRetriever.from_dict"></a>

#### SentenceWindowRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "SentenceWindowRetriever"
```

Deserializes the component from a dictionary.

**Returns**:

Deserialized component.

<a id="sentence_window_retriever.SentenceWindowRetriever.run"></a>

#### SentenceWindowRetriever.run

```python
@component.output_types(context_windows=list[str],
                        context_documents=list[Document])
def run(retrieved_documents: list[Document], window_size: int | None = None)
```

Gets the surrounding documents from the document store, based on the `source_id` and `split_id` metadata of each
retrieved document.

Implements the logic behind the sentence-window technique, retrieving the surrounding documents of a given
document from the document store.

**Arguments**:

- `retrieved_documents`: List of retrieved documents from the previous retriever.
- `window_size`: The number of documents to retrieve before and after the relevant one. This overrides
the `window_size` parameter set in the constructor.

**Returns**:

A dictionary with the following keys:
- `context_windows`: A list of strings, where each string represents the concatenated text from the
context window of the corresponding document in `retrieved_documents`.
- `context_documents`: A list of `Document` objects, containing the retrieved documents plus the context
documents surrounding them. The documents are sorted by the `split_idx_start`
meta field.
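
The relationship between the constructor's `window_size` and the per-call `window_size` argument of `run` can be sketched with a small stand-in class. `WindowSketch` is hypothetical and mimics only the override behaviour and the `context_windows` key; it works on a list of strings rather than on a Document Store.

```python
from typing import Optional


class WindowSketch:
    """Hypothetical stand-in that mimics only the `window_size` handling."""

    def __init__(self, chunks: list[str], window_size: int = 3):
        self.chunks = chunks
        self.window_size = window_size

    def run(self, hit_positions: list[int], window_size: Optional[int] = None) -> dict:
        # A per-call value takes precedence over the constructor default.
        size = self.window_size if window_size is None else window_size
        windows = []
        for pos in hit_positions:
            start = max(pos - size, 0)
            windows.append(" ".join(self.chunks[start : pos + size + 1]))
        return {"context_windows": windows}


sketch = WindowSketch(["s0", "s1", "s2", "s3", "s4", "s5", "s6"], window_size=3)
default_result = sketch.run([3])
narrow_result = sketch.run([3], window_size=1)
print(default_result["context_windows"], narrow_result["context_windows"])
```

The override applies to a single call only, so one configured component can serve requests that need different amounts of context.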

<a id="sentence_window_retriever.SentenceWindowRetriever.run_async"></a>

#### SentenceWindowRetriever.run\_async

```python
@component.output_types(context_windows=list[str],
                        context_documents=list[Document])
async def run_async(retrieved_documents: list[Document],
                    window_size: int | None = None)
```

Gets the surrounding documents from the document store, based on the `source_id` and `split_id` metadata of each
retrieved document.

Implements the logic behind the sentence-window technique, retrieving the surrounding documents of a given
document from the document store.

**Arguments**:

- `retrieved_documents`: List of retrieved documents from the previous retriever.
- `window_size`: The number of documents to retrieve before and after the relevant one. This overrides
the `window_size` parameter set in the constructor.

**Returns**:

A dictionary with the following keys:
- `context_windows`: A list of strings, where each string represents the concatenated text from the
context window of the corresponding document in `retrieved_documents`.
- `context_documents`: A list of `Document` objects, containing the retrieved documents plus the context
documents surrounding them. The documents are sorted by the `split_idx_start`
meta field.
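
As a rough illustration of why an asynchronous variant is useful, the sketch below awaits several window lookups concurrently with `asyncio.gather`. It does not reflect the internals of `run_async`: `fetch_window`, `fetch_all`, and the dictionary-based store are hypothetical, and `asyncio.sleep(0)` merely stands in for a non-blocking Document Store call.

```python
import asyncio


async def fetch_window(store: dict, source_id: str, split_id: int, window_size: int) -> list[str]:
    """Hypothetical async lookup of the chunks around one hit."""
    await asyncio.sleep(0)  # stands in for a non-blocking document store call
    chunks = store[source_id]
    start = max(split_id - window_size, 0)
    return chunks[start : split_id + window_size + 1]


async def fetch_all(store: dict, hits: list[tuple], window_size: int) -> list[list[str]]:
    """Await the lookups for all hits concurrently, keeping the input order."""
    tasks = [fetch_window(store, source_id, pos, window_size) for source_id, pos in hits]
    return await asyncio.gather(*tasks)


store = {"doc-a": ["a0", "a1", "a2", "a3"], "doc-b": ["b0", "b1"]}
windows = asyncio.run(fetch_all(store, [("doc-a", 2), ("doc-b", 0)], window_size=1))
print(windows)
```

`asyncio.gather` returns results in the order the awaitables were passed, so each window still lines up with its hit.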