retrievers_api.md
---
title: "Retrievers"
id: retrievers-api
description: "Sweeps through a Document Store and returns a set of candidate Documents that are relevant to the query."
slug: "/retrievers-api"
---

<a id="auto_merging_retriever"></a>

## Module auto\_merging\_retriever

<a id="auto_merging_retriever.AutoMergingRetriever"></a>

### AutoMergingRetriever

A retriever that returns the parent documents of matched leaf documents, based on a threshold setting.

The AutoMergingRetriever assumes you have a hierarchical tree structure of documents, where the leaf nodes
are indexed in a document store. See the HierarchicalDocumentSplitter for more information on how to create
such a structure. During retrieval, if the fraction of matched leaf documents below the same parent is
higher than a defined threshold, the retriever returns the parent document instead of the individual leaf
documents.

The rationale: when a paragraph is split into multiple chunks represented as leaf documents and a query
matches several of those chunks, the whole paragraph may be more informative than the individual
chunks alone.
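
The threshold rule described above can be sketched in plain Python. This is a minimal, single-level illustration under stated assumptions, not the library's implementation: `merge_by_threshold`, the `(leaf_id, parent_id)` tuples, and the `children_of` mapping are hypothetical names, and the strict `>` comparison follows the wording "higher than a defined threshold".

```python
from collections import defaultdict


def merge_by_threshold(matched_leaves, children_of, threshold=0.5):
    """Replace matched leaves by their parent when enough siblings matched.

    matched_leaves: list of (leaf_id, parent_id) tuples (hypothetical shape).
    children_of: dict mapping parent_id -> list of all child ids.
    """
    # Group the matched leaves under their parent.
    by_parent = defaultdict(list)
    for leaf_id, parent_id in matched_leaves:
        by_parent[parent_id].append(leaf_id)

    result = []
    for parent_id, leaves in by_parent.items():
        ratio = len(leaves) / len(children_of[parent_id])
        if ratio > threshold:
            result.append(parent_id)  # enough children matched: return the parent
        else:
            result.extend(leaves)  # otherwise keep the individual leaves
    return result


children_of = {"p1": ["a", "b", "c"], "p2": ["d", "e", "f", "g"]}
matched = [("a", "p1"), ("b", "p1"), ("d", "p2")]
merged = merge_by_threshold(matched, children_of, threshold=0.5)
print(merged)
```

Here two of the three children of `p1` matched (2/3 > 0.5), so `p1` replaces them, while only one of the four children of `p2` matched, so that leaf is kept as-is.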

Currently, the AutoMergingRetriever can be used only with the following DocumentStores:
- [AstraDB](https://haystack.deepset.ai/integrations/astradb)
- [ElasticSearch](https://haystack.deepset.ai/docs/latest/documentstore/elasticsearch)
- [OpenSearch](https://haystack.deepset.ai/docs/latest/documentstore/opensearch)
- [PGVector](https://haystack.deepset.ai/docs/latest/documentstore/pgvector)
- [Qdrant](https://haystack.deepset.ai/docs/latest/documentstore/qdrant)

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# create a hierarchical document structure with 3 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]

# store level-1 parent documents and initialize the retriever
doc_store_parents = InMemoryDocumentStore()
for doc in docs:
    if doc.meta["__children_ids"] and doc.meta["__level"] in [0,1]: # store the root document and level 1 documents
        doc_store_parents.write_documents([doc])

retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)

# assume we retrieved 2 leaf docs from the same parent, the parent document should be returned,
# since it has 3 children and the threshold=0.5, and we retrieved 2 children (2/3 ≈ 0.67 > 0.5)
leaf_docs = [doc for doc in docs if not doc.meta["__children_ids"]]
retrieved_docs = retriever.run(leaf_docs[4:6])
print(retrieved_docs["documents"])
# [Document(id=538..),
# content: 'warm glow over the trees. Birds began to sing.',
# meta: {'block_size': 10, 'parent_id': '835..', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
# 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]}
```

<a id="auto_merging_retriever.AutoMergingRetriever.__init__"></a>

#### AutoMergingRetriever.\_\_init\_\_

```python
def __init__(document_store: DocumentStore, threshold: float = 0.5)
```

Initialize the AutoMergingRetriever.

**Arguments**:

- `document_store`: DocumentStore from which to retrieve the parent documents
- `threshold`: Threshold to decide whether the parent instead of the individual documents is returned

<a id="auto_merging_retriever.AutoMergingRetriever.to_dict"></a>

#### AutoMergingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="auto_merging_retriever.AutoMergingRetriever.from_dict"></a>

#### AutoMergingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AutoMergingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="auto_merging_retriever.AutoMergingRetriever.run"></a>

#### AutoMergingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Run the AutoMergingRetriever.

Recursively groups documents by their parents and merges them if they meet the threshold,
continuing up the hierarchy until no more merges are possible.

**Arguments**:

- `documents`: List of leaf documents that were matched by a retriever

**Returns**:

List of documents (could be a mix of different hierarchy levels)

<a id="auto_merging_retriever.AutoMergingRetriever.run_async"></a>

#### AutoMergingRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(documents: list[Document])
```

Asynchronously run the AutoMergingRetriever.

Recursively groups documents by their parents and merges them if they meet the threshold,
continuing up the hierarchy until no more merges are possible.

**Arguments**:

- `documents`: List of leaf documents that were matched by a retriever

**Returns**:

List of documents (could be a mix of different hierarchy levels)

<a id="filter_retriever"></a>

## Module filter\_retriever

<a id="filter_retriever.FilterRetriever"></a>

### FilterRetriever

Retrieves documents that match the provided filters.

### Usage example

```python
from haystack import Document
from haystack.components.retrievers import FilterRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language", meta={"lang": "en"}),
    Document(content="python ist eine beliebte Programmiersprache", meta={"lang": "de"}),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = FilterRetriever(doc_store, filters={"field": "lang", "operator": "==", "value": "en"})

# if passed in the run method, filters override those provided at initialization
result = retriever.run(filters={"field": "lang", "operator": "==", "value": "de"})

print(result["documents"])
```

<a id="filter_retriever.FilterRetriever.__init__"></a>

#### FilterRetriever.\_\_init\_\_

```python
def __init__(document_store: DocumentStore,
             filters: dict[str, Any] | None = None)
```

Create the FilterRetriever component.

**Arguments**:

- `document_store`: An instance of a Document Store to use with the Retriever.
- `filters`: A dictionary with filters to narrow down the search space.

<a id="filter_retriever.FilterRetriever.to_dict"></a>

#### FilterRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="filter_retriever.FilterRetriever.from_dict"></a>

#### FilterRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "FilterRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="filter_retriever.FilterRetriever.run"></a>

#### FilterRetriever.run

```python
@component.output_types(documents=list[Document])
def run(filters: dict[str, Any] | None = None)
```

Run the FilterRetriever on the given input data.

**Arguments**:

- `filters`: A dictionary with filters to narrow down the search space.
If not specified, the FilterRetriever uses the values provided at initialization.

**Returns**:

A list of retrieved documents.

<a id="filter_retriever.FilterRetriever.run_async"></a>

#### FilterRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(filters: dict[str, Any] | None = None)
```

Asynchronously run the FilterRetriever on the given input data.

**Arguments**:

- `filters`: A dictionary with filters to narrow down the search space.
If not specified, the FilterRetriever uses the values provided at initialization.

**Returns**:

A list of retrieved documents.

<a id="in_memory/bm25_retriever"></a>

## Module in\_memory/bm25\_retriever

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever"></a>

### InMemoryBM25Retriever

Retrieves documents that are most similar to the query using a keyword-based algorithm.

Use this retriever with the InMemoryDocumentStore.

### Usage example

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store)

result = retriever.run(query="Programmiersprache")

print(result["documents"])
```

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.__init__"></a>

#### InMemoryBM25Retriever.\_\_init\_\_

```python
def __init__(document_store: InMemoryDocumentStore,
             filters: dict[str, Any] | None = None,
             top_k: int = 10,
             scale_score: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)
```

Create the InMemoryBM25Retriever component.

**Arguments**:

- `document_store`: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
- `filters`: A dictionary with filters to narrow down the retriever's search space in the document store.
- `top_k`: The maximum number of documents to retrieve.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `filter_policy`: The filter policy to apply during retrieval.
Filter policy determines how filters are applied when retrieving documents. You can choose:
  - `REPLACE` (default): Overrides the initialization filters with the filters specified at runtime.
  Use this policy to dynamically change filtering for specific queries.
  - `MERGE`: Combines runtime filters with initialization filters to narrow down the search.
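
The difference between the two filter policies can be illustrated with a small standalone sketch. It does not use the library: `Policy` and `resolve_filters` are hypothetical names, and combining both filters under a logical `AND` is an assumption about what "combines" means for `MERGE`.

```python
from enum import Enum
from typing import Any, Optional


class Policy(Enum):
    """Hypothetical stand-in for the two documented policies."""
    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(
    policy: Policy,
    init_filters: Optional[dict[str, Any]],
    runtime_filters: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Decide which filters a retrieval call would end up using."""
    if runtime_filters is None:
        # Nothing passed at runtime: fall back to the initialization filters.
        return init_filters
    if policy is Policy.REPLACE or init_filters is None:
        # REPLACE: runtime filters override the initialization filters.
        return runtime_filters
    # MERGE (assumption): both conditions must hold, so wrap them in an AND.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init = {"field": "lang", "operator": "==", "value": "en"}
runtime = {"field": "year", "operator": ">=", "value": 2020}

replaced = resolve_filters(Policy.REPLACE, init, runtime)
merged_filters = resolve_filters(Policy.MERGE, init, runtime)
print(replaced)
print(merged_filters)
```

With `REPLACE`, only the runtime condition on `year` remains. With `MERGE`, the search is narrowed to documents that satisfy both conditions.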

**Raises**:

- `ValueError`: If the specified `top_k` is not > 0.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.to_dict"></a>

#### InMemoryBM25Retriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.from_dict"></a>

#### InMemoryBM25Retriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "InMemoryBM25Retriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.run"></a>

#### InMemoryBM25Retriever.run

```python
@component.output_types(documents=list[Document])
def run(query: str,
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None)
```

Run the InMemoryBM25Retriever on the given input data.

**Arguments**:

- `query`: The query string for the Retriever.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.run_async"></a>

#### InMemoryBM25Retriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(query: str,
                    filters: dict[str, Any] | None = None,
                    top_k: int | None = None,
                    scale_score: bool | None = None)
```

Asynchronously run the InMemoryBM25Retriever on the given input data.

**Arguments**:

- `query`: The query string for the Retriever.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="in_memory/embedding_retriever"></a>

## Module in\_memory/embedding\_retriever

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever"></a>

### InMemoryEmbeddingRetriever

Retrieves documents that are most semantically similar to the query.

Use this retriever with the InMemoryDocumentStore.

When using this retriever, make sure it has query and document embeddings available.
In indexing pipelines, use a DocumentEmbedder to embed documents.
In query pipelines, use a TextEmbedder to embed queries and send them to the retriever.

### Usage example
```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]
doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(docs)["documents"]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs_with_embeddings)
retriever = InMemoryEmbeddingRetriever(doc_store)

query = "Programmiersprache"
text_embedder = SentenceTransformersTextEmbedder()
text_embedder.warm_up()
query_embedding = text_embedder.run(query)["embedding"]

result = retriever.run(query_embedding=query_embedding)

print(result["documents"])
```

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.__init__"></a>

#### InMemoryEmbeddingRetriever.\_\_init\_\_

```python
def __init__(document_store: InMemoryDocumentStore,
             filters: dict[str, Any] | None = None,
             top_k: int = 10,
             scale_score: bool = False,
             return_embedding: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)
```

Create the InMemoryEmbeddingRetriever component.

**Arguments**:

- `document_store`: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
- `filters`: A dictionary with filters to narrow down the retriever's search space in the document store.
- `top_k`: The maximum number of documents to retrieve.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.
- `filter_policy`: The filter policy to apply during retrieval.
Filter policy determines how filters are applied when retrieving documents. You can choose:
  - `REPLACE` (default): Overrides the initialization filters with the filters specified at runtime.
  Use this policy to dynamically change filtering for specific queries.
  - `MERGE`: Combines runtime filters with initialization filters to narrow down the search.

**Raises**:

- `ValueError`: If the specified `top_k` is not > 0.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.to_dict"></a>

#### InMemoryEmbeddingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.from_dict"></a>

#### InMemoryEmbeddingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "InMemoryEmbeddingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.run"></a>

#### InMemoryEmbeddingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(query_embedding: list[float],
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None,
        return_embedding: bool | None = None)
```

Run the InMemoryEmbeddingRetriever on the given input data.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.run_async"></a>

#### InMemoryEmbeddingRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(query_embedding: list[float],
                    filters: dict[str, Any] | None = None,
                    top_k: int | None = None,
                    scale_score: bool | None = None,
                    return_embedding: bool | None = None)
```

Asynchronously run the InMemoryEmbeddingRetriever on the given input data.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="multi_query_embedding_retriever"></a>

## Module multi\_query\_embedding\_retriever

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever"></a>

### MultiQueryEmbeddingRetriever

A component that retrieves documents using multiple queries in parallel with an embedding-based retriever.

This component takes a list of text queries, converts them to embeddings using a query embedder,
and then uses an embedding-based retriever to find relevant documents for each query in parallel.
The results are combined and sorted by relevance score.
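
The embed, retrieve, combine, and sort flow described above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the component's code: `multi_query_retrieve`, `toy_embed`, and `toy_retrieve` are hypothetical names, documents are plain dictionaries with a `score` key, and a thread pool is assumed as the parallelism mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def multi_query_retrieve(queries, embed, retrieve, max_workers=3):
    """Embed each query, retrieve in parallel, then combine and sort by score."""

    def run_one(query):
        # One unit of work: text query -> embedding -> retrieved documents.
        return retrieve(embed(query))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input queries.
        per_query = list(pool.map(run_one, queries))

    combined = [doc for docs in per_query for doc in docs]
    return sorted(combined, key=lambda doc: doc["score"], reverse=True)


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_embed(query):
    return [float(len(query))]


def toy_retrieve(embedding):
    return [{"content": f"doc-{int(embedding[0])}", "score": embedding[0] / 10}]


results = multi_query_retrieve(["aa", "bbbb", "c"], toy_embed, toy_retrieve)
for doc in results:
    print(doc["content"], doc["score"])
```

The final sort is what makes the output independent of which query finished first.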

### Usage example

```python
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers import MultiQueryEmbeddingRetriever

documents = [
    Document(content="Renewable energy is energy that is collected from renewable resources."),
    Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
    Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
    Document(content="Geothermal energy is heat that comes from the sub-surface of the earth."),
    Document(content="Biomass energy is produced from organic materials, such as plant and animal waste."),
    Document(content="Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources."),
]

# Populate the document store
doc_store = InMemoryDocumentStore()
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()
doc_writer = DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.SKIP)
documents = doc_embedder.run(documents)["documents"]
doc_writer.run(documents=documents)

# Run the multi-query retriever
in_memory_retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=1)
query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

multi_query_retriever = MultiQueryEmbeddingRetriever(
    retriever=in_memory_retriever,
    query_embedder=query_embedder,
    max_workers=3
)

queries = ["Geothermal energy", "natural gas", "turbines"]
result = multi_query_retriever.run(queries=queries)
for doc in result["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
# >> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 0.8509603046266574
# >> Content: Renewable energy is energy that is collected from renewable resources., Score: 0.42763211298893034
# >> Content: Solar energy is a type of green energy that is harnessed from the sun., Score: 0.40077417016494354
# >> Content: Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources., Score: 0.3774863680
# >> Content: Wind energy is another type of green energy that is generated by wind turbines., Score: 0.30914239725622
# >> Content: Biomass energy is produced from organic materials, such as plant and animal waste., Score: 0.25173074243
```

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.__init__"></a>

#### MultiQueryEmbeddingRetriever.\_\_init\_\_

```python
def __init__(*,
             retriever: EmbeddingRetriever,
             query_embedder: TextEmbedder,
             max_workers: int = 3) -> None
```

Initialize MultiQueryEmbeddingRetriever.

**Arguments**:

- `retriever`: The embedding-based retriever to use for document retrieval.
- `query_embedder`: The query embedder to convert text queries to embeddings.
- `max_workers`: Maximum number of worker threads for parallel processing.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.warm_up"></a>

#### MultiQueryEmbeddingRetriever.warm\_up

```python
def warm_up() -> None
```

Warm up the query embedder and the retriever if either has a `warm_up` method.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.run"></a>

#### MultiQueryEmbeddingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(
    queries: list[str],
    retriever_kwargs: dict[str, Any] | None = None
) -> dict[str, list[Document]]
```

Retrieve documents using multiple queries in parallel.

**Arguments**:

- `queries`: List of text queries to process.
- `retriever_kwargs`: Optional dictionary of arguments to pass to the retriever's run method.

**Returns**:

A dictionary containing:
- `documents`: List of retrieved documents sorted by relevance score.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.to_dict"></a>

#### MultiQueryEmbeddingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

A dictionary representing the serialized component.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.from_dict"></a>

#### MultiQueryEmbeddingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MultiQueryEmbeddingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="multi_query_text_retriever"></a>

## Module multi\_query\_text\_retriever

<a id="multi_query_text_retriever.MultiQueryTextRetriever"></a>

### MultiQueryTextRetriever

A component that retrieves documents using multiple queries in parallel with a text-based retriever.

This component takes a list of text queries and uses a text-based retriever to find relevant documents for each
query in parallel, using a thread pool to manage concurrent execution. The results are combined and sorted by
relevance score.

You can use this component together with the QueryExpander component to enhance the retrieval process.

### Usage example
```python
from haystack import Document
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.components.query import QueryExpander
from haystack.components.retrievers.multi_query_text_retriever import MultiQueryTextRetriever

documents = [
    Document(content="Renewable energy is energy that is collected from renewable resources."),
    Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
    Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
    Document(content="Hydropower is a form of renewable energy using the flow of water to generate electricity."),
    Document(content="Geothermal energy is heat that comes from the sub-surface of the earth.")
]

document_store = InMemoryDocumentStore()
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
doc_writer.run(documents=documents)

in_memory_retriever = InMemoryBM25Retriever(document_store=document_store, top_k=1)
multiquery_retriever = MultiQueryTextRetriever(retriever=in_memory_retriever)
results = multiquery_retriever.run(queries=["renewable energy?", "Geothermal", "Hydropower"])
for doc in results["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
# >>
# >> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 1.6474448833731097
# >> Content: Hydropower is a form of renewable energy using the flow of water to generate electricity., Score: 1.615
# >> Content: Renewable energy is energy that is collected from renewable resources., Score: 1.5255309812344944
```

<a id="multi_query_text_retriever.MultiQueryTextRetriever.__init__"></a>

#### MultiQueryTextRetriever.\_\_init\_\_

```python
def __init__(*, retriever: TextRetriever, max_workers: int = 3) -> None
```

Initialize MultiQueryTextRetriever.

**Arguments**:

- `retriever`: The text-based retriever to use for document retrieval.
- `max_workers`: Maximum number of worker threads for parallel processing. Default is 3.

<a id="multi_query_text_retriever.MultiQueryTextRetriever.warm_up"></a>

#### MultiQueryTextRetriever.warm\_up

```python
def warm_up() -> None
```

Warm up the retriever if it has a `warm_up` method.

<a id="multi_query_text_retriever.MultiQueryTextRetriever.run"></a>

#### MultiQueryTextRetriever.run

```python
@component.output_types(documents=list[Document])
def run(
    queries: list[str],
    retriever_kwargs: dict[str, Any] | None = None
) -> dict[str, list[Document]]
```

Retrieve documents using multiple queries in parallel.

**Arguments**:

- `queries`: List of text queries to process.
- `retriever_kwargs`: Optional dictionary of arguments to pass to the retriever's run method.

**Returns**:

A dictionary containing:
- `documents`: List of retrieved documents sorted by relevance score.

<a id="multi_query_text_retriever.MultiQueryTextRetriever.to_dict"></a>

#### MultiQueryTextRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

The serialized component as a dictionary.

<a id="multi_query_text_retriever.MultiQueryTextRetriever.from_dict"></a>

#### MultiQueryTextRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MultiQueryTextRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="sentence_window_retriever"></a>

## Module sentence\_window\_retriever

<a id="sentence_window_retriever.SentenceWindowRetriever"></a>

### SentenceWindowRetriever

Retrieves neighboring documents from a DocumentStore to provide context for query results.

This component is intended to be used after a Retriever (e.g., BM25Retriever, EmbeddingRetriever).
It enhances retrieved results by fetching adjacent document chunks to give
additional context for the user.

The documents must include metadata indicating their origin and position:
- `source_id` is used to group sentence chunks belonging to the same original document.
- `split_id` represents the position/order of the chunk within the document.

The number of adjacent documents to include on each side of the retrieved document can be configured using the
`window_size` parameter. You can also specify which metadata fields to use for source and split ID
via `source_id_meta_field` and `split_id_meta_field`.
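
The window selection described above can be sketched without the library. This is a simplified illustration, not the component's implementation: `context_window` is a hypothetical helper, chunks are plain dictionaries with a `meta` mapping, and the neighbors are assumed to be the chunks from the same source whose split ID lies within `window_size` of the retrieved chunk.

```python
def context_window(
    chunks,
    hit,
    window_size=2,
    source_id_meta_field="source_id",
    split_id_meta_field="split_id",
):
    """Return the retrieved chunk plus up to `window_size` neighbors on each side."""
    source = hit["meta"][source_id_meta_field]
    position = hit["meta"][split_id_meta_field]

    neighbors = [
        chunk
        for chunk in chunks
        # Same original document, and close enough in position.
        if chunk["meta"][source_id_meta_field] == source
        and abs(chunk["meta"][split_id_meta_field] - position) <= window_size
    ]
    # Restore reading order so the window can be joined into one passage.
    return sorted(neighbors, key=lambda chunk: chunk["meta"][split_id_meta_field])


chunks = [
    {"content": f"chunk {i}", "meta": {"source_id": "doc-a", "split_id": i}}
    for i in range(6)
]
chunks.append({"content": "other", "meta": {"source_id": "doc-b", "split_id": 3}})

window = context_window(chunks, hit=chunks[3], window_size=1)
print([chunk["content"] for chunk in window])
```

The chunk from `doc-b` shares a split ID with the hit but is excluded because its source differs, which is why both metadata fields are needed.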

The SentenceWindowRetriever is compatible with the following DocumentStores:
- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)

### Usage example

```python
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import SentenceWindowRetriever
from haystack.components.preprocessors import DocumentSplitter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# split the text into overlapping 10-word chunks and index them
splitter = DocumentSplitter(split_length=10, split_overlap=5, split_by="word")
text = (
    "This is a text with some words. There is a second sentence. And there is also a third sentence. "
    "It also contains a fourth sentence. And a fifth sentence. And a sixth sentence. And a seventh sentence"
)
doc = Document(content=text)
docs = splitter.run([doc])
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs["documents"])

# retrieve the best-matching chunk, then expand it with 2 neighboring chunks on each side
rag = Pipeline()
rag.add_component("bm25_retriever", InMemoryBM25Retriever(doc_store, top_k=1))
rag.add_component("sentence_window_retriever", SentenceWindowRetriever(document_store=doc_store, window_size=2))
rag.connect("bm25_retriever", "sentence_window_retriever")

rag.run({"bm25_retriever": {"query": "third"}})

# >> {'sentence_window_retriever': {'context_windows': ['some words. There is a second sentence.
# >> And there is also a third sentence. It also contains a fourth sentence. And a fifth sentence. And a sixth
# >> sentence. And a'], 'context_documents': [[Document(id=..., content: 'some words. There is a second sentence.
# >> And there is ', meta: {'source_id': '...', 'page_number': 1, 'split_id': 1, 'split_idx_start': 20,
# >> '_split_overlap': [{'doc_id': '...', 'range': (20, 43)}, {'doc_id': '...', 'range': (0, 30)}]}),
# >> Document(id=..., content: 'second sentence. And there is also a third sentence. It ',
# >> meta: {'source_id': '74ea87deb38012873cf8c07e...f19d01a26a098447113e1d7b83efd30c02987114', 'page_number': 1,
# >> 'split_id': 2, 'split_idx_start': 43, '_split_overlap': [{'doc_id': '...', 'range': (23, 53)}, {'doc_id': '.',
# >> 'range': (0, 26)}]}), Document(id=..., content: 'also a third sentence. It also contains a fourth sentence. ',
# >> meta: {'source_id': '...', 'page_number': 1, 'split_id': 3, 'split_idx_start': 73, '_split_overlap':
# >> [{'doc_id': '...', 'range': (30, 56)}, {'doc_id': '...', 'range': (0, 33)}]}), Document(id=..., content:
# >> 'also contains a fourth sentence. And a fifth sentence. And ', meta: {'source_id': '...', 'page_number': 1,
# >> 'split_id': 4, 'split_idx_start': 99, '_split_overlap': [{'doc_id': '...', 'range': (26, 59)},
# >> {'doc_id': '...', 'range': (0, 26)}]}), Document(id=..., content: 'And a fifth sentence. And a sixth sentence.
# >> And a ', meta: {'source_id': '...', 'page_number': 1, 'split_id': 5, 'split_idx_start': 132,
# >> '_split_overlap': [{'doc_id': '...', 'range': (33, 59)}, {'doc_id': '...', 'range': (0, 24)}]})]]}}
```

<a id="sentence_window_retriever.SentenceWindowRetriever.__init__"></a>

#### SentenceWindowRetriever.\_\_init\_\_

```python
def __init__(document_store: DocumentStore,
             window_size: int = 3,
             *,
             source_id_meta_field: str | list[str] = "source_id",
             split_id_meta_field: str = "split_id",
             raise_on_missing_meta_fields: bool = True)
```

Creates a new SentenceWindowRetriever component.

**Arguments**:

- `document_store`: The Document Store to retrieve the surrounding documents from.
- `window_size`: The number of documents to retrieve before and after the relevant one.
For example, `window_size=2` fetches 2 preceding and 2 following documents.
- `source_id_meta_field`: The metadata field that contains the source ID of the document.
This can be a single field or a list of fields. If multiple fields are provided, the retriever
treats documents as part of the same source only if all the fields match.
- `split_id_meta_field`: The metadata field that contains the split ID of the document.
- `raise_on_missing_meta_fields`: If True, raises an error if the documents do not contain the required
metadata fields. If False, the retriever skips context retrieval for documents that are missing
the required metadata fields, but still includes the original document in the results.

<a id="sentence_window_retriever.SentenceWindowRetriever.merge_documents_text"></a>

#### SentenceWindowRetriever.merge\_documents\_text

```python
@staticmethod
def merge_documents_text(documents: list[Document]) -> str
```

Merges the text of a list of documents into a single string.

This function concatenates the textual content of a list of documents into a single string, eliminating any
overlapping content.

**Arguments**:

- `documents`: List of Documents to merge.

<a id="sentence_window_retriever.SentenceWindowRetriever.to_dict"></a>

#### SentenceWindowRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.
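
The overlap removal described for `merge_documents_text` can be sketched as follows. This is an illustration under stated assumptions, not the library's code: chunks are plain dictionaries, `merge_chunks_text` is a hypothetical helper, and each chunk is assumed to carry a `split_idx_start` character offset like the one visible in the usage example's metadata.

```python
def merge_chunks_text(chunks):
    """Concatenate chunk texts in document order, skipping already-covered characters."""
    ordered = sorted(chunks, key=lambda c: c["split_idx_start"])
    merged = ""
    covered_until = 0  # offset in the source text up to which `merged` has content
    for chunk in ordered:
        start = chunk["split_idx_start"]
        text = chunk["content"]
        if start < covered_until:
            # The chunk overlaps what was already added: keep only the new tail.
            merged += text[covered_until - start:]
        else:
            merged += text
        covered_until = max(covered_until, start + len(text))
    return merged


source = "The quick brown fox jumps over the lazy dog"
# Three overlapping slices of the source text, deliberately out of order.
chunks = [
    {"content": source[20:], "split_idx_start": 20},
    {"content": source[0:16], "split_idx_start": 0},
    {"content": source[10:26], "split_idx_start": 10},
]

merged = merge_chunks_text(chunks)
print(merged)
```

Sorting by offset first means the chunks can arrive in any order, which matters when they come back from a document store query.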

<a id="sentence_window_retriever.SentenceWindowRetriever.from_dict"></a>

#### SentenceWindowRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "SentenceWindowRetriever"
```

Deserializes the component from a dictionary.

**Returns**:

Deserialized component.

<a id="sentence_window_retriever.SentenceWindowRetriever.run"></a>

#### SentenceWindowRetriever.run

```python
@component.output_types(context_windows=list[str],
                        context_documents=list[Document])
def run(retrieved_documents: list[Document], window_size: int | None = None)
```

Retrieves the surrounding documents from the document store, based on the `source_id` and `split_id` metadata
of each retrieved document.

Implements the logic behind the sentence-window technique, retrieving the surrounding documents of a given
document from the document store.

**Arguments**:

- `retrieved_documents`: List of retrieved documents from the previous retriever.
- `window_size`: The number of documents to retrieve before and after the relevant one. This overrides
the `window_size` parameter set in the constructor.

**Returns**:

A dictionary with the following keys:
- `context_windows`: A list of strings, where each string represents the concatenated text from the
context window of the corresponding document in `retrieved_documents`.
- `context_documents`: A list of `Document` objects, containing the retrieved documents plus the context
documents surrounding them. The documents are sorted by the `split_idx_start`
meta field.
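
The relationship between the constructor's `window_size`, the per-call override, and the two outputs can be sketched with a toy class. `WindowSketch` is hypothetical and works on plain dictionaries with non-overlapping chunks, so joining the chunk texts is enough here; it only illustrates the documented behavior and is not the component's implementation.

```python
class WindowSketch:
    """Toy stand-in showing the per-call window_size override and the output shape."""

    def __init__(self, store, window_size=3):
        self.store = store
        self.window_size = window_size

    def run(self, retrieved, window_size=None):
        # A value passed to run() takes precedence over the constructor value.
        size = self.window_size if window_size is None else window_size
        context_windows, context_documents = [], []
        for hit in retrieved:
            window = [
                c for c in self.store
                if c["source_id"] == hit["source_id"]
                and abs(c["split_id"] - hit["split_id"]) <= size
            ]
            window.sort(key=lambda c: c["split_idx_start"])
            # One concatenated string per retrieved document.
            context_windows.append("".join(c["content"] for c in window))
            context_documents.extend(window)
        return {"context_windows": context_windows, "context_documents": context_documents}


# Build six non-overlapping chunks with running character offsets.
store, offset = [], 0
for i, word in enumerate(["one ", "two ", "three ", "four ", "five ", "six "]):
    store.append({"source_id": "s", "split_id": i, "split_idx_start": offset, "content": word})
    offset += len(word)

sketch = WindowSketch(store, window_size=3)
default_out = sketch.run([store[2]])
override_out = sketch.run([store[2]], window_size=1)
print(default_out["context_windows"])
print(override_out["context_windows"])
```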

<a id="sentence_window_retriever.SentenceWindowRetriever.run_async"></a>

#### SentenceWindowRetriever.run\_async

```python
@component.output_types(context_windows=list[str],
                        context_documents=list[Document])
async def run_async(retrieved_documents: list[Document],
                    window_size: int | None = None)
```

Asynchronously retrieves the surrounding documents from the document store, based on the `source_id` and
`split_id` metadata of each retrieved document.

Implements the logic behind the sentence-window technique, retrieving the surrounding documents of a given
document from the document store.

**Arguments**:

- `retrieved_documents`: List of retrieved documents from the previous retriever.
- `window_size`: The number of documents to retrieve before and after the relevant one. This overrides
the `window_size` parameter set in the constructor.

**Returns**:

A dictionary with the following keys:
- `context_windows`: A list of strings, where each string represents the concatenated text from the
context window of the corresponding document in `retrieved_documents`.
- `context_documents`: A list of `Document` objects, containing the retrieved documents plus the context
documents surrounding them. The documents are sorted by the `split_idx_start`
meta field.