---
title: "Retrievers"
id: retrievers-api
description: "Sweeps through a Document Store and returns a set of candidate Documents that are relevant to the query."
slug: "/retrievers-api"
---

<a id="auto_merging_retriever"></a>

## Module auto\_merging\_retriever

<a id="auto_merging_retriever.AutoMergingRetriever"></a>

### AutoMergingRetriever

A retriever that returns the parent documents of matched leaf documents, based on a threshold setting.

The AutoMergingRetriever assumes you have a hierarchical tree structure of documents, where the leaf nodes
are indexed in a document store. See the HierarchicalDocumentSplitter for more information on how to create
such a structure. During retrieval, if the share of matched leaf documents below the same parent is
higher than a defined threshold, the retriever returns the parent document instead of the individual leaf
documents.

The rationale is that when a paragraph is split into multiple chunks represented as leaf documents and several of
those chunks match a given query, the whole paragraph might be more informative than the individual chunks alone.

Currently the AutoMergingRetriever can only be used with the following DocumentStores:
- [AstraDB](https://haystack.deepset.ai/integrations/astradb)
- [ElasticSearch](https://haystack.deepset.ai/docs/latest/documentstore/elasticsearch)
- [OpenSearch](https://haystack.deepset.ai/docs/latest/documentstore/opensearch)
- [PGVector](https://haystack.deepset.ai/docs/latest/documentstore/pgvector)
- [Qdrant](https://haystack.deepset.ai/docs/latest/documentstore/qdrant)

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# create a hierarchical document structure with 3 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]

# store level-1 parent documents and initialize the retriever
doc_store_parents = InMemoryDocumentStore()
for doc in docs:
    if doc.meta["__children_ids"] and doc.meta["__level"] in [0, 1]:  # store the root document and level 1 documents
        doc_store_parents.write_documents([doc])

retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)

# assume we retrieved 2 leaf docs from the same parent; the parent document should be returned,
# since it has 3 children, the threshold is 0.5, and we retrieved 2 children (2/3 ≈ 0.67 > 0.5)
leaf_docs = [doc for doc in docs if not doc.meta["__children_ids"]]
retrieved_docs = retriever.run(leaf_docs[4:6])
print(retrieved_docs["documents"])
# [Document(id=538..,
# content: 'warm glow over the trees. Birds began to sing.',
# meta: {'block_size': 10, 'parent_id': '835..', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
# 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]
```

<a id="auto_merging_retriever.AutoMergingRetriever.__init__"></a>

#### AutoMergingRetriever.\_\_init\_\_

```python
def __init__(document_store: DocumentStore, threshold: float = 0.5)
```

Initialize the AutoMergingRetriever.

**Arguments**:

- `document_store`: DocumentStore from which to retrieve the parent documents
- `threshold`: Threshold to decide whether the parent instead of the individual documents is returned

<a id="auto_merging_retriever.AutoMergingRetriever.to_dict"></a>

#### AutoMergingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="auto_merging_retriever.AutoMergingRetriever.from_dict"></a>

#### AutoMergingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AutoMergingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.
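
The `to_dict` / `from_dict` pair is meant to round-trip a component through a plain dictionary. The sketch below shows that pattern on a hypothetical stand-in class in plain Python. The class name `ThresholdComponent` and the `type` / `init_parameters` layout are assumptions made for illustration, and the document store that a real AutoMergingRetriever also carries is not modeled.

```python
from typing import Any


class ThresholdComponent:
    """Hypothetical stand-in for a component with a single init parameter."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold

    def to_dict(self) -> dict[str, Any]:
        # Assumed layout: a type tag plus the init parameters needed to rebuild the object.
        return {"type": type(self).__name__, "init_parameters": {"threshold": self.threshold}}

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ThresholdComponent":
        # Rebuild the object from the stored init parameters.
        return cls(**data["init_parameters"])


original = ThresholdComponent(threshold=0.75)
restored = ThresholdComponent.from_dict(original.to_dict())
print(restored.threshold)
```

The restored object is a new instance configured with the same parameters as the original.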

<a id="auto_merging_retriever.AutoMergingRetriever.run"></a>

#### AutoMergingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Run the AutoMergingRetriever.

Recursively groups documents by their parents and merges them if they meet the threshold,
continuing up the hierarchy until no more merges are possible.

**Arguments**:

- `documents`: List of leaf documents that were matched by a retriever

**Returns**:

List of documents (could be a mix of different hierarchy levels)
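
The recursive grouping described above can be sketched in plain Python on a toy hierarchy. This is an illustration rather than the component's implementation: the `PARENT` table, the `auto_merge` function, and the strict `>` comparison against the threshold are assumptions.

```python
from collections import defaultdict

# Hypothetical hierarchy: each node id maps to its parent id (None for the root).
PARENT = {
    "root": None,
    "p1": "root", "p2": "root",
    "a": "p1", "b": "p1", "c": "p1",
    "d": "p2", "e": "p2", "f": "p2",
}

# Children per node, derived from PARENT.
CHILDREN = defaultdict(list)
for node, parent in PARENT.items():
    if parent is not None:
        CHILDREN[parent].append(node)


def auto_merge(matched: list[str], threshold: float = 0.5) -> list[str]:
    """Replace groups of siblings with their parent while the matched share exceeds the threshold."""
    by_parent = defaultdict(list)
    for node in matched:
        by_parent[PARENT[node]].append(node)

    merged, changed = [], False
    for parent, nodes in by_parent.items():
        if parent is not None and len(nodes) / len(CHILDREN[parent]) > threshold:
            merged.append(parent)  # enough siblings matched: return the parent instead
            changed = True
        else:
            merged.extend(nodes)  # keep the individual documents
    # Recurse so that merged parents can themselves be merged one level up.
    return auto_merge(merged, threshold) if changed else merged


print(auto_merge(["a", "b", "d"]))       # mixed levels: ['p1', 'd']
print(auto_merge(["a", "b", "d", "e"]))  # merges all the way up: ['root']
```

In the first call, two of the three children of `p1` match, so `p1` replaces them, while `d` stays a leaf because only one of the three children of `p2` matched. This is how the result can mix hierarchy levels.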

<a id="auto_merging_retriever.AutoMergingRetriever.run_async"></a>

#### AutoMergingRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(documents: list[Document])
```

Asynchronously run the AutoMergingRetriever.

Recursively groups documents by their parents and merges them if they meet the threshold,
continuing up the hierarchy until no more merges are possible.

**Arguments**:

- `documents`: List of leaf documents that were matched by a retriever

**Returns**:

List of documents (could be a mix of different hierarchy levels)

<a id="filter_retriever"></a>

## Module filter\_retriever

<a id="filter_retriever.FilterRetriever"></a>

### FilterRetriever

Retrieves documents that match the provided filters.

### Usage example

```python
from haystack import Document
from haystack.components.retrievers import FilterRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language", meta={"lang": "en"}),
    Document(content="python ist eine beliebte Programmiersprache", meta={"lang": "de"}),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = FilterRetriever(doc_store, filters={"field": "lang", "operator": "==", "value": "en"})

# if passed in the run method, filters override those provided at initialization
result = retriever.run(filters={"field": "lang", "operator": "==", "value": "de"})

print(result["documents"])
```

<a id="filter_retriever.FilterRetriever.__init__"></a>

#### FilterRetriever.\_\_init\_\_

```python
def __init__(document_store: DocumentStore,
             filters: dict[str, Any] | None = None)
```

Create the FilterRetriever component.

**Arguments**:

- `document_store`: An instance of a Document Store to use with the Retriever.
- `filters`: A dictionary with filters to narrow down the search space.

<a id="filter_retriever.FilterRetriever.to_dict"></a>

#### FilterRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="filter_retriever.FilterRetriever.from_dict"></a>

#### FilterRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "FilterRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="filter_retriever.FilterRetriever.run"></a>

#### FilterRetriever.run

```python
@component.output_types(documents=list[Document])
def run(filters: dict[str, Any] | None = None)
```

Run the FilterRetriever on the given input data.

**Arguments**:

- `filters`: A dictionary with filters to narrow down the search space.
If not specified, the FilterRetriever uses the values provided at initialization.

**Returns**:

A list of retrieved documents.
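
To make the filter semantics concrete, here is a minimal plain-Python sketch of how a single `{"field", "operator", "value"}` comparison could be evaluated, and of run-time filters taking the place of the initialization filters. The helper names `matches` and `retrieve`, the dictionary-based documents, and the small operator table are assumptions for illustration; this is not the Document Store's filtering code.

```python
import operator
from typing import Any, Optional

# Comparison operators supported by this sketch (a subset chosen for illustration).
OPERATORS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt, "<": operator.lt}


def matches(meta: dict[str, Any], flt: dict[str, Any]) -> bool:
    """Evaluate one comparison filter against a metadata dictionary."""
    if flt["field"] not in meta:
        return False
    return OPERATORS[flt["operator"]](meta[flt["field"]], flt["value"])


def retrieve(
    docs: list[dict[str, Any]],
    init_filters: Optional[dict[str, Any]] = None,
    run_filters: Optional[dict[str, Any]] = None,
) -> list[dict[str, Any]]:
    """Run-time filters, when given, are used instead of the initialization filters."""
    active = run_filters if run_filters is not None else init_filters
    if not active:
        return list(docs)  # no filter: every document qualifies
    return [doc for doc in docs if matches(doc["meta"], active)]


docs = [
    {"content": "Python is a popular programming language", "meta": {"lang": "en"}},
    {"content": "python ist eine beliebte Programmiersprache", "meta": {"lang": "de"}},
]
EN = {"field": "lang", "operator": "==", "value": "en"}
DE = {"field": "lang", "operator": "==", "value": "de"}

print(retrieve(docs, init_filters=EN))                  # the English document
print(retrieve(docs, init_filters=EN, run_filters=DE))  # the German document
```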

<a id="filter_retriever.FilterRetriever.run_async"></a>

#### FilterRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(filters: dict[str, Any] | None = None)
```

Asynchronously run the FilterRetriever on the given input data.

**Arguments**:

- `filters`: A dictionary with filters to narrow down the search space.
If not specified, the FilterRetriever uses the values provided at initialization.

**Returns**:

A list of retrieved documents.

<a id="in_memory/bm25_retriever"></a>

## Module in\_memory/bm25\_retriever

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever"></a>

### InMemoryBM25Retriever

Retrieves documents that are most similar to the query using a keyword-based algorithm (BM25).

Use this retriever with the InMemoryDocumentStore.
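
For intuition about keyword-based scoring, the sketch below implements the textbook Okapi BM25 formula in plain Python. It is not the InMemoryDocumentStore implementation: the whitespace tokenizer, the function name `bm25_scores`, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length via b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


docs = [
    "Python is a popular programming language",
    "python ist eine beliebte Programmiersprache",
]
scores = bm25_scores("Programmiersprache", docs)
print(scores)
```

Only the second document contains the query term, so it is the only one with a non-zero score.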

### Usage example

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store)

result = retriever.run(query="Programmiersprache")

print(result["documents"])
```

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.__init__"></a>

#### InMemoryBM25Retriever.\_\_init\_\_

```python
def __init__(document_store: InMemoryDocumentStore,
             filters: dict[str, Any] | None = None,
             top_k: int = 10,
             scale_score: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)
```

Create the InMemoryBM25Retriever component.

**Arguments**:

- `document_store`: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
- `filters`: A dictionary with filters to narrow down the retriever's search space in the document store.
- `top_k`: The maximum number of documents to retrieve.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `filter_policy`: The filter policy to apply during retrieval.
Filter policy determines how filters are applied when retrieving documents. You can choose:
- `REPLACE` (default): Overrides the initialization filters with the filters specified at runtime.
Use this policy to dynamically change filtering for specific queries.
- `MERGE`: Combines runtime filters with initialization filters to narrow down the search.

**Raises**:

- `ValueError`: If the specified `top_k` is not > 0.
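
The difference between the two filter policies can be sketched as a small function that decides which filters are actually applied. The `Policy` enum and `effective_filters` helper are hypothetical stand-ins, and combining the two filters under a logical `AND` is an assumption about what "combines" means for `MERGE`.

```python
from enum import Enum
from typing import Any, Optional


class Policy(Enum):
    """Hypothetical stand-in for the filter policy options."""

    REPLACE = "replace"
    MERGE = "merge"


def effective_filters(
    policy: Policy,
    init_filters: Optional[dict[str, Any]],
    runtime_filters: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Return the filters that a retrieval call would apply under the given policy."""
    if runtime_filters is None:
        return init_filters  # nothing passed at runtime: fall back to initialization filters
    if policy is Policy.REPLACE or init_filters is None:
        return runtime_filters  # runtime filters take over completely
    # MERGE: both conditions must hold (assumed AND combination).
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init = {"field": "lang", "operator": "==", "value": "en"}
runtime = {"field": "year", "operator": ">", "value": 2020}

print(effective_filters(Policy.REPLACE, init, runtime))
print(effective_filters(Policy.MERGE, init, runtime))
```

With `REPLACE`, a runtime call can widen or completely change the search space; with `MERGE`, it can only narrow what the initialization filters already allow.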

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.to_dict"></a>

#### InMemoryBM25Retriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.from_dict"></a>

#### InMemoryBM25Retriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "InMemoryBM25Retriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.run"></a>

#### InMemoryBM25Retriever.run

```python
@component.output_types(documents=list[Document])
def run(query: str,
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None)
```

Run the InMemoryBM25Retriever on the given input data.

**Arguments**:

- `query`: The query string for the Retriever.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.run_async"></a>

#### InMemoryBM25Retriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(query: str,
                    filters: dict[str, Any] | None = None,
                    top_k: int | None = None,
                    scale_score: bool | None = None)
```

Asynchronously run the InMemoryBM25Retriever on the given input data.

**Arguments**:

- `query`: The query string for the Retriever.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="in_memory/embedding_retriever"></a>

## Module in\_memory/embedding\_retriever

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever"></a>

### InMemoryEmbeddingRetriever

Retrieves documents that are most semantically similar to the query.

Use this retriever with the InMemoryDocumentStore.

When using this retriever, make sure it has query and document embeddings available.
In indexing pipelines, use a DocumentEmbedder to embed documents.
In query pipelines, use a TextEmbedder to embed queries and send them to the retriever.
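
The idea behind embedding retrieval can be sketched without any model: compare the query vector with each stored document vector and keep the `top_k` closest. The three-dimensional toy vectors, the helper names, and the choice of cosine similarity are assumptions for illustration; the similarity function the document store actually uses may differ.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_embedding(query_embedding: list[float], docs: list[dict], top_k: int = 10) -> list[tuple[float, str]]:
    """Return (score, content) pairs for the top_k documents most similar to the query."""
    if top_k <= 0:
        raise ValueError("top_k must be > 0")
    scored = [(cosine(query_embedding, doc["embedding"]), doc["content"]) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Toy embeddings; real ones come from a document embedder and have many more dimensions.
docs = [
    {"content": "doc about python", "embedding": [1.0, 0.0, 0.0]},
    {"content": "doc about energy", "embedding": [0.0, 1.0, 0.0]},
]
ranked = top_k_by_embedding([0.9, 0.1, 0.0], docs, top_k=1)
print(ranked[0][1])  # doc about python
```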

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]
doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(docs)["documents"]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs_with_embeddings)
retriever = InMemoryEmbeddingRetriever(doc_store)

query = "Programmiersprache"
text_embedder = SentenceTransformersTextEmbedder()
text_embedder.warm_up()
query_embedding = text_embedder.run(query)["embedding"]

result = retriever.run(query_embedding=query_embedding)

print(result["documents"])
```

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.__init__"></a>

#### InMemoryEmbeddingRetriever.\_\_init\_\_

```python
def __init__(document_store: InMemoryDocumentStore,
             filters: dict[str, Any] | None = None,
             top_k: int = 10,
             scale_score: bool = False,
             return_embedding: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)
```

Create the InMemoryEmbeddingRetriever component.

**Arguments**:

- `document_store`: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
- `filters`: A dictionary with filters to narrow down the retriever's search space in the document store.
- `top_k`: The maximum number of documents to retrieve.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.
- `filter_policy`: The filter policy to apply during retrieval.
Filter policy determines how filters are applied when retrieving documents. You can choose:
- `REPLACE` (default): Overrides the initialization filters with the filters specified at runtime.
Use this policy to dynamically change filtering for specific queries.
- `MERGE`: Combines runtime filters with initialization filters to narrow down the search.

**Raises**:

- `ValueError`: If the specified `top_k` is not > 0.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.to_dict"></a>

#### InMemoryEmbeddingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.from_dict"></a>

#### InMemoryEmbeddingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "InMemoryEmbeddingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.run"></a>

#### InMemoryEmbeddingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(query_embedding: list[float],
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None,
        return_embedding: bool | None = None)
```

Run the InMemoryEmbeddingRetriever on the given input data.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.run_async"></a>

#### InMemoryEmbeddingRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(query_embedding: list[float],
                    filters: dict[str, Any] | None = None,
                    top_k: int | None = None,
                    scale_score: bool | None = None,
                    return_embedding: bool | None = None)
```

Asynchronously run the InMemoryEmbeddingRetriever on the given input data.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="multi_query_embedding_retriever"></a>

## Module multi\_query\_embedding\_retriever

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever"></a>

### MultiQueryEmbeddingRetriever

A component that retrieves documents using multiple queries in parallel with an embedding-based retriever.

This component takes a list of text queries, converts them to embeddings using a query embedder,
and then uses an embedding-based retriever to find relevant documents for each query in parallel.
The results are combined and sorted by relevance score.

### Usage example

```python
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers import MultiQueryEmbeddingRetriever

documents = [
    Document(content="Renewable energy is energy that is collected from renewable resources."),
    Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
    Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
    Document(content="Geothermal energy is heat that comes from the sub-surface of the earth."),
    Document(content="Biomass energy is produced from organic materials, such as plant and animal waste."),
    Document(content="Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources."),
]

# Populate the document store
doc_store = InMemoryDocumentStore()
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()
doc_writer = DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.SKIP)
documents = doc_embedder.run(documents)["documents"]
doc_writer.run(documents=documents)

# Run the multi-query retriever
in_memory_retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=1)
query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

multi_query_retriever = MultiQueryEmbeddingRetriever(
    retriever=in_memory_retriever,
    query_embedder=query_embedder,
    max_workers=3
)

queries = ["Geothermal energy", "natural gas", "turbines"]
result = multi_query_retriever.run(queries=queries)
for doc in result["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
# >> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 0.8509603046266574
# >> Content: Renewable energy is energy that is collected from renewable resources., Score: 0.42763211298893034
# >> Content: Solar energy is a type of green energy that is harnessed from the sun., Score: 0.40077417016494354
# >> Content: Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources., Score: 0.3774863680
# >> Content: Wind energy is another type of green energy that is generated by wind turbines., Score: 0.30914239725622
# >> Content: Biomass energy is produced from organic materials, such as plant and animal waste., Score: 0.25173074243
```

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.__init__"></a>

#### MultiQueryEmbeddingRetriever.\_\_init\_\_

```python
def __init__(*,
             retriever: EmbeddingRetriever,
             query_embedder: TextEmbedder,
             max_workers: int = 3) -> None
```

Initialize MultiQueryEmbeddingRetriever.

**Arguments**:

- `retriever`: The embedding-based retriever to use for document retrieval.
- `query_embedder`: The query embedder to convert text queries to embeddings.
- `max_workers`: Maximum number of worker threads for parallel processing.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.warm_up"></a>

#### MultiQueryEmbeddingRetriever.warm\_up

```python
def warm_up() -> None
```

Warms up the query embedder and the retriever, if either has a `warm_up` method.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.run"></a>

#### MultiQueryEmbeddingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(
    queries: list[str],
    retriever_kwargs: dict[str, Any] | None = None
) -> dict[str, list[Document]]
```

Retrieve documents using multiple queries in parallel.

**Arguments**:

- `queries`: List of text queries to process.
- `retriever_kwargs`: Optional dictionary of arguments to pass to the retriever's run method.

**Returns**:

A dictionary containing:
- `documents`: List of retrieved documents sorted by relevance score.
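
The embed, retrieve in parallel, combine, and sort flow can be sketched with the standard library's thread pool. Everything here is a toy stand-in: `embed` returns a set of words instead of a vector, `retrieve` scores by word overlap, and keeping one copy per document id is an assumption that the text above does not state.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical corpus: document id -> content.
CORPUS = {
    "d1": "Geothermal energy is heat from the earth",
    "d2": "Wind energy is generated by wind turbines",
    "d3": "Fossil fuels such as natural gas are non-renewable",
}


def embed(text: str) -> set[str]:
    """Toy 'embedding': the set of lowercase words (stand-in for a text embedder)."""
    return set(text.lower().split())


def retrieve(query_embedding: set[str], top_k: int = 1) -> list[dict]:
    """Toy retriever: the score is the number of words shared with the query."""
    scored = [
        {"id": doc_id, "content": content, "score": len(query_embedding & embed(content))}
        for doc_id, content in CORPUS.items()
    ]
    return sorted(scored, key=lambda doc: doc["score"], reverse=True)[:top_k]


def multi_query_retrieve(queries: list[str], max_workers: int = 3) -> list[dict]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each query is embedded and retrieved in a worker thread.
        per_query = list(pool.map(lambda query: retrieve(embed(query)), queries))

    best: dict[str, dict] = {}
    for docs in per_query:
        for doc in docs:
            # Assumption: keep one copy per document id, preferring the higher score.
            if doc["id"] not in best or doc["score"] > best[doc["id"]]["score"]:
                best[doc["id"]] = doc
    # Combine the per-query results and sort by relevance score, highest first.
    return sorted(best.values(), key=lambda doc: doc["score"], reverse=True)


results = multi_query_retrieve(["geothermal energy", "natural gas", "wind turbines energy"])
for doc in results:
    print(doc["id"], doc["score"])
```

`pool.map` returns results in the order of the input queries, so the combination step does not depend on which thread finishes first.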

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.to_dict"></a>

#### MultiQueryEmbeddingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

A dictionary representing the serialized component.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.from_dict"></a>

#### MultiQueryEmbeddingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MultiQueryEmbeddingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="multi_query_text_retriever"></a>

## Module multi\_query\_text\_retriever

<a id="multi_query_text_retriever.MultiQueryTextRetriever"></a>

### MultiQueryTextRetriever

A component that retrieves documents using multiple queries in parallel with a text-based retriever.

This component takes a list of text queries and uses a text-based retriever to find relevant documents for each
query in parallel, using a thread pool to manage concurrent execution. The results are combined and sorted by
relevance score.

You can use this component together with the QueryExpander component to enhance the retrieval process.

### Usage example

```python
from haystack import Document
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.components.query import QueryExpander
from haystack.components.retrievers.multi_query_text_retriever import MultiQueryTextRetriever

documents = [
    Document(content="Renewable energy is energy that is collected from renewable resources."),
    Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
    Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
    Document(content="Hydropower is a form of renewable energy using the flow of water to generate electricity."),
    Document(content="Geothermal energy is heat that comes from the sub-surface of the earth.")
]

document_store = InMemoryDocumentStore()
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
doc_writer.run(documents=documents)

in_memory_retriever = InMemoryBM25Retriever(document_store=document_store, top_k=1)
multiquery_retriever = MultiQueryTextRetriever(retriever=in_memory_retriever)
results = multiquery_retriever.run(queries=["renewable energy?", "Geothermal", "Hydropower"])
for doc in results["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
# >>
# >> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 1.6474448833731097
# >> Content: Hydropower is a form of renewable energy using the flow of water to generate electricity., Score: 1.615
# >> Content: Renewable energy is energy that is collected from renewable resources., Score: 1.5255309812344944
```
 813  
 814  <a id="multi_query_text_retriever.MultiQueryTextRetriever.__init__"></a>
 815  
 816  #### MultiQueryTextRetriever.\_\_init\_\_
 817  
 818  ```python
 819  def __init__(*, retriever: TextRetriever, max_workers: int = 3) -> None
 820  ```
 821  
 822  Initialize MultiQueryTextRetriever.
 823  
 824  **Arguments**:
 825  
 826  - `retriever`: The text-based retriever to use for document retrieval.
 827  - `max_workers`: Maximum number of worker threads for parallel processing. Default is 3.
 828  
 829  <a id="multi_query_text_retriever.MultiQueryTextRetriever.warm_up"></a>
 830  
 831  #### MultiQueryTextRetriever.warm\_up
 832  
 833  ```python
 834  def warm_up() -> None
 835  ```
 836  
 837  Warm up the retriever if it has a warm_up method.
 838  
 839  <a id="multi_query_text_retriever.MultiQueryTextRetriever.run"></a>
 840  
 841  #### MultiQueryTextRetriever.run
 842  
 843  ```python
 844  @component.output_types(documents=list[Document])
 845  def run(
 846      queries: list[str],
 847      retriever_kwargs: dict[str, Any] | None = None
 848  ) -> dict[str, list[Document]]
 849  ```
 850  
 851  Retrieve documents using multiple queries in parallel.
 852  
 853  **Arguments**:
 854  
 855  - `queries`: List of text queries to process.
 856  - `retriever_kwargs`: Optional dictionary of arguments to pass to the retriever's run method.
 857  
 858  **Returns**:
 859  
A dictionary containing:
- `documents`: List of retrieved documents sorted by relevance score.
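
The parallel fan-out described above can be pictured with a minimal, library-free sketch. It is not Haystack's implementation: `toy_retrieve`, `multi_query_retrieve`, `CORPUS`, and the `(text, score)` tuples are hypothetical stand-ins, and keeping only the best-scoring copy of each text is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical corpus; a real retriever would query a document store instead.
CORPUS = [
    "Solar energy is harnessed from the sun.",
    "Wind energy is generated by wind turbines.",
    "Geothermal energy is heat from the earth.",
]


def toy_retrieve(query: str) -> list[tuple[str, float]]:
    """Score each text by how many query words it contains (illustrative only)."""
    words = query.lower().split()
    hits = []
    for text in CORPUS:
        score = float(sum(word in text.lower() for word in words))
        if score > 0:
            hits.append((text, score))
    return hits


def multi_query_retrieve(queries: list[str], max_workers: int = 3) -> list[tuple[str, float]]:
    """Run one retrieval per query in a thread pool, then merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_query = list(pool.map(toy_retrieve, queries))

    # Assumption: keep the best-scoring copy of each text, then sort by score.
    best: dict[str, float] = {}
    for hits in per_query:
        for text, score in hits:
            best[text] = max(score, best.get(text, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


results = multi_query_retrieve(["solar energy", "wind turbines", "energy"])
```

Here `max_workers` only bounds how many queries run at the same time; the merged list is the same regardless of the order in which the worker threads finish.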
 862  
 863  <a id="multi_query_text_retriever.MultiQueryTextRetriever.to_dict"></a>
 864  
 865  #### MultiQueryTextRetriever.to\_dict
 866  
 867  ```python
 868  def to_dict() -> dict[str, Any]
 869  ```
 870  
 871  Serializes the component to a dictionary.
 872  
 873  **Returns**:
 874  
 875  The serialized component as a dictionary.
 876  
 877  <a id="multi_query_text_retriever.MultiQueryTextRetriever.from_dict"></a>
 878  
 879  #### MultiQueryTextRetriever.from\_dict
 880  
 881  ```python
 882  @classmethod
 883  def from_dict(cls, data: dict[str, Any]) -> "MultiQueryTextRetriever"
 884  ```
 885  
 886  Deserializes the component from a dictionary.
 887  
 888  **Arguments**:
 889  
 890  - `data`: The dictionary to deserialize from.
 891  
 892  **Returns**:
 893  
 894  The deserialized component.
 895  
 896  <a id="sentence_window_retriever"></a>
 897  
 898  ## Module sentence\_window\_retriever
 899  
 900  <a id="sentence_window_retriever.SentenceWindowRetriever"></a>
 901  
 902  ### SentenceWindowRetriever
 903  
 904  Retrieves neighboring documents from a DocumentStore to provide context for query results.
 905  
 906  This component is intended to be used after a Retriever (e.g., BM25Retriever, EmbeddingRetriever).
 907  It enhances retrieved results by fetching adjacent document chunks to give
 908  additional context for the user.
 909  
 910  The documents must include metadata indicating their origin and position:
 911  - `source_id` is used to group sentence chunks belonging to the same original document.
 912  - `split_id` represents the position/order of the chunk within the document.
 913  
 914  The number of adjacent documents to include on each side of the retrieved document can be configured using the
 915  `window_size` parameter. You can also specify which metadata fields to use for source and split ID
 916  via `source_id_meta_field` and `split_id_meta_field`.
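
As a rough illustration of how `source_id`, `split_id`, and `window_size` interact, the sketch below selects the neighbors of a matched chunk from plain dictionaries. The `select_window` helper and the dictionary layout are hypothetical and only approximate the idea; the component itself queries a Document Store.

```python
def select_window(chunks, hit, window_size, source_field="source_id", split_field="split_id"):
    """Return chunks from the same source whose split ID lies within the window around `hit`."""
    low = hit[split_field] - window_size
    high = hit[split_field] + window_size
    neighbors = [
        chunk
        for chunk in chunks
        if chunk[source_field] == hit[source_field] and low <= chunk[split_field] <= high
    ]
    # Return the window in reading order.
    return sorted(neighbors, key=lambda chunk: chunk[split_field])


# Six chunks from one source plus one chunk from an unrelated source.
chunks = [{"source_id": "doc-a", "split_id": i, "content": f"chunk {i}"} for i in range(6)]
chunks.append({"source_id": "doc-b", "split_id": 2, "content": "other document"})

# The retriever matched chunk 3 of "doc-a"; fetch 2 neighbors on each side.
window = select_window(chunks, hit=chunks[3], window_size=2)
window_ids = [chunk["split_id"] for chunk in window]
```

With `window_size=2`, the window around split 3 covers splits 1 through 5 of the same source, and the chunk from `doc-b` is excluded even though its split ID falls in range.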
 917  
 918  The SentenceWindowRetriever is compatible with the following DocumentStores:
 919  - [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
 920  - [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
 921  - [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
 922  - [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
 923  - [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store)
 924  - [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
 925  
 926  ### Usage example
 927  
```python
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import SentenceWindowRetriever
from haystack.components.preprocessors import DocumentSplitter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Split the text into overlapping 10-word chunks and index them
splitter = DocumentSplitter(split_length=10, split_overlap=5, split_by="word")
text = (
    "This is a text with some words. There is a second sentence. And there is also a third sentence. "
    "It also contains a fourth sentence. And a fifth sentence. And a sixth sentence. And a seventh sentence"
)
doc = Document(content=text)
docs = splitter.run([doc])
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs["documents"])

# Retrieve the best-matching chunk, then expand it with 2 neighboring chunks on each side
rag = Pipeline()
rag.add_component("bm25_retriever", InMemoryBM25Retriever(doc_store, top_k=1))
rag.add_component("sentence_window_retriever", SentenceWindowRetriever(document_store=doc_store, window_size=2))
rag.connect("bm25_retriever", "sentence_window_retriever")

rag.run({"bm25_retriever": {"query": "third"}})

# >> {'sentence_window_retriever': {'context_windows': ['some words. There is a second sentence.
# >> And there is also a third sentence. It also contains a fourth sentence. And a fifth sentence. And a sixth
# >> sentence. And a'], 'context_documents': [[Document(id=..., content: 'some words. There is a second sentence.
# >> And there is ', meta: {'source_id': '...', 'page_number': 1, 'split_id': 1, 'split_idx_start': 20,
# >> '_split_overlap': [{'doc_id': '...', 'range': (20, 43)}, {'doc_id': '...', 'range': (0, 30)}]}),
# >> Document(id=..., content: 'second sentence. And there is also a third sentence. It ',
# >> meta: {'source_id': '74ea87deb38012873cf8c07e...f19d01a26a098447113e1d7b83efd30c02987114', 'page_number': 1,
# >> 'split_id': 2, 'split_idx_start': 43, '_split_overlap': [{'doc_id': '...', 'range': (23, 53)}, {'doc_id': '.',
# >> 'range': (0, 26)}]}), Document(id=..., content: 'also a third sentence. It also contains a fourth sentence. ',
# >> meta: {'source_id': '...', 'page_number': 1, 'split_id': 3, 'split_idx_start': 73, '_split_overlap':
# >> [{'doc_id': '...', 'range': (30, 56)}, {'doc_id': '...', 'range': (0, 33)}]}), Document(id=..., content:
# >> 'also contains a fourth sentence. And a fifth sentence. And ', meta: {'source_id': '...', 'page_number': 1,
# >> 'split_id': 4, 'split_idx_start': 99, '_split_overlap': [{'doc_id': '...', 'range': (26, 59)},
# >> {'doc_id': '...', 'range': (0, 26)}]}), Document(id=..., content: 'And a fifth sentence. And a sixth sentence.
# >> And a ', meta: {'source_id': '...', 'page_number': 1, 'split_id': 5, 'split_idx_start': 132,
# >> '_split_overlap': [{'doc_id': '...', 'range': (33, 59)}, {'doc_id': '...', 'range': (0, 24)}]})]]}}
```
 970  
 971  <a id="sentence_window_retriever.SentenceWindowRetriever.__init__"></a>
 972  
 973  #### SentenceWindowRetriever.\_\_init\_\_
 974  
 975  ```python
 976  def __init__(document_store: DocumentStore,
 977               window_size: int = 3,
 978               *,
 979               source_id_meta_field: str | list[str] = "source_id",
 980               split_id_meta_field: str = "split_id",
 981               raise_on_missing_meta_fields: bool = True)
 982  ```
 983  
 984  Creates a new SentenceWindowRetriever component.
 985  
 986  **Arguments**:
 987  
 988  - `document_store`: The Document Store to retrieve the surrounding documents from.
- `window_size`: The number of documents to retrieve before and after the relevant one.
For example, `window_size=2` fetches 2 preceding and 2 following documents.
 991  - `source_id_meta_field`: The metadata field that contains the source ID of the document.
 992  This can be a single field or a list of fields. If multiple fields are provided, the retriever will
 993  consider the document as part of the same source if all the fields match.
 994  - `split_id_meta_field`: The metadata field that contains the split ID of the document.
 995  - `raise_on_missing_meta_fields`: If True, raises an error if the documents do not contain the required
 996  metadata fields. If False, it will skip retrieving the context for documents that are missing
 997  the required metadata fields, but will still include the original document in the results.
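
The two behaviors described for `source_id_meta_field` and `raise_on_missing_meta_fields` can be sketched without the library. The helpers `as_field_list`, `source_key`, and `has_required_meta` below are hypothetical names introduced for illustration and are not part of the Haystack API.

```python
def as_field_list(source_fields):
    """Normalize a single field name or a list of field names to a list."""
    return [source_fields] if isinstance(source_fields, str) else list(source_fields)


def source_key(meta, source_fields):
    """Two chunks share a source only if every listed field matches."""
    return tuple(meta[field] for field in as_field_list(source_fields))


def has_required_meta(meta, source_fields, split_field="split_id", raise_on_missing=True):
    """Check that all source fields and the split field are present in the metadata."""
    missing = [f for f in as_field_list(source_fields) + [split_field] if f not in meta]
    if missing and raise_on_missing:
        raise ValueError(f"Missing metadata fields: {missing}")
    return not missing


a = {"file": "report.pdf", "page": 1, "split_id": 0}
b = {"file": "report.pdf", "page": 1, "split_id": 1}
c = {"file": "report.pdf", "page": 2, "split_id": 0}

same_source = source_key(a, ["file", "page"]) == source_key(b, ["file", "page"])
different_source = source_key(a, ["file", "page"]) != source_key(c, ["file", "page"])
# With raise_on_missing=False the check reports the problem instead of raising.
skipped = not has_required_meta({"file": "report.pdf"}, "file", raise_on_missing=False)
```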
 998  
 999  <a id="sentence_window_retriever.SentenceWindowRetriever.merge_documents_text"></a>
1000  
1001  #### SentenceWindowRetriever.merge\_documents\_text
1002  
1003  ```python
1004  @staticmethod
1005  def merge_documents_text(documents: list[Document]) -> str
1006  ```
1007  
Merge the text of a list of documents into a single string.

This function concatenates the textual content of a list of documents into a single string, eliminating any
overlapping content.
1012  
1013  **Arguments**:
1014  
1015  - `documents`: List of Documents to merge.
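
One way to remove overlap when concatenating chunks is to track how far into the original text the merged string already reaches. The sketch below assumes each chunk records its character offset in a `split_idx_start` field; `merge_chunks` and the plain dictionaries are hypothetical, and the library's own logic may differ.

```python
def merge_chunks(chunks):
    """Concatenate chunk texts in order, skipping characters that were already emitted."""
    merged = ""
    covered_until = 0  # offset in the original text up to which content has been emitted
    for chunk in sorted(chunks, key=lambda c: c["split_idx_start"]):
        start = chunk["split_idx_start"]
        skip = max(covered_until - start, 0)  # length of the overlap with earlier chunks
        merged += chunk["content"][skip:]
        covered_until = max(covered_until, start + len(chunk["content"]))
    return merged


text = "The quick brown fox jumps over the lazy dog."
chunks = [
    {"content": text[30:], "split_idx_start": 30},
    {"content": text[0:20], "split_idx_start": 0},
    {"content": text[10:35], "split_idx_start": 10},
]
merged = merge_chunks(chunks)
```

Because the chunks are sorted by offset first, the input order does not matter, and here the merged string reproduces the original text exactly.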
1016  
1017  <a id="sentence_window_retriever.SentenceWindowRetriever.to_dict"></a>
1018  
1019  #### SentenceWindowRetriever.to\_dict
1020  
1021  ```python
1022  def to_dict() -> dict[str, Any]
1023  ```
1024  
1025  Serializes the component to a dictionary.
1026  
1027  **Returns**:
1028  
1029  Dictionary with serialized data.
1030  
1031  <a id="sentence_window_retriever.SentenceWindowRetriever.from_dict"></a>
1032  
1033  #### SentenceWindowRetriever.from\_dict
1034  
1035  ```python
1036  @classmethod
1037  def from_dict(cls, data: dict[str, Any]) -> "SentenceWindowRetriever"
1038  ```
1039  
1040  Deserializes the component from a dictionary.
1041  
1042  **Returns**:
1043  
1044  Deserialized component.
1045  
1046  <a id="sentence_window_retriever.SentenceWindowRetriever.run"></a>
1047  
1048  #### SentenceWindowRetriever.run
1049  
1050  ```python
1051  @component.output_types(context_windows=list[str],
1052                          context_documents=list[Document])
1053  def run(retrieved_documents: list[Document], window_size: int | None = None)
1054  ```
1055  
Gets the surrounding documents from the document store, based on the `source_id` and `split_id` metadata of each retrieved document.
1057  
1058  Implements the logic behind the sentence-window technique, retrieving the surrounding documents of a given
1059  document from the document store.
1060  
1061  **Arguments**:
1062  
1063  - `retrieved_documents`: List of retrieved documents from the previous retriever.
- `window_size`: The number of documents to retrieve before and after the relevant one. This overrides
the `window_size` parameter set in the constructor.
1066  
1067  **Returns**:
1068  
1069  A dictionary with the following keys:
1070  - `context_windows`: A list of strings, where each string represents the concatenated text from the
1071                       context window of the corresponding document in `retrieved_documents`.
- `context_documents`: A list of `Document` objects containing the retrieved documents plus the context
                      documents surrounding them. The documents are sorted by the `split_idx_start`
                      meta field.
1075  
1076  <a id="sentence_window_retriever.SentenceWindowRetriever.run_async"></a>
1077  
1078  #### SentenceWindowRetriever.run\_async
1079  
1080  ```python
1081  @component.output_types(context_windows=list[str],
1082                          context_documents=list[Document])
1083  async def run_async(retrieved_documents: list[Document],
1084                      window_size: int | None = None)
1085  ```
1086  
Gets the surrounding documents from the document store, based on the `source_id` and `split_id` metadata of each retrieved document.
1088  
1089  Implements the logic behind the sentence-window technique, retrieving the surrounding documents of a given
1090  document from the document store.
1091  
1092  **Arguments**:
1093  
1094  - `retrieved_documents`: List of retrieved documents from the previous retriever.
- `window_size`: The number of documents to retrieve before and after the relevant one. This overrides
the `window_size` parameter set in the constructor.
1097  
1098  **Returns**:
1099  
1100  A dictionary with the following keys:
1101  - `context_windows`: A list of strings, where each string represents the concatenated text from the
1102                       context window of the corresponding document in `retrieved_documents`.
- `context_documents`: A list of `Document` objects containing the retrieved documents plus the context
                      documents surrounding them. The documents are sorted by the `split_idx_start`
                      meta field.
1106