---
title: "Retrievers"
id: retrievers-api
description: "Sweeps through a Document Store and returns a set of candidate Documents that are relevant to the query."
slug: "/retrievers-api"
---

<a id="auto_merging_retriever"></a>

## Module auto\_merging\_retriever

<a id="auto_merging_retriever.AutoMergingRetriever"></a>

### AutoMergingRetriever

A retriever that returns the parent documents of matched leaf documents, based on a threshold setting.

The AutoMergingRetriever assumes you have a hierarchical tree structure of documents, where the leaf nodes
are indexed in a document store. See the HierarchicalDocumentSplitter for more information on how to create
such a structure. During retrieval, if the fraction of matched leaf documents below the same parent is
higher than a defined threshold, the retriever returns the parent document instead of the individual leaf
documents.

The rationale is that when a paragraph is split into multiple chunks represented as leaf documents, and several
of those chunks match a given query, the whole paragraph might be more informative than the individual
chunks alone.

Currently, the AutoMergingRetriever can only be used with the following DocumentStores:
- [AstraDB](https://haystack.deepset.ai/integrations/astradb)
- [Elasticsearch](https://haystack.deepset.ai/docs/latest/documentstore/elasticsearch)
- [OpenSearch](https://haystack.deepset.ai/docs/latest/documentstore/opensearch)
- [PGVector](https://haystack.deepset.ai/docs/latest/documentstore/pgvector)
- [Qdrant](https://haystack.deepset.ai/docs/latest/documentstore/qdrant)

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# create a hierarchical document structure with 3 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]

# store the root document and the level-1 parent documents, then initialize the retriever
doc_store_parents = InMemoryDocumentStore()
for doc in docs:
    if doc.meta["__children_ids"] and doc.meta["__level"] in [0, 1]:
        doc_store_parents.write_documents([doc])

retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)

# assume we retrieved 2 leaf docs from the same parent: the parent document should be returned,
# since it has 3 children and we retrieved 2 of them (2/3 ≈ 0.67 > threshold=0.5)
leaf_docs = [doc for doc in docs if not doc.meta["__children_ids"]]
retrieved_docs = retriever.run(leaf_docs[4:6])
print(retrieved_docs["documents"])
# [Document(id=538..),
# content: 'warm glow over the trees. Birds began to sing.',
# meta: {'block_size': 10, 'parent_id': '835..', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
# 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]
```

<a id="auto_merging_retriever.AutoMergingRetriever.__init__"></a>

#### AutoMergingRetriever.\_\_init\_\_

```python
def __init__(document_store: DocumentStore, threshold: float = 0.5)
```

Initialize the AutoMergingRetriever.

**Arguments**:

- `document_store`: DocumentStore from which to retrieve the parent documents.
- `threshold`: Threshold that decides whether a parent is returned instead of its individual child documents.
The parent is returned when the fraction of its children that were matched is higher than this value.

<a id="auto_merging_retriever.AutoMergingRetriever.to_dict"></a>

#### AutoMergingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="auto_merging_retriever.AutoMergingRetriever.from_dict"></a>

#### AutoMergingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AutoMergingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

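The `to_dict` and `from_dict` methods are meant to round-trip a component through a plain dictionary. The sketch below illustrates that pattern with a hypothetical `ThresholdComponent` class. The `type` and `init_parameters` keys are assumptions made for illustration, not a statement of what `AutoMergingRetriever.to_dict` actually emits.

```python
from typing import Any


class ThresholdComponent:
    """Hypothetical stand-in for a component with a single init parameter."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold

    def to_dict(self) -> dict[str, Any]:
        # Record the class path and the arguments needed to rebuild the object.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"threshold": self.threshold},
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ThresholdComponent":
        # Rebuild the object from the recorded init parameters.
        return cls(**data["init_parameters"])


original = ThresholdComponent(threshold=0.75)
serialized = original.to_dict()
restored = ThresholdComponent.from_dict(serialized)
```

Because the dictionary holds only plain values, it can be written to YAML or JSON and loaded again later.
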
<a id="auto_merging_retriever.AutoMergingRetriever.run"></a>

#### AutoMergingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Run the AutoMergingRetriever.

Recursively groups documents by their parents and merges them if they meet the threshold,
continuing up the hierarchy until no more merges are possible.

**Arguments**:

- `documents`: List of leaf documents that were matched by a retriever

**Returns**:

List of documents (could be a mix of different hierarchy levels)

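The recursive merge described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the `TREE` mapping and the `auto_merge` helper are hypothetical, and the strict "greater than" comparison is an assumption based on the "higher than a defined threshold" wording in the class description.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: each node maps to (parent_id, children_ids).
TREE = {
    "root": (None, ["p1", "p2"]),
    "p1": ("root", ["a", "b", "c"]),
    "p2": ("root", ["d", "e", "f"]),
    "a": ("p1", []), "b": ("p1", []), "c": ("p1", []),
    "d": ("p2", []), "e": ("p2", []), "f": ("p2", []),
}


def auto_merge(matched_ids, threshold=0.5):
    """Replace matched siblings with their parent while the matched fraction exceeds the threshold."""
    current = set(matched_ids)
    while True:
        # Group the current nodes by their parent.
        by_parent = defaultdict(set)
        for node_id in current:
            parent_id = TREE[node_id][0]
            if parent_id is not None:
                by_parent[parent_id].add(node_id)

        merged = False
        for parent_id, children in by_parent.items():
            total_children = len(TREE[parent_id][1])
            if len(children) / total_children > threshold:
                # Swap the matched children for their parent.
                current -= children
                current.add(parent_id)
                merged = True

        # Stop once a full pass produces no merge.
        if not merged:
            return sorted(current)


one_level = auto_merge(["a", "b"])            # 2/3 of p1's children matched
two_levels = auto_merge(["a", "b", "d", "e"])  # both parents merge, then the root
mixed = auto_merge(["a", "b", "d"])           # p1 merges, d stays a leaf
```

The last call shows why the returned list can mix hierarchy levels: one group of siblings crosses the threshold while another does not.
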
<a id="auto_merging_retriever.AutoMergingRetriever.run_async"></a>

#### AutoMergingRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(documents: list[Document])
```

Asynchronously run the AutoMergingRetriever.

Recursively groups documents by their parents and merges them if they meet the threshold,
continuing up the hierarchy until no more merges are possible.

**Arguments**:

- `documents`: List of leaf documents that were matched by a retriever

**Returns**:

List of documents (could be a mix of different hierarchy levels)

<a id="filter_retriever"></a>

## Module filter\_retriever

<a id="filter_retriever.FilterRetriever"></a>

### FilterRetriever

Retrieves documents that match the provided filters.

### Usage example

```python
from haystack import Document
from haystack.components.retrievers import FilterRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language", meta={"lang": "en"}),
    Document(content="python ist eine beliebte Programmiersprache", meta={"lang": "de"}),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = FilterRetriever(doc_store, filters={"field": "lang", "operator": "==", "value": "en"})

# if passed in the run method, filters override those provided at initialization
result = retriever.run(filters={"field": "lang", "operator": "==", "value": "de"})

print(result["documents"])
```

<a id="filter_retriever.FilterRetriever.__init__"></a>

#### FilterRetriever.\_\_init\_\_

```python
def __init__(document_store: DocumentStore,
             filters: dict[str, Any] | None = None)
```

Create the FilterRetriever component.

**Arguments**:

- `document_store`: An instance of a Document Store to use with the Retriever.
- `filters`: A dictionary with filters to narrow down the search space.

<a id="filter_retriever.FilterRetriever.to_dict"></a>

#### FilterRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="filter_retriever.FilterRetriever.from_dict"></a>

#### FilterRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "FilterRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="filter_retriever.FilterRetriever.run"></a>

#### FilterRetriever.run

```python
@component.output_types(documents=list[Document])
def run(filters: dict[str, Any] | None = None)
```

Run the FilterRetriever on the given input data.

**Arguments**:

- `filters`: A dictionary with filters to narrow down the search space.
If not specified, the FilterRetriever uses the values provided at initialization.

**Returns**:

A list of retrieved documents.

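To make the filter dictionary and the run-time override concrete, here is a minimal sketch that evaluates a single comparison filter against document metadata. Plain dictionaries stand in for `Document` objects, and the `matches` and `retrieve` helpers are hypothetical; real Document Stores evaluate filters in their own way and support more operators than the four shown.

```python
import operator
from typing import Any

# Comparison operators supported by this sketch (a subset chosen for illustration).
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}


def matches(meta: dict[str, Any], condition: dict[str, Any]) -> bool:
    """Return True if the metadata satisfies a single comparison filter."""
    compare = OPERATORS[condition["operator"]]
    field = condition["field"]
    return field in meta and compare(meta[field], condition["value"])


docs = [
    {"content": "Python is a popular programming language", "meta": {"lang": "en"}},
    {"content": "python ist eine beliebte Programmiersprache", "meta": {"lang": "de"}},
]
init_filters = {"field": "lang", "operator": "==", "value": "en"}
run_filters = {"field": "lang", "operator": "==", "value": "de"}


def retrieve(filters=None):
    # Filters passed at run time take precedence over the initialization filters.
    active = filters or init_filters
    return [doc for doc in docs if matches(doc["meta"], active)]


default_result = retrieve()
override_result = retrieve(run_filters)
```

Calling `retrieve()` without arguments falls back to the initialization filters, which mirrors the behavior described for the `filters` argument above.
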
<a id="filter_retriever.FilterRetriever.run_async"></a>

#### FilterRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(filters: dict[str, Any] | None = None)
```

Asynchronously run the FilterRetriever on the given input data.

**Arguments**:

- `filters`: A dictionary with filters to narrow down the search space.
If not specified, the FilterRetriever uses the values provided at initialization.

**Returns**:

A list of retrieved documents.

<a id="in_memory/bm25_retriever"></a>

## Module in\_memory/bm25\_retriever

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever"></a>

### InMemoryBM25Retriever

Retrieves documents that are most similar to the query using a keyword-based algorithm.

Use this retriever with the InMemoryDocumentStore.

### Usage example

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store)

result = retriever.run(query="Programmiersprache")

print(result["documents"])
```

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.__init__"></a>

#### InMemoryBM25Retriever.\_\_init\_\_

```python
def __init__(document_store: InMemoryDocumentStore,
             filters: dict[str, Any] | None = None,
             top_k: int = 10,
             scale_score: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)
```

Create the InMemoryBM25Retriever component.

**Arguments**:

- `document_store`: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
- `filters`: A dictionary with filters to narrow down the retriever's search space in the document store.
- `top_k`: The maximum number of documents to retrieve.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `filter_policy`: The filter policy to apply during retrieval.
Filter policy determines how filters are applied when retrieving documents. You can choose:
  - `REPLACE` (default): Overrides the initialization filters with the filters specified at runtime.
    Use this policy to dynamically change filtering for specific queries.
  - `MERGE`: Combines runtime filters with initialization filters to narrow down the search.

**Raises**:

- `ValueError`: If the specified `top_k` is not > 0.

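The difference between the two filter policies can be sketched as a small resolution function. This is an illustration under stated assumptions: `apply_policy` is a hypothetical helper, the policy is passed as a plain string rather than a `FilterPolicy` member, and combining filters with a logical `AND` is one plausible reading of "combines ... to narrow down the search", not a confirmed description of the library's merge logic.

```python
from typing import Any, Optional

Filters = dict[str, Any]


def apply_policy(
    policy: str,
    init_filters: Optional[Filters],
    runtime_filters: Optional[Filters],
) -> Optional[Filters]:
    """Resolve which filters to use for one query under the given policy."""
    if runtime_filters is None:
        # Nothing was passed at run time, so the initialization filters apply.
        return init_filters
    if policy == "replace" or init_filters is None:
        # Runtime filters override the initialization filters.
        return runtime_filters
    if policy == "merge":
        # Assumption: merging means both conditions must hold (logical AND).
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    raise ValueError(f"Unknown policy: {policy}")


init_filters = {"field": "lang", "operator": "==", "value": "en"}
runtime_filters = {"field": "year", "operator": ">", "value": 2020}

replaced = apply_policy("replace", init_filters, runtime_filters)
merged = apply_policy("merge", init_filters, runtime_filters)
```

Under this reading, `REPLACE` lets a single query swap in a different filter, while `MERGE` keeps the initialization filter as a fixed constraint.
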
<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.to_dict"></a>

#### InMemoryBM25Retriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.from_dict"></a>

#### InMemoryBM25Retriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "InMemoryBM25Retriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.run"></a>

#### InMemoryBM25Retriever.run

```python
@component.output_types(documents=list[Document])
def run(query: str,
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None) -> dict[str, list[Document]]
```

Run the InMemoryBM25Retriever on the given input data.

**Arguments**:

- `query`: The query string for the Retriever.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

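One common way to map unbounded raw scores into the 0 to 1 range that `scale_score` describes is a logistic (sigmoid) function. The sketch below shows that idea only. Whether the retriever uses this exact function is an assumption, and `SCALING_FACTOR` is a hypothetical constant picked for illustration.

```python
import math

# Hypothetical constant controlling how quickly scores saturate toward 1.
SCALING_FACTOR = 8.0


def scale_score(raw_score: float) -> float:
    """Map an unbounded raw score into the open interval (0, 1) with a logistic function."""
    return 1.0 / (1.0 + math.exp(-raw_score / SCALING_FACTOR))


raw_scores = [0.0, 1.65, 12.0]
scaled = [scale_score(score) for score in raw_scores]
```

The mapping is monotonic, so scaling changes the score values but not the ranking of the documents. With this particular function, non-negative raw scores land between 0.5 and 1.
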
<a id="in_memory/bm25_retriever.InMemoryBM25Retriever.run_async"></a>

#### InMemoryBM25Retriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(
        query: str,
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None) -> dict[str, list[Document]]
```

Asynchronously run the InMemoryBM25Retriever on the given input data.

**Arguments**:

- `query`: The query string for the Retriever.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="in_memory/embedding_retriever"></a>

## Module in\_memory/embedding\_retriever

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever"></a>

### InMemoryEmbeddingRetriever

Retrieves documents that are most semantically similar to the query.

Use this retriever with the InMemoryDocumentStore.

When using this retriever, make sure it has query and document embeddings available.
In indexing pipelines, use a DocumentEmbedder to embed documents.
In query pipelines, use a TextEmbedder to embed queries and send them to the retriever.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]
doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(docs)["documents"]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs_with_embeddings)
retriever = InMemoryEmbeddingRetriever(doc_store)

query = "Programmiersprache"
text_embedder = SentenceTransformersTextEmbedder()
text_embedder.warm_up()
query_embedding = text_embedder.run(query)["embedding"]

result = retriever.run(query_embedding=query_embedding)

print(result["documents"])
```

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.__init__"></a>

#### InMemoryEmbeddingRetriever.\_\_init\_\_

```python
def __init__(document_store: InMemoryDocumentStore,
             filters: dict[str, Any] | None = None,
             top_k: int = 10,
             scale_score: bool = False,
             return_embedding: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)
```

Create the InMemoryEmbeddingRetriever component.

**Arguments**:

- `document_store`: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
- `filters`: A dictionary with filters to narrow down the retriever's search space in the document store.
- `top_k`: The maximum number of documents to retrieve.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.
- `filter_policy`: The filter policy to apply during retrieval.
Filter policy determines how filters are applied when retrieving documents. You can choose:
  - `REPLACE` (default): Overrides the initialization filters with the filters specified at runtime.
    Use this policy to dynamically change filtering for specific queries.
  - `MERGE`: Combines runtime filters with initialization filters to narrow down the search.

**Raises**:

- `ValueError`: If the specified `top_k` is not > 0.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.to_dict"></a>

#### InMemoryEmbeddingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.from_dict"></a>

#### InMemoryEmbeddingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "InMemoryEmbeddingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.run"></a>

#### InMemoryEmbeddingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(query_embedding: list[float],
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None,
        return_embedding: bool | None = None) -> dict[str, list[Document]]
```

Run the InMemoryEmbeddingRetriever on the given input data.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

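Conceptually, embedding retrieval ranks stored document embeddings by their similarity to the query embedding and keeps the best `top_k`. The sketch below illustrates this with cosine similarity on toy 2-dimensional vectors. The `cosine_similarity` and `retrieve` helpers are hypothetical, and the choice of cosine similarity is an assumption, since the similarity function actually used depends on how the document store is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, documents, top_k=10):
    """Return the top_k (score, content) pairs, highest similarity first."""
    if top_k <= 0:
        raise ValueError("top_k must be greater than 0")
    scored = [
        (cosine_similarity(query_embedding, embedding), content)
        for content, embedding in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional embeddings, invented for illustration.
documents = [
    ("doc about programming", [1.0, 0.0]),
    ("doc about cooking", [0.0, 1.0]),
    ("doc about software", [0.8, 0.6]),
]
results = retrieve([1.0, 0.0], documents, top_k=2)
```

Real embeddings have hundreds of dimensions, but the ranking step works the same way.
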
<a id="in_memory/embedding_retriever.InMemoryEmbeddingRetriever.run_async"></a>

#### InMemoryEmbeddingRetriever.run\_async

```python
@component.output_types(documents=list[Document])
async def run_async(
        query_embedding: list[float],
        filters: dict[str, Any] | None = None,
        top_k: int | None = None,
        scale_score: bool | None = None,
        return_embedding: bool | None = None) -> dict[str, list[Document]]
```

Asynchronously run the InMemoryEmbeddingRetriever on the given input data.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space when retrieving documents.
- `top_k`: The maximum number of documents to return.
- `scale_score`: When `True`, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant.
When `False`, uses raw similarity scores.
- `return_embedding`: When `True`, returns the embedding of the retrieved documents.
When `False`, returns just the documents, without their embeddings.

**Raises**:

- `ValueError`: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

**Returns**:

The retrieved documents.

<a id="multi_query_embedding_retriever"></a>

## Module multi\_query\_embedding\_retriever

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever"></a>

### MultiQueryEmbeddingRetriever

A component that retrieves documents using multiple queries in parallel with an embedding-based retriever.

This component takes a list of text queries, converts them to embeddings using a query embedder,
and then uses an embedding-based retriever to find relevant documents for each query in parallel.
The results are combined and sorted by relevance score.

### Usage example

```python
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers import MultiQueryEmbeddingRetriever

documents = [
    Document(content="Renewable energy is energy that is collected from renewable resources."),
    Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
    Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
    Document(content="Geothermal energy is heat that comes from the sub-surface of the earth."),
    Document(content="Biomass energy is produced from organic materials, such as plant and animal waste."),
    Document(content="Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources."),
]

# Populate the document store
doc_store = InMemoryDocumentStore()
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()
doc_writer = DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.SKIP)
documents = doc_embedder.run(documents)["documents"]
doc_writer.run(documents=documents)

# Run the multi-query retriever
in_memory_retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=1)
query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

multi_query_retriever = MultiQueryEmbeddingRetriever(
    retriever=in_memory_retriever,
    query_embedder=query_embedder,
    max_workers=3
)

queries = ["Geothermal energy", "natural gas", "turbines"]
result = multi_query_retriever.run(queries=queries)
for doc in result["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
# >> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 0.8509603046266574
# >> Content: Renewable energy is energy that is collected from renewable resources., Score: 0.42763211298893034
# >> Content: Solar energy is a type of green energy that is harnessed from the sun., Score: 0.40077417016494354
# >> Content: Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources., Score: 0.3774863680
# >> Content: Wind energy is another type of green energy that is generated by wind turbines., Score: 0.30914239725622
# >> Content: Biomass energy is produced from organic materials, such as plant and animal waste., Score: 0.25173074243
```

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.__init__"></a>

#### MultiQueryEmbeddingRetriever.\_\_init\_\_

```python
def __init__(*,
             retriever: EmbeddingRetriever,
             query_embedder: TextEmbedder,
             max_workers: int = 3) -> None
```

Initialize MultiQueryEmbeddingRetriever.

**Arguments**:

- `retriever`: The embedding-based retriever to use for document retrieval.
- `query_embedder`: The query embedder to convert text queries to embeddings.
- `max_workers`: Maximum number of worker threads for parallel processing. Default is 3.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.warm_up"></a>

#### MultiQueryEmbeddingRetriever.warm\_up

```python
def warm_up() -> None
```

Warm up the query embedder and the retriever, if either has a `warm_up` method.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.run"></a>

#### MultiQueryEmbeddingRetriever.run

```python
@component.output_types(documents=list[Document])
def run(
    queries: list[str],
    retriever_kwargs: dict[str, Any] | None = None
) -> dict[str, list[Document]]
```

Retrieve documents using multiple queries in parallel.

**Arguments**:

- `queries`: List of text queries to process.
- `retriever_kwargs`: Optional dictionary of arguments to pass to the retriever's run method.

**Returns**:

A dictionary containing:
- `documents`: List of retrieved documents sorted by relevance score.

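The fan-out and combine steps described above can be sketched with the standard library's thread pool. This is a simplified illustration, not the component's implementation: `toy_retrieve` is a hypothetical word-overlap retriever standing in for the embed-and-retrieve step, and keeping only the best-scoring copy of a document returned by several queries is an assumption about how duplicates are handled.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical corpus: (doc_id, content) pairs.
CORPUS = [
    ("d1", "geothermal energy is heat from the earth"),
    ("d2", "wind turbines generate wind energy"),
    ("d3", "natural gas is a fossil fuel"),
]


def toy_retrieve(query: str) -> list[dict]:
    """Hypothetical retriever: score is the fraction of query words found in the document."""
    words = query.lower().split()
    hits = []
    for doc_id, content in CORPUS:
        score = sum(word in content.split() for word in words) / len(words)
        if score > 0:
            hits.append({"id": doc_id, "content": content, "score": score})
    return hits


def multi_query_retrieve(queries: list[str], max_workers: int = 3) -> list[dict]:
    # Run one retrieval per query in a thread pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_query = list(pool.map(toy_retrieve, queries))

    # Combine the results, keeping the best score seen for each document.
    best: dict[str, dict] = {}
    for hits in per_query:
        for hit in hits:
            if hit["id"] not in best or hit["score"] > best[hit["id"]]["score"]:
                best[hit["id"]] = hit

    # Sort by relevance score, highest first.
    return sorted(best.values(), key=lambda hit: hit["score"], reverse=True)


results = multi_query_retrieve(["geothermal energy", "natural gas"])
```

Threads suit this workload because each retrieval typically spends its time waiting on a model or a store rather than running Python code.
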
<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.to_dict"></a>

#### MultiQueryEmbeddingRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

A dictionary representing the serialized component.

<a id="multi_query_embedding_retriever.MultiQueryEmbeddingRetriever.from_dict"></a>

#### MultiQueryEmbeddingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MultiQueryEmbeddingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="multi_query_text_retriever"></a>

## Module multi\_query\_text\_retriever

<a id="multi_query_text_retriever.MultiQueryTextRetriever"></a>

### MultiQueryTextRetriever

A component that retrieves documents using multiple queries in parallel with a text-based retriever.

This component takes a list of text queries and uses a text-based retriever to find relevant documents for each
query in parallel, using a thread pool to manage concurrent execution. The results are combined and sorted by
relevance score.

You can use this component together with the QueryExpander component to enhance the retrieval process.

### Usage example

```python
from haystack import Document
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.components.query import QueryExpander
from haystack.components.retrievers.multi_query_text_retriever import MultiQueryTextRetriever

documents = [
    Document(content="Renewable energy is energy that is collected from renewable resources."),
    Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
    Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
    Document(content="Hydropower is a form of renewable energy using the flow of water to generate electricity."),
    Document(content="Geothermal energy is heat that comes from the sub-surface of the earth.")
]

document_store = InMemoryDocumentStore()
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
doc_writer.run(documents=documents)

in_memory_retriever = InMemoryBM25Retriever(document_store=document_store, top_k=1)
multiquery_retriever = MultiQueryTextRetriever(retriever=in_memory_retriever)
results = multiquery_retriever.run(queries=["renewable energy?", "Geothermal", "Hydropower"])
for doc in results["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
# >> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 1.6474448833731097
# >> Content: Hydropower is a form of renewable energy using the flow of water to generate electricity., Score: 1.615
# >> Content: Renewable energy is energy that is collected from renewable resources., Score: 1.5255309812344944
```

<a id="multi_query_text_retriever.MultiQueryTextRetriever.__init__"></a>

#### MultiQueryTextRetriever.\_\_init\_\_

```python
def __init__(*, retriever: TextRetriever, max_workers: int = 3) -> None
```

Initialize MultiQueryTextRetriever.

**Arguments**:

- `retriever`: The text-based retriever to use for document retrieval.
- `max_workers`: Maximum number of worker threads for parallel processing. Default is 3.

<a id="multi_query_text_retriever.MultiQueryTextRetriever.warm_up"></a>

#### MultiQueryTextRetriever.warm\_up

```python
def warm_up() -> None
```

Warms up the wrapped retriever if it has a `warm_up` method.

<a id="multi_query_text_retriever.MultiQueryTextRetriever.run"></a>

#### MultiQueryTextRetriever.run

```python
@component.output_types(documents=list[Document])
def run(
    queries: list[str],
    retriever_kwargs: dict[str, Any] | None = None
) -> dict[str, list[Document]]
```

Retrieve documents using multiple queries in parallel.

**Arguments**:

- `queries`: List of text queries to process.
- `retriever_kwargs`: Optional dictionary of arguments to pass to the retriever's run method.

**Returns**:

A dictionary containing:
- `documents`: List of retrieved documents sorted by relevance score.
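
The general pattern described above (fan the queries out to worker threads, merge the per-query results, and sort by score) can be sketched with the standard library alone. This is an illustrative sketch, not Haystack's implementation: `Hit`, `CORPUS`, `toy_retrieve`, and `multi_query_retrieve` are hypothetical names, the scoring is a toy term-match ratio, and keeping the highest-scoring copy of a duplicate is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for a retrieved document."""
    doc_id: str
    content: str
    score: float


# Toy corpus; a real retriever would query a document store instead.
CORPUS = {
    "d1": "Renewable energy is collected from renewable resources.",
    "d2": "Geothermal energy is heat from the sub-surface of the earth.",
    "d3": "Hydropower uses the flow of water to generate electricity.",
}


def toy_retrieve(query: str) -> list[Hit]:
    """Score each document by the fraction of query terms it contains."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in CORPUS.items():
        lowered = text.lower()
        matched = sum(1 for term in terms if term in lowered)
        if matched:
            hits.append(Hit(doc_id, text, matched / len(terms)))
    return hits


def multi_query_retrieve(queries: list[str], max_workers: int = 3) -> list[Hit]:
    """Run every query in a thread pool, deduplicate, and sort by score."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_query = list(pool.map(toy_retrieve, queries))

    best: dict[str, Hit] = {}
    for hits in per_query:
        for hit in hits:
            # Keep the highest-scoring copy of a document (an assumption of this sketch).
            if hit.doc_id not in best or hit.score > best[hit.doc_id].score:
                best[hit.doc_id] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


results = multi_query_retrieve(["renewable energy", "geothermal", "hydropower"])
```

Each document appears at most once in `results`, even when several queries match it, and the list is ordered from the highest score to the lowest.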

<a id="multi_query_text_retriever.MultiQueryTextRetriever.to_dict"></a>

#### MultiQueryTextRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

The serialized component as a dictionary.

<a id="multi_query_text_retriever.MultiQueryTextRetriever.from_dict"></a>

#### MultiQueryTextRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MultiQueryTextRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="sentence_window_retriever"></a>

## Module sentence\_window\_retriever

<a id="sentence_window_retriever.SentenceWindowRetriever"></a>

### SentenceWindowRetriever

Retrieves neighboring documents from a DocumentStore to provide context for query results.

This component is intended to be used after a Retriever (e.g., BM25Retriever, EmbeddingRetriever).
It enhances retrieved results by fetching adjacent document chunks to give
additional context for the user.

The documents must include metadata indicating their origin and position:
- `source_id` is used to group sentence chunks belonging to the same original document.
- `split_id` represents the position/order of the chunk within the document.

The number of adjacent documents to include on each side of the retrieved document can be configured using the
`window_size` parameter. You can also specify which metadata fields to use for source and split ID
via `source_id_meta_field` and `split_id_meta_field`.
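
The window selection described above can be illustrated with plain dictionaries. This is a minimal sketch under stated assumptions, not the component's code: `context_window` is a hypothetical helper, chunks are modeled as dictionaries with a `meta` mapping, and a real Document Store would apply the same conditions as a metadata filter.

```python
def context_window(chunks: list[dict], hit: dict, window_size: int) -> list[dict]:
    """Return chunks from the hit's source whose split_id lies within window_size of the hit."""
    source_id = hit["meta"]["source_id"]
    split_id = hit["meta"]["split_id"]
    low, high = split_id - window_size, split_id + window_size
    neighbours = [
        chunk
        for chunk in chunks
        if chunk["meta"]["source_id"] == source_id
        and low <= chunk["meta"]["split_id"] <= high
    ]
    # Order the window by position so the text reads naturally.
    return sorted(neighbours, key=lambda chunk: chunk["meta"]["split_id"])


# Seven chunks from one source, plus one chunk from an unrelated source.
chunks = [
    {"content": f"sentence {i}", "meta": {"source_id": "doc-a", "split_id": i}}
    for i in range(7)
] + [{"content": "other", "meta": {"source_id": "doc-b", "split_id": 3}}]

window = context_window(chunks, hit=chunks[3], window_size=2)
window_ids = [chunk["meta"]["split_id"] for chunk in window]
print(window_ids)  # [1, 2, 3, 4, 5]
```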

The SentenceWindowRetriever is compatible with the following DocumentStores:
- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)

### Usage example

```python
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import SentenceWindowRetriever
from haystack.components.preprocessors import DocumentSplitter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Split the text into overlapping 10-word chunks; the splitter adds the
# `source_id` and `split_id` metadata that the SentenceWindowRetriever relies on.
splitter = DocumentSplitter(split_length=10, split_overlap=5, split_by="word")
text = (
    "This is a text with some words. There is a second sentence. And there is also a third sentence. "
    "It also contains a fourth sentence. And a fifth sentence. And a sixth sentence. And a seventh sentence"
)
doc = Document(content=text)
docs = splitter.run([doc])
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs["documents"])

# Retrieve the best-matching chunk, then expand it with its neighbors.
rag = Pipeline()
rag.add_component("bm25_retriever", InMemoryBM25Retriever(doc_store, top_k=1))
rag.add_component("sentence_window_retriever", SentenceWindowRetriever(document_store=doc_store, window_size=2))
rag.connect("bm25_retriever", "sentence_window_retriever")

rag.run({"bm25_retriever": {"query": "third"}})

# >> {'sentence_window_retriever': {'context_windows': ['some words. There is a second sentence.
# >> And there is also a third sentence. It also contains a fourth sentence. And a fifth sentence. And a sixth
# >> sentence. And a'], 'context_documents': [[Document(id=..., content: 'some words. There is a second sentence.
# >> And there is ', meta: {'source_id': '...', 'page_number': 1, 'split_id': 1, 'split_idx_start': 20,
# >> '_split_overlap': [{'doc_id': '...', 'range': (20, 43)}, {'doc_id': '...', 'range': (0, 30)}]}),
# >> Document(id=..., content: 'second sentence. And there is also a third sentence. It ',
# >> meta: {'source_id': '74ea87deb38012873cf8c07e...f19d01a26a098447113e1d7b83efd30c02987114', 'page_number': 1,
# >> 'split_id': 2, 'split_idx_start': 43, '_split_overlap': [{'doc_id': '...', 'range': (23, 53)}, {'doc_id': '...',
# >> 'range': (0, 26)}]}), Document(id=..., content: 'also a third sentence. It also contains a fourth sentence. ',
# >> meta: {'source_id': '...', 'page_number': 1, 'split_id': 3, 'split_idx_start': 73, '_split_overlap':
# >> [{'doc_id': '...', 'range': (30, 56)}, {'doc_id': '...', 'range': (0, 33)}]}), Document(id=..., content:
# >> 'also contains a fourth sentence. And a fifth sentence. And ', meta: {'source_id': '...', 'page_number': 1,
# >> 'split_id': 4, 'split_idx_start': 99, '_split_overlap': [{'doc_id': '...', 'range': (26, 59)},
# >> {'doc_id': '...', 'range': (0, 26)}]}), Document(id=..., content: 'And a fifth sentence. And a sixth sentence.
# >> And a ', meta: {'source_id': '...', 'page_number': 1, 'split_id': 5, 'split_idx_start': 132,
# >> '_split_overlap': [{'doc_id': '...', 'range': (33, 59)}, {'doc_id': '...', 'range': (0, 24)}]})]]}}
```

<a id="sentence_window_retriever.SentenceWindowRetriever.__init__"></a>

#### SentenceWindowRetriever.\_\_init\_\_

```python
def __init__(document_store: DocumentStore,
             window_size: int = 3,
             *,
             source_id_meta_field: str | list[str] = "source_id",
             split_id_meta_field: str = "split_id",
             raise_on_missing_meta_fields: bool = True)
```

Creates a new SentenceWindowRetriever component.

**Arguments**:

- `document_store`: The Document Store to retrieve the surrounding documents from.
- `window_size`: The number of documents to retrieve before and after the relevant one.
For example, `window_size=2` fetches 2 preceding and 2 following documents.
- `source_id_meta_field`: The metadata field that contains the source ID of the document.
This can be a single field or a list of fields. If multiple fields are provided, the retriever
treats documents as part of the same source only if all the fields match.
- `split_id_meta_field`: The metadata field that contains the split ID of the document.
- `raise_on_missing_meta_fields`: If True, raises an error if the documents do not contain the required
metadata fields. If False, the retriever skips retrieving the context for documents that are missing
the required metadata fields, but still includes the original document in the results.
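
To make the multi-field matching and the missing-field behavior concrete, here is a small sketch using plain dictionaries as metadata. The helpers `source_key` and `has_required_meta` are hypothetical and only mirror the behavior described in the arguments above; in particular, raising `ValueError` is an assumption of this sketch rather than a documented exception type.

```python
def source_key(meta: dict, source_fields: list[str]) -> tuple:
    """Build a grouping key; chunks share a source only if every field matches."""
    return tuple(meta[field] for field in source_fields)


def has_required_meta(
    meta: dict,
    source_fields: list[str],
    split_field: str,
    raise_on_missing: bool = True,
) -> bool:
    """Check the required fields, raising or returning False when any is absent."""
    missing = [field for field in [*source_fields, split_field] if field not in meta]
    if missing and raise_on_missing:
        # ValueError is an assumption of this sketch.
        raise ValueError(f"Missing metadata fields: {missing}")
    return not missing


fields = ["file_name", "page"]
page_1 = {"file_name": "report.pdf", "page": 1, "split_id": 0}
page_1_next = {"file_name": "report.pdf", "page": 1, "split_id": 1}
page_2 = {"file_name": "report.pdf", "page": 2, "split_id": 0}

# Same file and same page: treated as one source.
same_source = source_key(page_1, fields) == source_key(page_1_next, fields)
# Same file but a different page: treated as different sources.
different_source = source_key(page_1, fields) != source_key(page_2, fields)
# Lenient mode reports the problem instead of raising.
lenient_ok = has_required_meta(
    {"file_name": "report.pdf"}, fields, "split_id", raise_on_missing=False
)
```

With `raise_on_missing=False`, a caller can keep the original document and simply skip the context lookup when `has_required_meta` returns `False`.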

<a id="sentence_window_retriever.SentenceWindowRetriever.merge_documents_text"></a>

#### SentenceWindowRetriever.merge\_documents\_text

```python
@staticmethod
def merge_documents_text(documents: list[Document]) -> str
```

Merges the text of a list of documents into a single string.

This function concatenates the textual content of a list of documents into a single string, eliminating any
overlapping content.

**Arguments**:

- `documents`: List of Documents to merge.
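
One way to merge overlapping chunks is to track how much of the original text has already been emitted and skip that prefix in each following chunk. The sketch below assumes every chunk records its character offset in the source text under `split_idx_start` (the field shown in the usage example output); `merge_text` is a hypothetical helper, not the method itself.

```python
def merge_text(chunks: list[dict]) -> str:
    """Concatenate chunk contents in source order, dropping overlapping prefixes."""
    ordered = sorted(chunks, key=lambda chunk: chunk["split_idx_start"])
    merged = ""
    covered_until = 0  # offset in the source text already present in `merged`
    for chunk in ordered:
        start = chunk["split_idx_start"]
        # Skip the characters that an earlier chunk has already contributed.
        skip = max(covered_until - start, 0)
        merged += chunk["content"][skip:]
        covered_until = max(covered_until, start + len(chunk["content"]))
    return merged


text = "The quick brown fox jumps over the lazy dog."
# Three overlapping slices of the text, given out of order on purpose.
chunks = [
    {"content": text[25:44], "split_idx_start": 25},
    {"content": text[0:19], "split_idx_start": 0},
    {"content": text[10:30], "split_idx_start": 10},
]

merged = merge_text(chunks)
```

Because the slices cover the whole sentence, `merged` equals `text`, with each overlapping region appearing only once.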

<a id="sentence_window_retriever.SentenceWindowRetriever.to_dict"></a>

#### SentenceWindowRetriever.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="sentence_window_retriever.SentenceWindowRetriever.from_dict"></a>

#### SentenceWindowRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "SentenceWindowRetriever"
```

Deserializes the component from a dictionary.

**Returns**:

Deserialized component.

<a id="sentence_window_retriever.SentenceWindowRetriever.run"></a>

#### SentenceWindowRetriever.run

```python
@component.output_types(context_windows=list[str],
                        context_documents=list[Document])
def run(retrieved_documents: list[Document], window_size: int | None = None)
```

Retrieves the documents surrounding each retrieved document from the document store, based on the
`source_id` and `split_id` metadata of the retrieved documents.

Implements the logic behind the sentence-window technique, retrieving the surrounding documents of a given
document from the document store.

**Arguments**:

- `retrieved_documents`: List of retrieved documents from the previous retriever.
- `window_size`: The number of documents to retrieve before and after the relevant one. This overrides
the `window_size` parameter set in the constructor.

**Returns**:

A dictionary with the following keys:
- `context_windows`: A list of strings, where each string represents the concatenated text from the
  context window of the corresponding document in `retrieved_documents`.
- `context_documents`: A list of `Document` objects, containing the retrieved documents plus the context
  documents surrounding them. The documents are sorted by the `split_idx_start` meta field.

<a id="sentence_window_retriever.SentenceWindowRetriever.run_async"></a>

#### SentenceWindowRetriever.run\_async

```python
@component.output_types(context_windows=list[str],
                        context_documents=list[Document])
async def run_async(retrieved_documents: list[Document],
                    window_size: int | None = None)
```

Asynchronously retrieves the documents surrounding each retrieved document from the document store, based on the
`source_id` and `split_id` metadata of the retrieved documents.

Implements the logic behind the sentence-window technique, retrieving the surrounding documents of a given
document from the document store.

**Arguments**:

- `retrieved_documents`: List of retrieved documents from the previous retriever.
- `window_size`: The number of documents to retrieve before and after the relevant one. This overrides
the `window_size` parameter set in the constructor.

**Returns**:

A dictionary with the following keys:
- `context_windows`: A list of strings, where each string represents the concatenated text from the
  context window of the corresponding document in `retrieved_documents`.
- `context_documents`: A list of `Document` objects, containing the retrieved documents plus the context
  documents surrounding them. The documents are sorted by the `split_idx_start` meta field.
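
Because `run_async` is a coroutine, it has to be awaited from an event loop. The sketch below shows the calling pattern only: `StubWindowRetriever` is a hypothetical stand-in that mimics the coroutine's signature and the per-call `window_size` override, so the example runs without a Document Store. It is not the component's implementation.

```python
import asyncio
from typing import Optional


class StubWindowRetriever:
    """Hypothetical stand-in exposing the same coroutine shape as `run_async`."""

    def __init__(self, window_size: int = 3) -> None:
        self.window_size = window_size

    async def run_async(
        self, retrieved_documents: list[str], window_size: Optional[int] = None
    ) -> dict:
        # A per-call value takes precedence over the constructor default.
        size = self.window_size if window_size is None else window_size
        await asyncio.sleep(0)  # placeholder for non-blocking document store I/O
        return {"context_windows": [f"{doc} (window={size})" for doc in retrieved_documents]}


async def main() -> list[dict]:
    retriever = StubWindowRetriever(window_size=3)
    # Await two independent calls concurrently; the second overrides window_size.
    return await asyncio.gather(
        retriever.run_async(["chunk-a"]),
        retriever.run_async(["chunk-b"], window_size=1),
    )


first, second = asyncio.run(main())
```

`asyncio.gather` returns the results in the order the coroutines were passed, so `first` belongs to the call that used the constructor default and `second` to the call that overrode it.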