---
title: "OpenSearchHybridRetriever"
id: opensearchhybridretriever
slug: "/opensearchhybridretriever"
description: "This is a [SuperComponent](../../concepts/components/supercomponents.mdx) that implements a Hybrid Retriever in a single component, relying on OpenSearch as the backend Document Store."
---

# OpenSearchHybridRetriever

This is a [SuperComponent](../../concepts/components/supercomponents.mdx) that implements a Hybrid Retriever in a single component, relying on OpenSearch as the backend Document Store.

A Hybrid Retriever uses both traditional keyword-based search (such as BM25) and embedding-based search to retrieve documents, combining the strengths of both approaches. The Retriever then merges and re-ranks the results from both methods.

<div className="key-value-table">

|  |  |
| --- | --- |
| Most common position in a pipeline | After an [OpenSearchDocumentStore](../../document-stores/opensearch-document-store.mdx) |
| Mandatory init variables | `document_store`: An instance of `OpenSearchDocumentStore` to use for retrieval  <br /> <br />`embedder`: Any [Embedder](../embedders.mdx) implementing the `TextEmbedder` protocol |
| Mandatory run variables | `query`: A query string |
| Output variables | `documents`: A list of documents matching the query |
| API reference | [OpenSearch](/reference/integrations-opensearch) |
| GitHub | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/opensearch |

</div>

## Overview

The `OpenSearchHybridRetriever` combines two retrieval methods:

1. **BM25 Retrieval**: A keyword-based search that uses the BM25 algorithm to find documents based on term frequency and inverse document frequency. It's based on the [`OpenSearchBM25Retriever`](opensearchbm25retriever.mdx) component and is suitable for traditional keyword-based search.
2. **Embedding-based Retrieval**: A semantic search that uses vector similarity to find documents that are semantically similar to the query. It's based on the [`OpenSearchEmbeddingRetriever`](opensearchembeddingretriever.mdx) component and is suitable for semantic search.

The component automatically handles:

- Converting the query into an embedding using the provided embedder,
- Running both retrieval methods in parallel,
- Merging and re-ranking the results using the specified join mode.

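To make the merging step more concrete, here is a minimal, library-free sketch of reciprocal rank fusion, one of the join modes used later on this page. It uses the textbook formula (each list contributes `1 / (k + rank)` per document) with `k = 60`, which is an assumption made for illustration; the component's actual constants, weighting, and score normalization may differ. The function name and the document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Textbook RRF: every list adds 1 / (k + rank) to a document's score,
    where rank starts at 1. The value of k is an assumption for this sketch.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings returned by the two retrieval methods
bm25_ranking = ["doc_rl", "doc_sl", "doc_dl"]
embedding_ranking = ["doc_rl", "doc_ml", "doc_sl"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that rank well in both lists (here `doc_rl` and `doc_sl`) end up ahead of documents that appear in only one list, which is the behavior hybrid retrieval is after.
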
### Setup and Installation

```shell
pip install opensearch-haystack
```

### Optional Parameters

This Retriever accepts various optional parameters. You can find the most up-to-date list of parameters in our [API Reference](/reference/integrations-opensearch#opensearchhybridretriever).

You can pass additional parameters to the underlying components using the `bm25_retriever` and `embedding_retriever` dictionaries.
The `DocumentJoiner` parameters are all exposed on the `OpenSearchHybridRetriever` class, so you can set them directly.

Here's an example:

```python
from haystack_integrations.components.retrievers.opensearch import OpenSearchHybridRetriever

# `document_store` and `embedder` are assumed to be initialized already,
# for example as shown in the Usage section.
retriever = OpenSearchHybridRetriever(
    document_store=document_store,
    embedder=embedder,
    bm25_retriever={"raise_on_failure": True},
    embedding_retriever={"raise_on_failure": False},
)
```

## Usage

### On its own

This Retriever needs an `OpenSearchDocumentStore` populated with documents to run. You can’t use it on its own.

### In a pipeline

Here's a basic example of how to use the `OpenSearchHybridRetriever`.

First, run OpenSearch locally using Docker with the following command. Make sure Docker is installed and running on your machine. Note that this example disables the security plugin for simplicity. In a production environment, you should enable security features.

```shell
docker run -d \
  --name opensearch-nosec \
  -p 9200:9200 \
  -p 9600:9600 \
  -e "discovery.type=single-node" \
  -e "DISABLE_SECURITY_PLUGIN=true" \
  opensearchproject/opensearch:2.12.0
```

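The container can take a little while to start accepting requests. As an optional convenience, the sketch below checks whether the HTTP endpoint published on port 9200 by the command above responds before you write documents. It relies only on the Python standard library; the helper name `opensearch_is_up` is hypothetical and not part of any Haystack or OpenSearch API, and it assumes the security plugin is disabled so that a plain HTTP request without credentials is accepted.

```python
import http.client
import json
import urllib.request


def opensearch_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200 and a JSON body.

    Hypothetical helper for local development only: it sends no credentials,
    so it assumes the security plugin is disabled.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            json.loads(response.read().decode("utf-8"))  # body must be valid JSON
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a non-JSON reply
        return False


if __name__ == "__main__":
    print("OpenSearch reachable:", opensearch_is_up())
```

If this prints `False` right after starting the container, waiting a few seconds and retrying is usually enough.
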
```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack_integrations.components.retrievers.opensearch import OpenSearchHybridRetriever
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

# Initialize the document store
doc_store = OpenSearchDocumentStore(
    hosts=["http://localhost:9200"],
    index="document_store",
    embedding_dim=384,
)

# Create some sample documents
docs = [
    Document(content="Machine learning is a subset of artificial intelligence."),
    Document(content="Deep learning is a subset of machine learning."),
    Document(content="Natural language processing is a field of AI."),
    Document(content="Reinforcement learning is a type of machine learning."),
    Document(content="Supervised learning is a type of machine learning."),
]

# Embed the documents and add them to the document store
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
docs = doc_embedder.run(docs)
doc_store.write_documents(docs["documents"])

# Initialize a Haystack text embedder, in this case the SentenceTransformersTextEmbedder
embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# Initialize the hybrid retriever
retriever = OpenSearchHybridRetriever(
    document_store=doc_store,
    embedder=embedder,
    top_k_bm25=3,
    top_k_embedding=3,
    join_mode="reciprocal_rank_fusion",
)

# Run the retriever
results = retriever.run(query="What is reinforcement learning?", filters_bm25=None, filters_embedding=None)
print(results)

# Example output:
# {'documents': [Document(id=..., content: 'Reinforcement learning is a type of machine learning.', score: 1.0),
#   Document(id=..., content: 'Supervised learning is a type of machine learning.', score: 0.9760624679979518),
#   Document(id=..., content: 'Deep learning is a subset of machine learning.', score: 0.4919354838709677),
#   Document(id=..., content: 'Machine learning is a subset of artificial intelligence.', score: 0.4841269841269841)]}
```