opensearchhybridretriever.mdx
---
title: "OpenSearchHybridRetriever"
id: opensearchhybridretriever
slug: "/opensearchhybridretriever"
description: "This is a [SuperComponent](../../concepts/components/supercomponents.mdx) that implements a Hybrid Retriever in a single component, relying on OpenSearch as the backend Document Store."
---

# OpenSearchHybridRetriever

This is a [SuperComponent](../../concepts/components/supercomponents.mdx) that implements a Hybrid Retriever in a single component, relying on OpenSearch as the backend Document Store.

A Hybrid Retriever uses both traditional keyword-based search (such as BM25) and embedding-based search to retrieve documents, combining the strengths of both approaches. The Retriever then merges and re-ranks the results from both methods.

<div className="key-value-table">

|  |  |
| --- | --- |
| Most common position in a pipeline | After an [OpenSearchDocumentStore](../../document-stores/opensearch-document-store.mdx) |
| Mandatory init variables | `document_store`: An instance of `OpenSearchDocumentStore` to use for retrieval <br /> <br />`embedder`: Any [Embedder](../embedders.mdx) implementing the `TextEmbedder` protocol |
| Mandatory run variables | `query`: A query string |
| Output variables | `documents`: A list of documents matching the query |
| API reference | [OpenSearch](/reference/integrations-opensearch) |
| GitHub | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/opensearch |

</div>

## Overview

The `OpenSearchHybridRetriever` combines two retrieval methods:

1. **BM25 Retrieval**: A keyword-based search that uses the BM25 algorithm to find documents based on term frequency and inverse document frequency. It's based on the [`OpenSearchBM25Retriever`](opensearchbm25retriever.mdx) component and is suitable for traditional keyword-based search.
2. **Embedding-based Retrieval**: A semantic search that uses vector similarity to find documents that are semantically similar to the query. It's based on the [`OpenSearchEmbeddingRetriever`](opensearchembeddingretriever.mdx) component and is suitable for semantic search.

The component automatically handles:

- Converting the query into an embedding using the provided embedder,
- Running both retrieval methods in parallel,
- Merging and re-ranking the results using the specified join mode.

### Setup and Installation

```shell
pip install opensearch-haystack
```

### Optional Parameters

This Retriever accepts various optional parameters. You can find the most up-to-date list of parameters in our [API Reference](/reference/integrations-opensearch#opensearchhybridretriever).

You can pass additional parameters to the underlying components using the `bm25_retriever` and `embedding_retriever` dictionaries.
The `DocumentJoiner` parameters are all exposed on the `OpenSearchHybridRetriever` class, so you can set them directly.

Here's an example:

```python
retriever = OpenSearchHybridRetriever(
    document_store=document_store,
    embedder=embedder,
    bm25_retriever={"raise_on_failure": True},
    embedding_retriever={"raise_on_failure": False},
)
```

## Usage

### On its own

This Retriever needs an `OpenSearchDocumentStore` populated with documents to run. You can’t use it on its own.

### In a pipeline

Here's a basic example of how to use the `OpenSearchHybridRetriever`.

You can use the following command to run OpenSearch locally using Docker. Make sure you have Docker installed and running on your machine. Note that this example disables the security plugin for simplicity. In a production environment, you should enable security features.

```shell
docker run -d \
  --name opensearch-nosec \
  -p 9200:9200 \
  -p 9600:9600 \
  -e "discovery.type=single-node" \
  -e "DISABLE_SECURITY_PLUGIN=true" \
  opensearchproject/opensearch:2.12.0
```

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack_integrations.components.retrievers.opensearch import OpenSearchHybridRetriever
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

# Initialize the document store
doc_store = OpenSearchDocumentStore(
    hosts=["http://localhost:9200"],
    index="document_store",
    embedding_dim=384,
)

# Create some sample documents
docs = [
    Document(content="Machine learning is a subset of artificial intelligence."),
    Document(content="Deep learning is a subset of machine learning."),
    Document(content="Natural language processing is a field of AI."),
    Document(content="Reinforcement learning is a type of machine learning."),
    Document(content="Supervised learning is a type of machine learning."),
]

# Embed the documents and add them to the document store
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()  # load the model before running the embedder on its own
docs = doc_embedder.run(docs)
doc_store.write_documents(docs["documents"])

# Initialize a Haystack text embedder, in this case the SentenceTransformersTextEmbedder
embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# Initialize the hybrid retriever
retriever = OpenSearchHybridRetriever(
    document_store=doc_store,
    embedder=embedder,
    top_k_bm25=3,
    top_k_embedding=3,
    join_mode="reciprocal_rank_fusion",
)

# Run the retriever
results = retriever.run(query="What is reinforcement learning?", filters_bm25=None, filters_embedding=None)

print(results)
# {'documents': [Document(id=..., content: 'Reinforcement learning is a type of machine learning.', score: 1.0),
#                Document(id=..., content: 'Supervised learning is a type of machine learning.', score: 0.9760624679979518),
#                Document(id=..., content: 'Deep learning is a subset of machine learning.', score: 0.4919354838709677),
#                Document(id=..., content: 'Machine learning is a subset of artificial intelligence.', score: 0.4841269841269841)]}
```
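
To build intuition for what the `reciprocal_rank_fusion` join mode does when it merges the BM25 and embedding results, here is a minimal standalone sketch of the general technique. It is not the `DocumentJoiner` implementation: the function name, the document IDs, and the constant `k=60` (a common choice in descriptions of reciprocal rank fusion) are assumptions made for illustration, and the scores it produces are not normalized the way the scores in the output above appear to be.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. `k` is an assumed smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings, standing in for the two retrievers' results
bm25_ranking = ["doc_rl", "doc_sl", "doc_dl"]
embedding_ranking = ["doc_rl", "doc_ml", "doc_sl"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that rank well in both lists (here `doc_rl` and `doc_sl`) end up ahead of documents returned by only one retriever, which is the behavior a hybrid setup usually aims for.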