---
title: "ElasticsearchBM25Retriever"
id: elasticsearchbm25retriever
slug: "/elasticsearchbm25retriever"
description: "A keyword-based Retriever that fetches Documents matching a query from the Elasticsearch Document Store."
---

# ElasticsearchBM25Retriever

A keyword-based Retriever that fetches Documents matching a query from the Elasticsearch Document Store.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | 1. Before a [`PromptBuilder`](../builders/promptbuilder.mdx) in a RAG pipeline <br />2. The last component in the semantic search pipeline <br />3. Before an [`ExtractiveReader`](../readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | `document_store`: An instance of [ElasticsearchDocumentStore](../../document-stores/elasticsearch-document-store.mdx) |
| **Mandatory run variables** | `query`: A string |
| **Output variables** | `documents`: A list of documents (matching the query) |
| **API reference** | [Elasticsearch](/reference/integrations-elasticsearch) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/elasticsearch |

</div>

## Overview

`ElasticsearchBM25Retriever` is a keyword-based Retriever that fetches Documents matching a query from an `ElasticsearchDocumentStore`. It determines the similarity between Documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.

Since the `ElasticsearchBM25Retriever` matches strings based on word overlap, it’s often used to find exact matches to names of persons or products, IDs, or well-defined error messages. The BM25 algorithm is lightweight and simple. Nevertheless, more complex embedding-based approaches often struggle to beat it on out-of-domain data.
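
To make "weighted word overlap" concrete, here is a minimal sketch of the textbook BM25 formula in plain Python. It is an illustration only, not the scoring code Elasticsearch runs: the `tokenize` and `bm25_scores` helpers are hypothetical, the tokenizer is deliberately naive, and `k1=1.2` and `b=0.75` are commonly used values assumed here.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase word tokenizer (assumption); Elasticsearch analyzers are more elaborate.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            tf = term_freq[term]
            if tf == 0:
                continue  # terms missing from the document contribute nothing
            # Rare terms get a higher weight than terms found in most documents.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "There are over 7,000 languages spoken around the world today.",
    "Elephants have been observed to behave in a way that indicates a high level of self-awareness.",
    "In certain parts of the world, you can witness the phenomenon of bioluminescent waves.",
]
scores = bm25_scores("How many languages are spoken around the world today?", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

The first document shares the most terms with the query, including the rarer ones, so it receives the highest score. The second document shares none and scores zero.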

In addition to the `query`, the `ElasticsearchBM25Retriever` accepts other optional parameters, including `top_k` (the maximum number of Documents to retrieve) and `filters` to narrow down the search space. When initializing the Retriever, you can also adjust how [inexact fuzzy matching](https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness) is performed, using the `fuzziness` parameter.

If you want a semantic match between a query and documents, you can use `ElasticsearchEmbeddingRetriever`, which uses vectors created by embedding models to retrieve relevant information.

## Installation

[Install](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html) Elasticsearch and then [start](https://www.elastic.co/guide/en/elasticsearch/reference/current/starting-elasticsearch.html) an instance. Haystack supports Elasticsearch 8.

If you have Docker set up, we recommend pulling the Docker image and running it.

```shell
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.11.1
docker run -p 9200:9200 -e "discovery.type=single-node" -e "ES_JAVA_OPTS=-Xms1024m -Xmx1024m" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.11.1
```

As an alternative, you can go to the [Elasticsearch integration GitHub](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/elasticsearch) and start a Docker container running Elasticsearch using the provided `docker-compose.yml`:

```shell
docker compose up
```

Once you have a running Elasticsearch instance, install the `elasticsearch-haystack` integration:

```shell
pip install elasticsearch-haystack
```

## Usage

### On its own

```python
from haystack import Document
from haystack_integrations.components.retrievers.elasticsearch import (
    ElasticsearchBM25Retriever,
)
from haystack_integrations.document_stores.elasticsearch import (
    ElasticsearchDocumentStore,
)

# Connect to the local Elasticsearch instance
document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200/")
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
document_store.write_documents(documents=documents)

retriever = ElasticsearchBM25Retriever(document_store=document_store)
result = retriever.run(query="How many languages are spoken around the world today?")
print(result["documents"])
```

### In a RAG pipeline

Set your `OPENAI_API_KEY` as an environment variable and then run the following code:

```python
from haystack_integrations.components.retrievers.elasticsearch import (
    ElasticsearchBM25Retriever,
)
from haystack_integrations.document_stores.elasticsearch import (
    ElasticsearchDocumentStore,
)

from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.types import DuplicatePolicy
from haystack.utils import Secret

# Create a RAG query pipeline
prompt_template = """
Given these documents, answer the question.\nDocuments:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}

\nQuestion: {{question}}
\nAnswer:
"""

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200/")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]

# DuplicatePolicy.SKIP is optional, but it lets you run the script multiple times without errors
document_store.write_documents(documents=documents, policy=DuplicatePolicy.SKIP)

retriever = ElasticsearchBM25Retriever(document_store=document_store)
rag_pipeline = Pipeline()
rag_pipeline.add_component(name="retriever", instance=retriever)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template),
    name="prompt_builder",
)
# The generator reads the API key from the OPENAI_API_KEY environment variable
rag_pipeline.add_component(
    instance=OpenAIGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY")),
    name="llm",
)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "How many languages are spoken around the world today?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    },
)
print(result["answer_builder"]["answers"][0].data)
```

Here’s an example output you might get:

```python
"Over 7,000 languages are spoken around the world today"
```
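
### With optional parameters

The optional parameters described in the Overview (`top_k`, `filters`, and `fuzziness`) can be combined as in the sketch below. Treat it as an illustration under assumptions rather than a reference implementation: the `language_filter` and `retrieve` helpers and the `language` metadata field are hypothetical, and the code expects an Elasticsearch instance at `http://localhost:9200/`.

```python
from typing import Any


def language_filter(language: str) -> dict[str, Any]:
    # Haystack-style metadata filter; the "language" meta field is a hypothetical example.
    return {"field": "meta.language", "operator": "==", "value": language}


def retrieve(query: str, language: str, top_k: int = 3):
    # Imported lazily so that language_filter() works without the integration installed.
    from haystack_integrations.components.retrievers.elasticsearch import (
        ElasticsearchBM25Retriever,
    )
    from haystack_integrations.document_stores.elasticsearch import (
        ElasticsearchDocumentStore,
    )

    # Assumes a local Elasticsearch instance is running at this address.
    document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200/")
    retriever = ElasticsearchBM25Retriever(
        document_store=document_store,
        fuzziness="AUTO",  # let Elasticsearch choose the edit distance from the term length
        top_k=10,  # used when run() receives no top_k
    )
    # Values passed to run() apply to this call only.
    result = retriever.run(query=query, filters=language_filter(language), top_k=top_k)
    return result["documents"]
```

Calling `retrieve("languages spoken today", language="en")` only returns Documents that were written with matching metadata, for example `Document(content="...", meta={"language": "en"})`.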