---
title: "ElasticsearchBM25Retriever"
id: elasticsearchbm25retriever
slug: "/elasticsearchbm25retriever"
description: "A keyword-based Retriever that fetches Documents matching a query from the Elasticsearch Document Store."
---

# ElasticsearchBM25Retriever

A keyword-based Retriever that fetches Documents matching a query from the Elasticsearch Document Store.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | 1. Before a [`PromptBuilder`](../builders/promptbuilder.mdx) in a RAG pipeline 2. The last component in a semantic search pipeline 3. Before an [`ExtractiveReader`](../readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | `document_store`: An instance of [ElasticsearchDocumentStore](../../document-stores/elasticsearch-document-store.mdx) |
| **Mandatory run variables** | `query`: A string |
| **Output variables** | `documents`: A list of documents (matching the query) |
| **API reference** | [Elasticsearch](/reference/integrations-elasticsearch) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/elasticsearch |

</div>

## Overview

`ElasticsearchBM25Retriever` is a keyword-based Retriever that fetches Documents matching a query from an `ElasticsearchDocumentStore`. It determines the similarity between Documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.

Since the `ElasticsearchBM25Retriever` matches strings based on word overlap, it’s often used to find exact matches to names of persons or products, IDs, or well-defined error messages. BM25 is lightweight and simple, yet more complex embedding-based approaches often struggle to beat it on out-of-domain data.

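To make "weighted word overlap" concrete, here is a minimal sketch of a textbook BM25 scorer in plain Python. It is an illustration only, not the code Elasticsearch runs: the regex tokenizer and the `bm25_scores` helper are assumptions made for this example, and `k1=1.2`, `b=0.75` are the commonly used defaults.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word tokens; Elasticsearch applies its own analyzers.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms receive a higher inverse document frequency weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 saturates term frequency; b normalizes by document length.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "There are over 7,000 languages spoken around the world today.",
    "Elephants can recognize themselves in mirrors.",
    "Bioluminescent waves can be seen in the Maldives.",
]
scores = bm25_scores("languages spoken today", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Only the first document shares any terms with the query, so it is the only one with a non-zero score. Documents without overlapping words score zero no matter how semantically close they are, which is the main trade-off of keyword retrieval.
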
In addition to the `query`, the `ElasticsearchBM25Retriever` accepts other optional parameters, including `top_k` (the maximum number of Documents to retrieve) and `filters` to narrow down the search space.
When initializing the Retriever, you can also use the `fuzziness` parameter to control how [inexact (fuzzy) matching](https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness) is performed.

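As a rough illustration of what fuzzy matching means, the sketch below computes an edit distance between two terms and applies the length-based thresholds that Elasticsearch documents for its `AUTO` fuzziness setting (terms of 0 to 2 characters must match exactly, 3 to 5 characters allow one edit, longer terms allow two). The helper names are hypothetical, and the sketch uses plain Levenshtein distance, so a transposition counts as two edits, whereas Elasticsearch can count it as one.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution or match
                ),
            )
        previous = current
    return previous[-1]


def auto_max_edits(term):
    """Edits allowed for a term under the documented AUTO thresholds (3 and 6)."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query_term, indexed_term):
    """Hypothetical helper: would the query term fuzzily match the indexed term?"""
    return edit_distance(query_term, indexed_term) <= auto_max_edits(query_term)


print(fuzzy_match("elefant", "elephant"))
print(fuzzy_match("id", "it"))
```

With these rules, a misspelled long word such as `elefant` can still match `elephant`, while very short terms such as IDs must match exactly.
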
If you want a semantic match between a query and documents, you can use `ElasticsearchEmbeddingRetriever`, which uses vectors created by embedding models to retrieve relevant information.

## Installation

[Install](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html) Elasticsearch and then [start](https://www.elastic.co/guide/en/elasticsearch/reference/current/starting-elasticsearch.html) an instance. Haystack supports Elasticsearch 8.

If you have Docker set up, we recommend pulling the Docker image and running it.

```shell
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.11.1
docker run -p 9200:9200 -e "discovery.type=single-node" -e "ES_JAVA_OPTS=-Xms1024m -Xmx1024m" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.11.1
```

As an alternative, you can go to the [Elasticsearch integration GitHub repository](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/elasticsearch) and start a Docker container running Elasticsearch using the provided `docker-compose.yml`:

```shell
docker compose up
```

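Before moving on, it can help to confirm that the instance is reachable. The following sketch uses only the Python standard library to request the root endpoint, which returns a JSON document with cluster information. It assumes the unauthenticated HTTP setup from the `docker run` command above (`xpack.security.enabled=false`) and the default URL `http://localhost:9200`; the `elasticsearch_is_up` helper is a hypothetical name for this example.

```python
import json
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if the root endpoint answers with a cluster info document."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            info = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON response.
        return False
    # The root endpoint of a running node reports its version details.
    return isinstance(info, dict) and "version" in info


if __name__ == "__main__":
    print("Elasticsearch reachable:", elasticsearch_is_up())
```

A freshly started container can take several seconds before it accepts connections, so a `False` result right after startup may only mean the node is still booting.
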
Once you have a running Elasticsearch instance, install the `elasticsearch-haystack` integration:

```shell
pip install elasticsearch-haystack
```

## Usage

### On its own

```python
from haystack import Document
from haystack_integrations.components.retrievers.elasticsearch import (
    ElasticsearchBM25Retriever,
)
from haystack_integrations.document_stores.elasticsearch import (
    ElasticsearchDocumentStore,
)

# Connect to the local Elasticsearch instance and index a few Documents
document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200/")
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
document_store.write_documents(documents=documents)

# Retrieve the Documents that best match the query by BM25 score
retriever = ElasticsearchBM25Retriever(document_store=document_store)
result = retriever.run(query="How many languages are spoken around the world today?")
print(result["documents"])
```

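The Retriever returns a dictionary whose `documents` key holds the matching Documents. As a small sketch of how you might inspect that output, the hypothetical `summarize` helper below formats each Document's `score` and the start of its `content`. So that the snippet runs without Elasticsearch, it uses a made-up stand-in result built from `SimpleNamespace` objects; the score value is invented for illustration and is not real retriever output.

```python
from types import SimpleNamespace


def summarize(result, max_chars=40):
    """Return one 'score  content' line per retrieved Document."""
    lines = []
    for doc in result["documents"]:
        # A Document's score can be None if no score was assigned.
        score = "n/a" if doc.score is None else f"{doc.score:.3f}"
        lines.append(f"{score}  {doc.content[:max_chars]}")
    return lines


# Made-up stand-in for `retriever.run(...)` output; the score is illustrative only.
fake_result = {
    "documents": [
        SimpleNamespace(
            content="There are over 7,000 languages spoken around the world today.",
            score=1.234,
        ),
    ],
}

for line in summarize(fake_result):
    print(line)
```

In real use, you would pass the dictionary returned by `retriever.run(...)` instead of `fake_result`.
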
### In a RAG pipeline

Set your `OPENAI_API_KEY` as an environment variable and then run the following code:

```python
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.types import DuplicatePolicy
from haystack.utils import Secret
from haystack_integrations.components.retrievers.elasticsearch import (
    ElasticsearchBM25Retriever,
)
from haystack_integrations.document_stores.elasticsearch import (
    ElasticsearchDocumentStore,
)

# Create a RAG query pipeline
prompt_template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200/")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]

# DuplicatePolicy.SKIP is optional, but lets you run the script multiple times without errors
document_store.write_documents(documents=documents, policy=DuplicatePolicy.SKIP)

retriever = ElasticsearchBM25Retriever(document_store=document_store)
rag_pipeline = Pipeline()
rag_pipeline.add_component(name="retriever", instance=retriever)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template),
    name="prompt_builder",
)
# The generator reads the API key from the OPENAI_API_KEY environment variable
rag_pipeline.add_component(
    instance=OpenAIGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY")),
    name="llm",
)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

# Retrieved Documents feed both the prompt and the final answer objects
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "How many languages are spoken around the world today?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    },
)
print(result["answer_builder"]["answers"][0].data)
```

Here’s an example output you might get:

```python
"Over 7,000 languages are spoken around the world today"
```