---
title: "ElasticsearchEmbeddingRetriever"
id: elasticsearchembeddingretriever
slug: "/elasticsearchembeddingretriever"
description: "An embedding-based Retriever compatible with the Elasticsearch Document Store."
---

# ElasticsearchEmbeddingRetriever

An embedding-based Retriever compatible with the Elasticsearch Document Store.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | 1. After a Text Embedder and before a [`PromptBuilder`](../builders/promptbuilder.mdx) in a RAG pipeline 2. The last component in a semantic search pipeline 3. After a Text Embedder and before an [`ExtractiveReader`](../readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | `document_store`: An instance of [ElasticsearchDocumentStore](../../document-stores/elasticsearch-document-store.mdx) |
| **Mandatory run variables** | `query_embedding`: A list of floats |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [Elasticsearch](/reference/integrations-elasticsearch) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/elasticsearch |

</div>

## Overview

The `ElasticsearchEmbeddingRetriever` is an embedding-based Retriever compatible with the `ElasticsearchDocumentStore`. It compares the query embedding with the Document embeddings and returns the Documents from the `ElasticsearchDocumentStore` that are most similar to the query.
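
Conceptually, this kind of retrieval scores stored embeddings against the query embedding and keeps the best matches. The following pure-Python sketch illustrates that idea with cosine similarity. It is not the integration's implementation (Elasticsearch runs an approximate nearest neighbor search on its side), and the `retrieve` helper and the toy three-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, docs, top_k=2):
    """Return the top_k (content, score) pairs, most similar first.

    `docs` is a list of (content, embedding) tuples; hypothetical helper.
    """
    scored = [
        (content, cosine_similarity(query_embedding, embedding))
        for content, embedding in docs
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, made up for illustration only.
docs = [
    ("languages", [0.9, 0.1, 0.0]),
    ("elephants", [0.1, 0.9, 0.0]),
    ("bioluminescence", [0.0, 0.2, 0.9]),
]
results = retrieve([1.0, 0.0, 0.0], docs, top_k=2)
print([content for content, _ in results])  # ['languages', 'elephants']
```

In a real pipeline, the embeddings come from the Embedder components and the similarity search happens inside Elasticsearch.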

When you use the `ElasticsearchEmbeddingRetriever` in your NLP system, make sure both the query embedding and the Document embeddings are available. To do so, add a Document Embedder to your indexing pipeline and a Text Embedder to your query pipeline. Use the same embedding model in both so that the vectors are comparable.

In addition to `query_embedding`, the `ElasticsearchEmbeddingRetriever` accepts optional parameters, including `top_k` (the maximum number of Documents to retrieve) and `filters` to narrow down the search space.
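
As an illustration, these optional parameters can be passed per query through the pipeline's run inputs. The sketch below only builds that input dictionary. The component names `text_embedder` and `retriever`, the `meta.topic` field, and its value are assumptions for the example, and the filter is written in Haystack's `field`/`operator`/`value` format.

```python
# Hypothetical metadata filter: keep only Documents whose meta field
# "topic" equals "languages". Adjust the field name to your own metadata.
filters = {"field": "meta.topic", "operator": "==", "value": "languages"}

# Run inputs for a query pipeline that contains components named
# "text_embedder" and "retriever" (assumed names).
run_inputs = {
    "text_embedder": {"text": "How many languages are there?"},
    "retriever": {"top_k": 3, "filters": filters},
}

# With such a pipeline at hand, the call would be:
# result = query_pipeline.run(run_inputs)
print(run_inputs["retriever"])
```

The filter only has an effect if the indexed Documents carry the corresponding metadata.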

When initializing the Retriever, you can also set `num_candidates`: the number of approximate nearest neighbor candidates considered on each shard. It's an advanced setting you can read more about in the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#tune-approximate-knn-for-speed-accuracy).

The `embedding_similarity_function` used for embedding retrieval must be defined when the corresponding `ElasticsearchDocumentStore` is initialized.
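
A minimal sketch of how these two settings could be wired together, assuming the `elasticsearch-haystack` package is installed and an instance is reachable at the given host. The `build_retriever` helper and the values `"cosine"`, `top_k=5`, and `num_candidates=100` are illustrative choices, not recommendations. The imports sit inside the function only so that the snippet can be loaded without the package.

```python
def build_retriever(hosts: str = "http://localhost:9200/", num_candidates: int = 100):
    """Create a Document Store and a Retriever with explicit similarity and kNN settings."""
    from haystack_integrations.components.retrievers.elasticsearch import (
        ElasticsearchEmbeddingRetriever,
    )
    from haystack_integrations.document_stores.elasticsearch import (
        ElasticsearchDocumentStore,
    )

    # The similarity function is configured on the Document Store.
    document_store = ElasticsearchDocumentStore(
        hosts=hosts,
        embedding_similarity_function="cosine",
    )
    # num_candidates and the default top_k are set on the Retriever.
    return ElasticsearchEmbeddingRetriever(
        document_store=document_store,
        top_k=5,
        num_candidates=num_candidates,
    )
```

Calling `build_retriever()` returns a Retriever that can be added to a query pipeline in place of one created with default settings.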

## Installation

[Install](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html) Elasticsearch and then [start](https://www.elastic.co/guide/en/elasticsearch/reference/current/starting-elasticsearch.html) an instance. Haystack supports Elasticsearch 8.

If you have Docker set up, we recommend pulling the Docker image and running it.

```shell
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.11.1
docker run -p 9200:9200 -e "discovery.type=single-node" -e "ES_JAVA_OPTS=-Xms1024m -Xmx1024m" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.11.1
```

As an alternative, you can go to the [Elasticsearch integration GitHub](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/elasticsearch) repository and start a Docker container running Elasticsearch using the provided `docker-compose.yml`:

```shell
docker compose up
```

Once you have a running Elasticsearch instance, install the `elasticsearch-haystack` integration:

```shell
pip install elasticsearch-haystack
```

## Usage

### In a pipeline

Use this Retriever in a query Pipeline like this:

```python
from haystack_integrations.components.retrievers.elasticsearch import (
    ElasticsearchEmbeddingRetriever,
)
from haystack_integrations.document_stores.elasticsearch import (
    ElasticsearchDocumentStore,
)

from haystack.document_stores.types import DuplicatePolicy
from haystack import Document, Pipeline
from haystack.components.embedders import (
    SentenceTransformersTextEmbedder,
    SentenceTransformersDocumentEmbedder,
)

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200/")

# Use the same model for Documents and queries so the embeddings are comparable.
model = "BAAI/bge-large-en-v1.5"

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]

# Embed the Documents and write them to the Document Store.
document_embedder = SentenceTransformersDocumentEmbedder(model=model)
document_embedder.warm_up()  # load the model before calling run() outside a Pipeline
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(
    documents_with_embeddings.get("documents"),
    policy=DuplicatePolicy.SKIP,
)

# Query pipeline: embed the query text, then retrieve by embedding similarity.
query_pipeline = Pipeline()
query_pipeline.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(model=model),
)
query_pipeline.add_component(
    "retriever",
    ElasticsearchEmbeddingRetriever(document_store=document_store),
)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

result = query_pipeline.run({"text_embedder": {"text": query}})

# Print the best-matching Document.
print(result["retriever"]["documents"][0])
```

The example output would be:

```python
Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 0.87717235, embedding: vector of size 1024)
```