---
title: "ElasticsearchEmbeddingRetriever"
id: elasticsearchembeddingretriever
slug: "/elasticsearchembeddingretriever"
description: "An embedding-based Retriever compatible with the Elasticsearch Document Store."
---

# ElasticsearchEmbeddingRetriever

An embedding-based Retriever compatible with the Elasticsearch Document Store.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | 1. After a Text Embedder and before a [`PromptBuilder`](../builders/promptbuilder.mdx) in a RAG pipeline 2. The last component in the semantic search pipeline 3. After a Text Embedder and before an [`ExtractiveReader`](../readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | `document_store`: An instance of [ElasticsearchDocumentStore](../../document-stores/elasticsearch-document-store.mdx) |
| **Mandatory run variables** | `query_embedding`: A list of floats |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [Elasticsearch](/reference/integrations-elasticsearch) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/elasticsearch |

</div>

## Overview

The `ElasticsearchEmbeddingRetriever` is an embedding-based Retriever compatible with the `ElasticsearchDocumentStore`. It compares the query and Document embeddings and fetches the Documents most relevant to the query from the `ElasticsearchDocumentStore` based on the outcome.

When using the `ElasticsearchEmbeddingRetriever` in your NLP system, ensure it has the query and Document embeddings available. You can do so by adding a Document Embedder to your indexing pipeline and a Text Embedder to your query pipeline.
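
To make the comparison step concrete, here is a minimal, library-free sketch of ranking documents by cosine similarity between a query embedding and document embeddings. It is only a conceptual illustration: the helper names (`cosine_similarity`, `rank_by_similarity`) and the three-dimensional toy vectors are made up for this example, and Elasticsearch itself runs an approximate nearest-neighbor search with the similarity function configured on the Document Store rather than this exhaustive loop.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_embedding: List[float],
    doc_embeddings: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the top_k best matches."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical values, not produced by a real model).
toy_embeddings = {
    "languages": [0.9, 0.1, 0.0],
    "elephants": [0.1, 0.9, 0.1],
    "waves": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

ranked = rank_by_similarity(query, toy_embeddings, top_k=2)
print(ranked)  # "languages" ranks first, followed by "elephants"
```

In a real setup, the embedders produce the vectors and the Retriever delegates the scoring to Elasticsearch, so none of this code is needed in your pipeline.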

In addition to the `query_embedding`, the `ElasticsearchEmbeddingRetriever` accepts other optional parameters, including `top_k` (the maximum number of Documents to retrieve) and `filters` to narrow down the search space.

When initializing the Retriever, you can also set `num_candidates`: the number of approximate nearest neighbor candidates on each shard. It's an advanced setting you can read more about in the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#tune-approximate-knn-for-speed-accuracy).

The `embedding_similarity_function` to use for embedding retrieval must be defined when the corresponding `ElasticsearchDocumentStore` is initialized.

## Installation

[Install](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html) Elasticsearch and then [start](https://www.elastic.co/guide/en/elasticsearch/reference/current/starting-elasticsearch.html) an instance. Haystack supports Elasticsearch 8.

If you have Docker set up, we recommend pulling the Docker image and running it.

```shell
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.11.1
docker run -p 9200:9200 -e "discovery.type=single-node" -e "ES_JAVA_OPTS=-Xms1024m -Xmx1024m" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.11.1
```

As an alternative, you can go to the [Elasticsearch integration GitHub](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/elasticsearch) repository and start a Docker container running Elasticsearch using the provided `docker-compose.yml`:

```shell
docker compose up
```

Once you have a running Elasticsearch instance, install the `elasticsearch-haystack` integration:

```shell
pip install elasticsearch-haystack
```

## Usage

### In a pipeline

Use this Retriever in a query Pipeline like this:

```python
from haystack_integrations.components.retrievers.elasticsearch import (
    ElasticsearchEmbeddingRetriever,
)
from haystack_integrations.document_stores.elasticsearch import (
    ElasticsearchDocumentStore,
)

from haystack.document_stores.types import DuplicatePolicy
from haystack import Document, Pipeline
from haystack.components.embedders import (
    SentenceTransformersTextEmbedder,
    SentenceTransformersDocumentEmbedder,
)

# Connect to the local Elasticsearch instance started during installation
document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200/")

# Use the same model for Documents and queries so the embeddings are comparable
model = "BAAI/bge-large-en-v1.5"

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]

# Embed the Documents and write them to the Document Store
document_embedder = SentenceTransformersDocumentEmbedder(model=model)
document_embedder.warm_up()  # load the model before running the embedder standalone
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(
    documents_with_embeddings.get("documents"),
    policy=DuplicatePolicy.SKIP,
)

# Query pipeline: embed the query text, then retrieve by embedding similarity
query_pipeline = Pipeline()
query_pipeline.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(model=model),
)
query_pipeline.add_component(
    "retriever",
    ElasticsearchEmbeddingRetriever(document_store=document_store),
)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result["retriever"]["documents"][0])
```

The example output would be:

```python
Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 0.87717235, embedding: vector of size 1024)
```
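
The optional `filters` and `top_k` parameters mentioned in the Overview can also be passed when calling the Retriever directly. The sketch below shows one way to do that. It rests on assumptions: the `meta.topic` field is hypothetical (the example Documents above carry no metadata), and `build_topic_filter` and `retrieve_filtered` are illustrative helper names, not part of the integration. The filter dictionary follows Haystack's `field` / `operator` / `value` filter syntax.

```python
from typing import Any, Dict, List


def build_topic_filter(topic: str) -> Dict[str, Any]:
    """Build a Haystack-style filter that keeps Documents whose meta.topic equals `topic`.

    "meta.topic" is a hypothetical metadata field used only for illustration.
    """
    return {"field": "meta.topic", "operator": "==", "value": topic}


def retrieve_filtered(
    retriever: Any,
    query_embedding: List[float],
    topic: str,
    top_k: int = 2,
) -> List[Any]:
    """Run the retriever with run-time `filters` and `top_k`, returning its documents."""
    result = retriever.run(
        query_embedding=query_embedding,
        filters=build_topic_filter(topic),
        top_k=top_k,
    )
    return result["documents"]
```

Here `retriever` is assumed to be an `ElasticsearchEmbeddingRetriever` built as in the pipeline example, and `query_embedding` a list of floats produced by the same Text Embedder model that embedded the Documents.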