---
title: "AutoMergingRetriever"
id: automergingretriever
slug: "/automergingretriever"
description: "Use AutoMergingRetriever to improve search results by returning complete parent documents instead of fragmented chunks when multiple related pieces match a query."
---

# AutoMergingRetriever

Use AutoMergingRetriever to improve search results by returning complete parent documents instead of fragmented chunks when multiple related pieces match a query.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Used after the main Retriever component that returns hierarchical documents. |
| **Mandatory init variables** | `document_store`: Document Store from which to retrieve the parent documents |
| **Mandatory run variables** | `documents`: A list of leaf documents that were matched by a Retriever |
| **Output variables** | `documents`: A list of resulting documents |
| **API reference** | [Retrievers](/reference/retrievers-api) |
| **GitHub link** | [https://github.com/deepset-ai/haystack/blob/dae8c7babaf28d2ffab4f2a8dedecd63e2394fb4/haystack/components/retrievers/auto_merging_retriever.py](https://github.com/deepset-ai/haystack/blob/dae8c7babaf28d2ffab4f2a8dedecd63e2394fb4/haystack/components/retrievers/auto_merging_retriever.py#L116) |

</div>

## Overview

The `AutoMergingRetriever` is a component that works with a hierarchical document structure. It returns parent documents instead of individual leaf documents when a defined threshold is met.

This can be particularly useful when working with paragraphs split into multiple chunks. When several chunks from the same paragraph match your query, the complete paragraph often provides more context and value than the individual pieces alone.

Here is how this Retriever works:

1. It requires documents to be organized in a tree structure, with leaf nodes stored in a document index. See the [`HierarchicalDocumentSplitter`](../preprocessors/hierarchicaldocumentsplitter.mdx) documentation.
2. When searching, it counts how many leaf documents under the same parent match your query.
3. If the share of a parent's leaf documents that matched exceeds your defined threshold, it returns the parent document instead of the individual leaves.

The `AutoMergingRetriever` can currently be used with the following Document Stores:

- [AstraDocumentStore](../../document-stores/astradocumentstore.mdx)
- [ElasticsearchDocumentStore](../../document-stores/elasticsearch-document-store.mdx)
- [OpenSearchDocumentStore](../../document-stores/opensearch-document-store.mdx)
- [PgvectorDocumentStore](../../document-stores/pgvectordocumentstore.mdx)
- [QdrantDocumentStore](../../document-stores/qdrant-document-store.mdx)

## Usage

### On its own

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create a hierarchical document structure with 3 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes=[10, 3], split_overlap=0, split_by="word")
docs = builder.run([original_document])  # a dict with a "documents" key

# Store level-1 parent documents and initialize the retriever.
# The metadata keys carry the `__` prefix, as in the pipeline example below;
# adjust them if your Haystack version uses different key names.
doc_store_parents = InMemoryDocumentStore()
for doc in docs["documents"]:
    if doc.meta["__children_ids"] and doc.meta["__level"] == 1:
        doc_store_parents.write_documents([doc])
retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)

# Assume we retrieved 2 leaf docs from the same parent. The parent document should be returned,
# since it has 3 children and we retrieved 2 of them: 2/3 ≈ 0.67, which is above threshold=0.5.
leaf_docs = [doc for doc in docs["documents"] if not doc.meta["__children_ids"]]
result = retriever.run(leaf_docs[4:6])
print(result)
# {'documents': [Document(id=538..),
# content: 'warm glow over the trees. Birds began to sing.',
# meta: {'block_size': 10, 'parent_id': '835..', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
# 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]}
```

### In a pipeline

This is an example of a Haystack RAG pipeline. It first retrieves leaf-level document chunks using BM25, merges them into higher-level parent documents with `AutoMergingRetriever`, constructs a prompt, and generates an answer using OpenAI's chat model.

```python
from typing import List, Tuple

from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers import AutoMergingRetriever
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.dataclasses import ChatMessage
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy


def indexing(
    documents: List[Document],
) -> Tuple[InMemoryDocumentStore, InMemoryDocumentStore]:
    """Split the documents hierarchically and write children and parents to separate stores."""
    splitter = HierarchicalDocumentSplitter(
        block_sizes={10, 3},
        split_overlap=0,
        split_by="word",
    )
    docs = splitter.run(documents)

    # Level-1 documents are the ones the BM25 retriever searches over
    leaf_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 1]
    leaf_doc_store = InMemoryDocumentStore()
    leaf_doc_store.write_documents(leaf_documents, policy=DuplicatePolicy.OVERWRITE)

    # Level-0 documents are the parents that AutoMergingRetriever can return
    parent_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 0]
    parent_doc_store = InMemoryDocumentStore()
    parent_doc_store.write_documents(parent_documents, policy=DuplicatePolicy.OVERWRITE)

    return leaf_doc_store, parent_doc_store


# Add documents
docs = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]

leaf_doc_store, parent_doc_store = indexing(docs)

prompt_template = [
    ChatMessage.from_system("You are a helpful assistant."),
    ChatMessage.from_user(
        "Given these documents, answer the question.\nDocuments:\n"
        "{% for doc in documents %}{{ doc.content }}{% endfor %}\n"
        "Question: {{question}}\nAnswer:",
    ),
]

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    instance=InMemoryBM25Retriever(document_store=leaf_doc_store),
    name="bm25_retriever",
)
rag_pipeline.add_component(
    instance=AutoMergingRetriever(parent_doc_store, threshold=0.6),
    name="retriever",
)
rag_pipeline.add_component(
    instance=ChatPromptBuilder(
        template=prompt_template,
        required_variables=["question", "documents"],
    ),
    name="prompt_builder",
)
# OpenAIChatGenerator reads the API key from the OPENAI_API_KEY environment variable
rag_pipeline.add_component(instance=OpenAIChatGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("bm25_retriever.documents", "retriever.documents")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.messages", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "How many languages are there?"
result = rag_pipeline.run(
    {
        "bm25_retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    },
)
```
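
### How the threshold decides

The merging rule described in the Overview can be illustrated without Haystack at all. The following is a minimal sketch, not the library's implementation: the function `merge_leaves`, the `(leaf_id, parent_id)` tuples, and the `parents` mapping are hypothetical names chosen for this example. It also assumes a strict "greater than" comparison against the threshold and handles a single level of the tree only.

```python
from collections import defaultdict


def merge_leaves(matched_leaves, parents, threshold):
    """Swap matched leaves for their parent when enough siblings matched.

    matched_leaves: list of (leaf_id, parent_id) tuples returned by a search.
    parents: dict mapping each parent_id to the ids of all its children.
    threshold: share of a parent's children that must be exceeded to merge.
    """
    # Group the matched leaves by the parent they belong to
    by_parent = defaultdict(list)
    for leaf_id, parent_id in matched_leaves:
        by_parent[parent_id].append(leaf_id)

    merged = []
    for parent_id, leaf_ids in by_parent.items():
        # Share of this parent's children that matched the query
        share = len(leaf_ids) / len(parents[parent_id])
        if share > threshold:
            merged.append(parent_id)  # return the parent instead of its leaves
        else:
            merged.extend(leaf_ids)  # keep the individual leaves
    return merged


# Hypothetical tree: p1 has 3 children, p2 has 4
parents = {"p1": ["a", "b", "c"], "p2": ["d", "e", "f", "g"]}
matched = [("a", "p1"), ("b", "p1"), ("d", "p2")]

print(merge_leaves(matched, parents, threshold=0.5))  # ['p1', 'd']
```

Here two of the three children of `p1` matched (2/3 is above 0.5), so `p1` replaces them, while `p2` has only one of four children matched (0.25), so leaf `d` is returned as-is. Raising the threshold to 0.7 would keep all three leaves unmerged.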