---
title: "AutoMergingRetriever"
id: automergingretriever
slug: "/automergingretriever"
description: "Use AutoMergingRetriever to improve search results by returning complete parent documents instead of fragmented chunks when multiple related pieces match a query."
---

# AutoMergingRetriever

Use AutoMergingRetriever to improve search results by returning complete parent documents instead of fragmented chunks when multiple related pieces match a query.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Used after the main Retriever component that returns hierarchical documents. |
| **Mandatory init variables** | `document_store`: Document Store from which to retrieve the parent documents |
| **Mandatory run variables** | `documents`: A list of leaf documents that were matched by a Retriever |
| **Output variables** | `documents`: A list of resulting documents |
| **API reference** | [Retrievers](/reference/retrievers-api) |
| **GitHub link** | [https://github.com/deepset-ai/haystack/blob/dae8c7babaf28d2ffab4f2a8dedecd63e2394fb4/haystack/components/retrievers/auto_merging_retriever.py](https://github.com/deepset-ai/haystack/blob/dae8c7babaf28d2ffab4f2a8dedecd63e2394fb4/haystack/components/retrievers/auto_merging_retriever.py#L116) |

</div>

## Overview

The `AutoMergingRetriever` works with a hierarchical document structure. It returns a parent document instead of its individual leaf documents when the share of that parent's leaves matching the query exceeds a threshold.

This is particularly useful when working with paragraphs split into multiple chunks. When several chunks from the same paragraph match your query, the complete paragraph often provides more context and value than the individual pieces alone.

Here is how this Retriever works:

1. It requires documents to be organized in a tree structure, with leaf nodes stored in a document index. See the [`HierarchicalDocumentSplitter`](../preprocessors/hierarchicaldocumentsplitter.mdx) documentation.
2. When searching, it counts how many leaf documents under the same parent match your query.
3. If the fraction of the parent's children that matched exceeds the `threshold` you set (a value between 0 and 1), it returns the parent document instead of the individual leaves.

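The steps above can be sketched in plain Python. This is only an illustration of the merge decision, not the component's actual implementation: the function `merge_leaves`, the `(leaf_id, parent_id)` pairs, and the `parents` dictionary are hypothetical stand-ins for Haystack documents and the Document Store.

```python
from collections import defaultdict


def merge_leaves(matched_leaves, parents, threshold=0.5):
    """Replace groups of matched leaves with their parent when enough siblings matched.

    matched_leaves: list of (leaf_id, parent_id) pairs (hypothetical shape).
    parents: dict mapping parent_id -> list of all child ids of that parent.
    """
    # Step 2: group the matched leaves by their parent.
    by_parent = defaultdict(list)
    for leaf_id, parent_id in matched_leaves:
        by_parent[parent_id].append(leaf_id)

    results = []
    for parent_id, leaf_ids in by_parent.items():
        # Step 3: compare the matched share of children with the threshold.
        share = len(leaf_ids) / len(parents[parent_id])
        if share > threshold:
            results.append(parent_id)  # enough siblings matched: return the parent
        else:
            results.extend(leaf_ids)  # too few matched: keep the individual leaves
    return results


parents = {"p1": ["a", "b", "c"], "p2": ["d", "e", "f", "g"]}
matched = [("a", "p1"), ("b", "p1"), ("d", "p2")]
print(merge_leaves(matched, parents, threshold=0.5))  # ['p1', 'd']
```

Here two of the three children of `p1` matched (about 0.67), so `p1` replaces them, while only one of the four children of `p2` matched (0.25), so that leaf is returned unchanged.
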
The `AutoMergingRetriever` can currently be used with the following Document Stores:

- [AstraDocumentStore](../../document-stores/astradocumentstore.mdx)
- [ElasticsearchDocumentStore](../../document-stores/elasticsearch-document-store.mdx)
- [OpenSearchDocumentStore](../../document-stores/opensearch-document-store.mdx)
- [PgvectorDocumentStore](../../document-stores/pgvectordocumentstore.mdx)
- [QdrantDocumentStore](../../document-stores/qdrant-document-store.mdx)

## Usage

### On its own

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create a hierarchical document structure with 3 levels (0, 1, and 2).
# The level-1 parent used below has 3 leaf children.
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]

# Store the parent documents (root and level 1) and initialize the retriever.
# The root is included so that any parent looked up during merging can be found.
doc_store_parents = InMemoryDocumentStore()
for doc in docs:
    if doc.meta["__children_ids"] and doc.meta["__level"] in (0, 1):
        doc_store_parents.write_documents([doc])
retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)

# Assume we retrieved 2 leaf docs from the same parent. The parent has 3 children,
# so the matched share is 2/3 (about 0.67), which exceeds threshold=0.5, and the
# parent document is returned instead of the two leaves.
leaf_docs = [doc for doc in docs if not doc.meta["__children_ids"]]
result = retriever.run(leaf_docs[4:6])
print(result)
# {'documents': [Document(id=538..),
# content: 'warm glow over the trees. Birds began to sing.',
# meta: {'__block_size': 10, '__parent_id': '835..', '__children_ids': ['c17...', '3ff...', '352...'], '__level': 1, 'source_id': '835...',
# 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]}
```

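To see why `leaf_docs[4:6]` in the example above selects two of the three children of one parent, the split can be approximated in plain Python. This is a rough sketch that assumes simple whitespace word splitting with no overlap; the helper `split_words` is hypothetical and is not part of Haystack.

```python
def split_words(words, block_size):
    """Split a list of words into consecutive blocks of at most block_size words."""
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]


text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
words = text.split()

# Level 1: blocks of 10 words. Level 2 (leaves): blocks of 3 words inside each level-1 block.
level_1 = split_words(words, 10)
leaves_per_parent = [split_words(block, 3) for block in level_1]

# Number of leaves under each level-1 parent.
leaf_counts = [len(children) for children in leaves_per_parent]

# Flatten the leaves in document order and record the parent index of each one.
leaf_parent_index = [p for p, children in enumerate(leaves_per_parent) for _ in children]

print(len(words))              # 19
print(leaf_counts)             # [4, 3]
print(leaf_parent_index[4:6])  # [1, 1]
```

Under these assumptions the first level-1 block has four leaves and the second has three, so leaves 4 and 5 both belong to the second block, giving the 2/3 share that triggers the merge.
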
### In a pipeline

The following example builds a Haystack RAG pipeline. It first retrieves leaf-level document chunks with BM25, merges them into higher-level parent documents with `AutoMergingRetriever`, constructs a prompt, and generates an answer with an OpenAI chat model.

```python
from typing import List, Tuple

from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers import AutoMergingRetriever
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.dataclasses import ChatMessage
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy


def indexing(
    documents: List[Document],
) -> Tuple[InMemoryDocumentStore, InMemoryDocumentStore]:
    """Split the documents hierarchically and index children and parents in separate stores."""
    splitter = HierarchicalDocumentSplitter(
        block_sizes={10, 3},
        split_overlap=0,
        split_by="word",
    )
    docs = splitter.run(documents)

    # Level-1 blocks are the documents searched by BM25.
    leaf_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 1]
    leaf_doc_store = InMemoryDocumentStore()
    leaf_doc_store.write_documents(leaf_documents, policy=DuplicatePolicy.OVERWRITE)

    # Level-0 documents are the parents that AutoMergingRetriever can return.
    parent_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 0]
    parent_doc_store = InMemoryDocumentStore()
    parent_doc_store.write_documents(parent_documents, policy=DuplicatePolicy.OVERWRITE)

    return leaf_doc_store, parent_doc_store


# Add documents
docs = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]

leaf_docs, parent_docs = indexing(docs)

prompt_template = [
    ChatMessage.from_system("You are a helpful assistant."),
    ChatMessage.from_user(
        "Given these documents, answer the question.\nDocuments:\n"
        "{% for doc in documents %}{{ doc.content }}{% endfor %}\n"
        "Question: {{question}}\nAnswer:",
    ),
]

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    instance=InMemoryBM25Retriever(document_store=leaf_docs),
    name="bm25_retriever",
)
rag_pipeline.add_component(
    instance=AutoMergingRetriever(parent_docs, threshold=0.6),
    name="retriever",
)
rag_pipeline.add_component(
    instance=ChatPromptBuilder(
        template=prompt_template,
        required_variables=["question", "documents"],
    ),
    name="prompt_builder",
)
# OpenAIChatGenerator reads the API key from the OPENAI_API_KEY environment variable.
rag_pipeline.add_component(instance=OpenAIChatGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("bm25_retriever.documents", "retriever.documents")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.messages", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "How many languages are there?"
result = rag_pipeline.run(
    {
        "bm25_retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    },
)
print(result["answer_builder"]["answers"][0].data)
```