---
title: "InMemoryBM25Retriever"
id: inmemorybm25retriever
slug: "/inmemorybm25retriever"
description: "A keyword-based Retriever compatible with InMemoryDocumentStore."
---

# InMemoryBM25Retriever

A keyword-based Retriever compatible with InMemoryDocumentStore.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | In query pipelines:  <br />In a RAG pipeline, before a [`PromptBuilder`](../builders/promptbuilder.mdx)  <br />In a semantic search pipeline, as the last component  <br />In an extractive QA pipeline, before an [`ExtractiveReader`](../readers/extractivereader.mdx) |
| **Mandatory init variables** | `document_store`: An instance of [InMemoryDocumentStore](../../document-stores/inmemorydocumentstore.mdx) |
| **Mandatory run variables** | `query`: A query string |
| **Output variables** | `documents`: A list of documents (matching the query) |
| **API reference** | [Retrievers](/reference/retrievers-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/retrievers/in_memory/bm25_retriever.py |

</div>

## Overview

`InMemoryBM25Retriever` is a keyword-based Retriever that fetches Documents matching a query from a temporary in-memory database. It determines the similarity between Documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.
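
To make "weighted word overlap" concrete, here is a minimal, dependency-free sketch of one common BM25 variant. It is an illustration only and not the document store's actual implementation: the helper names `tokenize` and `bm25_scores`, the simple regex tokenizer, and the default values `k1=1.5` and `b=0.75` are all assumptions chosen for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Hypothetical tokenizer: lowercase and keep runs of word characters.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with an Okapi-style BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a higher inverse document frequency weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # k1 saturates repeated terms; b normalizes for document length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


docs = [
    "There are over 7,000 languages spoken around the world today.",
    "Elephants have been observed to recognize themselves in mirrors.",
    "Bioluminescent waves can be seen in the Maldives and Puerto Rico.",
]
scores = bm25_scores("How many languages are spoken around the world today?", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

In this sketch, the first document shares the most query terms and ranks highest, the third scores slightly above zero because it shares only the common word "the", and the second shares no terms and scores zero.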

Since the `InMemoryBM25Retriever` matches strings based on word overlap, it's often used to find exact matches to names of persons or products, IDs, or well-defined error messages. BM25 is lightweight and simple, yet more complex embedding-based approaches often struggle to beat it on out-of-domain data.

In addition to the `query`, the `InMemoryBM25Retriever` accepts other optional parameters, including `top_k` (the maximum number of Documents to retrieve) and `filters` to narrow down the search space.
Parameters that affect the BM25 scoring itself, such as the specific BM25 algorithm and its parameters, must be set when the corresponding `InMemoryDocumentStore` is initialized.
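
The interplay of `filters` and `top_k` can be pictured as three steps: narrow the candidates, rank what remains, and truncate the ranking. The sketch below is a conceptual illustration in plain Python, not Haystack code: the `retrieve` helper, the dictionary-based documents, and the exact-match filter format are all hypothetical, and plain word overlap stands in for the BM25 score.

```python
def retrieve(query, docs, filters=None, top_k=10):
    """Hypothetical helper: filter by metadata, rank by shared words, keep top_k."""
    filters = filters or {}
    # 1. Narrow the search space: keep documents whose metadata matches every filter.
    candidates = [
        d for d in docs
        if all(d["meta"].get(key) == value for key, value in filters.items())
    ]
    # 2. Rank the candidates (word overlap is a stand-in for a real BM25 score).
    query_words = set(query.lower().split())
    ranked = sorted(
        candidates,
        key=lambda d: len(query_words & set(d["content"].lower().split())),
        reverse=True,
    )
    # 3. Return at most top_k results.
    return ranked[:top_k]


docs = [
    {"content": "Python is a programming language", "meta": {"topic": "code"}},
    {"content": "A python is a large snake", "meta": {"topic": "animals"}},
    {"content": "Snakes are reptiles", "meta": {"topic": "animals"}},
]
results = retrieve("python snake", docs, filters={"topic": "animals"}, top_k=1)
print(results[0]["content"])
```

Because the filter is applied before ranking, the document about the programming language is never scored, even though it contains a query word.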

## Usage

### On its own

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create the in-memory store and index a few example Documents
document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
document_store.write_documents(documents=documents)

# Retrieve the Documents that best match the query
retriever = InMemoryBM25Retriever(document_store=document_store)
result = retriever.run(query="How many languages are spoken around the world today?")
print(result["documents"])
```

### In a Pipeline

#### In a RAG Pipeline

Here's an example of the Retriever in a retrieval-augmented generation pipeline:

```python
import os
from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create a RAG query pipeline
prompt_template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """

# Placeholder key: replace it with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    instance=InMemoryBM25Retriever(document_store=InMemoryDocumentStore()),
    name="retriever",
)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
# The generator exposes its metadata as `meta`, which AnswerBuilder accepts as `meta`
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

# Draw the pipeline
rag_pipeline.draw("./rag_pipeline.png")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
rag_pipeline.get_component("retriever").document_store.write_documents(documents)

# Run the pipeline
question = "How many languages are there?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    },
)
print(result["answer_builder"]["answers"][0])
```

#### In a Document Search Pipeline

Here's how you can use this Retriever in a document search pipeline:

```python
from haystack import Document
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create components and a query pipeline
document_store = InMemoryDocumentStore()
retriever = InMemoryBM25Retriever(document_store=document_store)

pipeline = Pipeline()
pipeline.add_component(instance=retriever, name="retriever")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
document_store.write_documents(documents)

# Run the pipeline
result = pipeline.run(data={"retriever": {"query": "How many languages are there?"}})

# Print the top-ranked Document
print(result["retriever"]["documents"][0])
```