---
title: "InMemoryBM25Retriever"
id: inmemorybm25retriever
slug: "/inmemorybm25retriever"
description: "A keyword-based Retriever compatible with InMemoryDocumentStore."
---

# InMemoryBM25Retriever

A keyword-based Retriever compatible with InMemoryDocumentStore.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | In query pipelines: <br />In a RAG pipeline, before a [`PromptBuilder`](../builders/promptbuilder.mdx) <br />In a semantic search pipeline, as the last component <br />In an extractive QA pipeline, before an [`ExtractiveReader`](../readers/extractivereader.mdx) |
| **Mandatory init variables** | `document_store`: An instance of [InMemoryDocumentStore](../../document-stores/inmemorydocumentstore.mdx) |
| **Mandatory run variables** | `query`: A query string |
| **Output variables** | `documents`: A list of documents (matching the query) |
| **API reference** | [Retrievers](/reference/retrievers-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/retrievers/in_memory/bm25_retriever.py |

</div>

## Overview

`InMemoryBM25Retriever` is a keyword-based Retriever that fetches Documents matching a query from a temporary in-memory database. It determines the similarity between Documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.

Since the `InMemoryBM25Retriever` matches strings based on word overlap, it’s often used to find exact matches to names of persons or products, IDs, or well-defined error messages. The BM25 algorithm is very lightweight and simple. Nevertheless, more complex embedding-based approaches often struggle to beat it on out-of-domain data.
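
To make "weighted word overlap" more concrete, here is a minimal, dependency-free sketch of a BM25-style score. It is an illustration only and not the code Haystack runs: the `tokenize` and `bm25_scores` helpers are hypothetical, and the tokenization rule, the values of `k1` and `b`, and the non-negative IDF variant are all assumptions chosen for readability.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase, split on whitespace, strip surrounding punctuation.
    tokens = (t.strip(".,?!").lower() for t in text.split())
    return [t for t in tokens if t]


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25-style score per document (hypothetical helper)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # only overlapping words contribute
            # Rare terms weigh more (non-negative IDF variant, an assumption).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "There are over 7,000 languages spoken around the world today.",
    "Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    "In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
]
scores = bm25_scores("How many languages are spoken around the world today?", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

A document that shares no words with the query scores zero in this sketch, which is why keyword retrieval works well for exact names and IDs but cannot match paraphrases.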

In addition to the `query`, the `InMemoryBM25Retriever` accepts other optional parameters, including `top_k` (the maximum number of Documents to retrieve) and `filters` to narrow down the search space.
Some parameters that affect BM25 retrieval must be set when the corresponding `InMemoryDocumentStore` is initialized: these include the specific BM25 algorithm and its parameters.

## Usage

### On its own

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
document_store.write_documents(documents=documents)

retriever = InMemoryBM25Retriever(document_store=document_store)
result = retriever.run(query="How many languages are spoken around the world today?")
print(result["documents"])
```

### In a Pipeline

#### In a RAG Pipeline

Here's an example of the Retriever in a retrieval-augmented generation pipeline:

```python
import os
from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create a RAG query pipeline
prompt_template = """
Given these documents, answer the question.\nDocuments:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}

\nQuestion: {{question}}
\nAnswer:
"""

# Placeholder value: replace it with your own API key
os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    instance=InMemoryBM25Retriever(document_store=InMemoryDocumentStore()),
    name="retriever",
)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
# The generator exposes its metadata under the `meta` output
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

# Draw the pipeline
rag_pipeline.draw("./rag_pipeline.png")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
rag_pipeline.get_component("retriever").document_store.write_documents(documents)

# Run the pipeline
question = "How many languages are there?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    },
)
print(result["answer_builder"]["answers"][0])
```

#### In a Document Search Pipeline

Here's how you can use this Retriever in a document search pipeline:

```python
from haystack import Document
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create components and a query pipeline
document_store = InMemoryDocumentStore()
retriever = InMemoryBM25Retriever(document_store=document_store)

pipeline = Pipeline()
pipeline.add_component(instance=retriever, name="retriever")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
document_store.write_documents(documents)

# Run the pipeline
result = pipeline.run(data={"retriever": {"query": "How many languages are there?"}})

print(result["retriever"]["documents"][0])
```