---
title: "WeaviateBM25Retriever"
id: weaviatebm25retriever
slug: "/weaviatebm25retriever"
description: "This is a keyword-based Retriever that fetches Documents matching a query from the Weaviate Document Store."
---

# WeaviateBM25Retriever

This is a keyword-based Retriever that fetches Documents matching a query from the Weaviate Document Store.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | 1. Before a [`PromptBuilder`](../builders/promptbuilder.mdx) in a RAG pipeline 2. The last component in the semantic search pipeline 3. Before an [`ExtractiveReader`](../readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | `document_store`: An instance of a [WeaviateDocumentStore](../../document-stores/weaviatedocumentstore.mdx) |
| **Mandatory run variables** | `query`: A string |
| **Output variables** | `documents`: A list of documents (matching the query) |
| **API reference** | [Weaviate](/reference/integrations-weaviate) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/weaviate |

</div>

## Overview

`WeaviateBM25Retriever` is a keyword-based Retriever that fetches Documents matching a query from [`WeaviateDocumentStore`](../../document-stores/weaviatedocumentstore.mdx). It determines the similarity between Documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.

Since the `WeaviateBM25Retriever` matches strings based on word overlap, it’s often used to find exact matches to names of persons or products, IDs, or well-defined error messages. The BM25 algorithm is lightweight and simple, yet more complex embedding-based approaches often struggle to beat it on out-of-domain data.
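
To make the idea of a weighted word overlap concrete, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is an illustration only, not Weaviate's implementation: the whitespace tokenizer, the `bm25_scores` helper, the sample texts, and the `k1=1.2` / `b=0.75` defaults are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each text in `docs` against `query` with textbook Okapi BM25.

    Hypothetical helper for illustration; assumes `docs` is non-empty.
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption)
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many texts does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # no overlap on this term, no contribution
            # Rare terms get a higher inverse document frequency weight
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by text length via b
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "error E1234 disk quota exceeded",
    "how to bake a pizza at home",
    "pizza dough recipe with error free steps",
]
scores = bm25_scores("error E1234", docs)
```

Only texts that share a term with the query get a non-zero score, and the rare identifier `E1234` weighs more than the more common word `error`, which is why BM25 suits exact lookups of IDs and error messages.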

If you want a semantic match between a query and documents, use the [`WeaviateEmbeddingRetriever`](weaviateembeddingretriever.mdx), which uses vectors created by embedding models to retrieve relevant information.

### Parameters

In addition to the `query`, the `WeaviateBM25Retriever` accepts other optional parameters, including `top_k` (the maximum number of Documents to retrieve) and `filters` to narrow down the search space.
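
As a sketch of how these optional parameters might be passed, the snippet below builds a filter dictionary in the Haystack 2.x filter syntax and hands it to `run()` together with `top_k`. The `build_filters` and `search` helpers and the `meta.language` and `meta.year` fields are hypothetical; adapt them to the metadata your Documents actually carry.

```python
def build_filters(language: str, min_year: int) -> dict:
    """Hypothetical helper: combine two metadata conditions with a logical AND."""
    return {
        "operator": "AND",
        "conditions": [
            {"field": "meta.language", "operator": "==", "value": language},
            {"field": "meta.year", "operator": ">=", "value": min_year},
        ],
    }


filters = build_filters("en", 2020)


def search(retriever, query: str, top_k: int = 5) -> list:
    """Run a keyword search restricted by `filters`.

    `retriever` is assumed to be a WeaviateBM25Retriever connected to a
    running Weaviate instance; it is not created in this snippet.
    """
    result = retriever.run(query=query, filters=filters, top_k=top_k)
    return result["documents"]
```

With a retriever connected to a running document store, `search(retriever, "bioluminescent waves")` would return at most five Documents that satisfy both conditions.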

### Usage

#### Installation

To start using Weaviate with Haystack, install the package with:

```shell
pip install weaviate-haystack
```

#### On its own

This Retriever needs an instance of `WeaviateDocumentStore` and indexed Documents to run.

```python
from haystack_integrations.document_stores.weaviate.document_store import (
    WeaviateDocumentStore,
)
from haystack_integrations.components.retrievers.weaviate import WeaviateBM25Retriever

# Connect to a Weaviate instance that already holds indexed Documents
document_store = WeaviateDocumentStore(url="http://localhost:8080")

retriever = WeaviateBM25Retriever(document_store=document_store)

# Returns a dictionary with the matching Documents under the "documents" key
result = retriever.run(query="How to make a pizza", top_k=3)
print(result["documents"])
```
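
The `documents` list returned by `run()` can then be post-processed like any Python list. The sketch below formats hits for display. `format_hits` is a hypothetical helper, and `SimpleNamespace` stand-ins with made-up contents and scores replace real `Document` objects so the snippet runs without a Weaviate server. It assumes each Document exposes `content` and `score` attributes, as Haystack's `Document` does.

```python
from types import SimpleNamespace


def format_hits(result: dict, max_chars: int = 60) -> list[str]:
    """Hypothetical helper: one display line per Document, highest score first."""

    def score_of(doc) -> float:
        # Treat a missing score as 0.0 so sorting and formatting never fail
        return doc.score if doc.score is not None else 0.0

    ranked = sorted(result["documents"], key=score_of, reverse=True)
    lines = []
    for rank, doc in enumerate(ranked, start=1):
        snippet = (doc.content or "")[:max_chars]
        lines.append(f"{rank}. score={score_of(doc):.3f} {snippet}")
    return lines


# Stand-in for the dictionary that `retriever.run(...)` returns (made-up data)
fake_result = {
    "documents": [
        SimpleNamespace(content="Knead the dough for ten minutes.", score=0.42),
        SimpleNamespace(content="Bake the pizza at high heat.", score=1.37),
    ]
}
lines = format_hits(fake_result)
```

In real use, pass the dictionary returned by `retriever.run(...)` instead of `fake_result`.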

#### In a Pipeline

```python
from haystack_integrations.document_stores.weaviate.document_store import (
    WeaviateDocumentStore,
)
from haystack_integrations.components.retrievers.weaviate import (
    WeaviateBM25Retriever,
)

from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.types import DuplicatePolicy

# Create a RAG query pipeline
prompt_template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """

document_store = WeaviateDocumentStore(url="http://localhost:8080")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]

# DuplicatePolicy.SKIP is optional, but it lets you run the script multiple times without errors
document_store.write_documents(documents=documents, policy=DuplicatePolicy.SKIP)

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    name="retriever",
    instance=WeaviateBM25Retriever(document_store=document_store),
)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template),
    name="prompt_builder",
)
# OpenAIGenerator reads the API key from the OPENAI_API_KEY environment variable
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
# The generator's metadata output and the AnswerBuilder input are both named `meta`
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "How many languages are spoken around the world today?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    },
)
print(result["answer_builder"]["answers"][0])
```