---
title: "PgvectorKeywordRetriever"
id: pgvectorkeywordretriever
slug: "/pgvectorkeywordretriever"
description: "This is a keyword-based Retriever that fetches documents matching a query from the Pgvector Document Store."
---

# PgvectorKeywordRetriever

This is a keyword-based Retriever that fetches documents matching a query from the Pgvector Document Store.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | 1. Before a [`PromptBuilder`](../builders/promptbuilder.mdx) in a RAG pipeline 2. The last component in a semantic search pipeline 3. Before an [`ExtractiveReader`](../readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | `document_store`: An instance of a [PgvectorDocumentStore](../../document-stores/pgvectordocumentstore.mdx) |
| **Mandatory run variables** | `query`: A string |
| **Output variables** | `documents`: A list of documents matching the query |
| **API reference** | [Pgvector](/reference/integrations-pgvector) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/pgvector |

</div>

## Overview

The `PgvectorKeywordRetriever` is a keyword-based Retriever compatible with the `PgvectorDocumentStore`.

The component uses the `ts_rank_cd` function of PostgreSQL to rank the documents.
It considers how often the query terms appear in the document, how close together the terms are, and how important the part of the document where they occur is.
For more details, see the [PostgreSQL documentation](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING).

Keep in mind that, unlike similar components such as `ElasticsearchBM25Retriever`, this Retriever does not apply fuzzy search out of the box, so you need to formulate the query carefully to avoid getting zero results.
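To see why query wording matters, here is a toy, pure-Python illustration that does not touch the database. It assumes the query terms are combined with AND, as PostgreSQL's `plainto_tsquery` does; whether the Retriever builds its query exactly that way is an assumption here, and the helpers `tokenize` and `matches_all_terms` are hypothetical. Real full-text search also applies stemming and stop-word removal, which this sketch skips.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_all_terms(query: str, document: str) -> bool:
    """Return True only if every query term occurs in the document (AND semantics)."""
    return tokenize(query) <= tokenize(document)


doc = "There are over 7,000 languages spoken around the world today."

print(matches_all_terms("languages spoken today", doc))  # True
print(matches_all_terms("langauges spoken today", doc))  # False: typo, no fuzzy matching
print(matches_all_terms("languages spoken in Europe", doc))  # False: extra terms not in the document
```

Under this assumption, a single misspelled or extra term is enough to exclude a document, so shorter queries made of terms you expect to appear in the text tend to be safer.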

In addition to the `query`, the `PgvectorKeywordRetriever` accepts other optional parameters, including `top_k` (the maximum number of documents to retrieve) and `filters` to narrow the search space.
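A minimal sketch of passing these optional parameters at run time. The filter uses Haystack's `field`/`operator`/`value` dictionary format; the `meta.category` field and its `"news"` value are hypothetical and assume your documents carry that metadata. `search_news` is an illustrative helper, not part of the integration.

```python
from typing import Any

# Hypothetical filter: keep only documents whose metadata has category == "news"
NEWS_FILTER: dict[str, Any] = {
    "field": "meta.category",
    "operator": "==",
    "value": "news",
}


def search_news(retriever: Any, query: str, top_k: int = 3) -> list[Any]:
    """Run a keyword search restricted by NEWS_FILTER and return at most top_k documents."""
    result = retriever.run(query=query, filters=NEWS_FILTER, top_k=top_k)
    return result["documents"]
```

With a `PgvectorKeywordRetriever` instance, you could call `search_news(retriever, "bioluminescent waves")`.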

### Installation

To quickly set up a PostgreSQL database with pgvector, you can use Docker:

```shell
docker run -d -p 5432:5432 -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=postgres ankane/pgvector
```

For more information on how to install pgvector, visit the [pgvector GitHub repository](https://github.com/pgvector/pgvector).

Install the `pgvector-haystack` integration:

```shell
pip install pgvector-haystack
```

## Usage

### On its own

This Retriever needs the `PgvectorDocumentStore` and indexed documents to run.

Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
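For a quick local test, you can also set the variable from Python before creating the Document Store. The sketch below builds a connection string for the Docker container from the Installation section (user, password, and database all `postgres`, port 5432 on `localhost`). `build_conn_str` is a hypothetical helper; adjust the values to your own setup.

```python
import os
from urllib.parse import quote


def build_conn_str(user: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble a PostgreSQL connection URI, percent-encoding the credentials."""
    return f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/{database}"


# Values match the Docker command shown in the Installation section
os.environ["PG_CONN_STR"] = build_conn_str("postgres", "postgres", "localhost", 5432, "postgres")
print(os.environ["PG_CONN_STR"])  # postgresql://postgres:postgres@localhost:5432/postgres
```

Percent-encoding matters when a password contains characters such as `@` or `/`, which would otherwise break the URI.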

```python
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import (
    PgvectorKeywordRetriever,
)

# The Document Store reads the connection string from the PG_CONN_STR environment variable
document_store = PgvectorDocumentStore()
retriever = PgvectorKeywordRetriever(document_store=document_store)

# The output is a dictionary with the matching documents under the "documents" key
result = retriever.run(query="my nice query")
print(result["documents"])
```
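The run output is a dictionary holding the retrieved documents. As a small sketch of inspecting it, the hypothetical helper below formats each document's score and a content preview. It only assumes the objects expose `content` and `score` attributes, as Haystack `Document` objects do, and it treats a missing score as unknown.

```python
from typing import Any, Iterable


def format_hits(documents: Iterable[Any], preview_chars: int = 60) -> list[str]:
    """Return one line per document: its score (if set) and the start of its content."""
    lines = []
    for doc in documents:
        score = "n/a" if doc.score is None else f"{doc.score:.3f}"
        content = doc.content or ""
        lines.append(f"[{score}] {content[:preview_chars]}")
    return lines
```

For example, `print("\n".join(format_hits(result["documents"])))`, where `result` is the value returned by `retriever.run(...)`.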

### In a RAG pipeline

To run this code, you need to:

- Set an environment variable `OPENAI_API_KEY` with your OpenAI API key.
- Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.

```python
from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.types import DuplicatePolicy

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import (
    PgvectorKeywordRetriever,
)

# Create a RAG query pipeline
prompt_template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """

document_store = PgvectorDocumentStore(
    language="english",  # this parameter influences text parsing for keyword retrieval
    recreate_table=True,
)

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]

# The DuplicatePolicy.SKIP parameter is optional, but it lets you run the script multiple times without raising errors
document_store.write_documents(documents=documents, policy=DuplicatePolicy.SKIP)

retriever = PgvectorKeywordRetriever(document_store=document_store)
rag_pipeline = Pipeline()
rag_pipeline.add_component(name="retriever", instance=retriever)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

# Retrieved documents feed both the prompt and the final answer object
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "languages spoken around the world today"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    },
)
print(result["answer_builder"])
```