---
title: "PgvectorKeywordRetriever"
id: pgvectorkeywordretriever
slug: "/pgvectorkeywordretriever"
description: "This is a keyword-based Retriever that fetches documents matching a query from the Pgvector Document Store."
---

# PgvectorKeywordRetriever

This is a keyword-based Retriever that fetches documents matching a query from the Pgvector Document Store.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | 1. Before a [`PromptBuilder`](../builders/promptbuilder.mdx) in a RAG pipeline 2. The last component in the semantic search pipeline 3. Before an [`ExtractiveReader`](../readers/extractivereader.mdx) in an extractive QA pipeline |
| **Mandatory init variables** | `document_store`: An instance of a [PgvectorDocumentStore](../../document-stores/pgvectordocumentstore.mdx) |
| **Mandatory run variables** | `query`: A string |
| **Output variables** | `documents`: A list of documents (matching the query) |
| **API reference** | [Pgvector](/reference/integrations-pgvector) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/pgvector |

</div>

## Overview

The `PgvectorKeywordRetriever` is a keyword-based Retriever compatible with the `PgvectorDocumentStore`.

The component uses the `ts_rank_cd` function of PostgreSQL to rank the documents.
It considers how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document where they occur is.
For more details, see the [PostgreSQL documentation](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING).

Keep in mind that, unlike similar components such as `ElasticsearchBM25Retriever`, this Retriever does not apply fuzzy search out of the box, so formulate the query carefully to avoid getting zero results.
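
To make the zero-results caveat concrete, here is a toy, pure-Python sketch of strict keyword matching. It is not how PostgreSQL or the Retriever is implemented: it does no stemming, stop-word removal, or ranking, and it assumes that every query term must appear in a document (AND semantics). The helper names `tokenize` and `keyword_match` are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_match(query: str, docs: list[str]) -> list[str]:
    """Return the documents that contain every query term exactly (no fuzziness)."""
    terms = tokenize(query)
    return [doc for doc in docs if terms and terms <= tokenize(doc)]


docs = [
    "There are over 7,000 languages spoken around the world today.",
    "Elephants have been observed to recognize themselves in mirrors.",
]

exact = keyword_match("languages spoken", docs)  # both terms occur in the first document
typo = keyword_match("langauges spoken", docs)  # the misspelled term matches nothing
extra_term = keyword_match("languages spoken underwater", docs)  # one unmatched term is enough to miss
```

Under these assumptions, a single typo or one extra term that does not occur in any document is enough to empty the result list, which is why short, precise queries tend to work best with strict keyword retrieval.
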

In addition to the `query`, the `PgvectorKeywordRetriever` accepts other optional parameters, including `top_k` (the maximum number of documents to retrieve) and `filters` to narrow the search space.

### Installation

To quickly set up a PostgreSQL database with pgvector, you can use Docker:

```shell
docker run -d -p 5432:5432 -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=postgres ankane/pgvector
```

For more information on how to install pgvector, visit the [pgvector GitHub repository](https://github.com/pgvector/pgvector).

Install the `pgvector-haystack` integration:

```shell
pip install pgvector-haystack
```

## Usage

### On its own

This Retriever needs the `PgvectorDocumentStore` and indexed documents to run.

Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.

```python
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import (
    PgvectorKeywordRetriever,
)

# The connection string is read from the PG_CONN_STR environment variable
document_store = PgvectorDocumentStore()
retriever = PgvectorKeywordRetriever(document_store=document_store)

retriever.run(query="my nice query")
```

### In a RAG pipeline

The prerequisites necessary for running this code are:

- Set an environment variable `OPENAI_API_KEY` with your OpenAI API key.
- Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
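
For a local database started with the Docker command from the Installation section, `PG_CONN_STR` could also be set from Python before the document store is used. This is a minimal sketch: the helper `build_pg_conn_str` is hypothetical, and the values assume the user, password, database name, and port from that Docker command with the database reachable on `localhost`.

```python
import os
from urllib.parse import quote


def build_pg_conn_str(user: str, password: str, host: str, port: int, dbname: str) -> str:
    """Build a PostgreSQL connection URI, percent-encoding the credentials."""
    return f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/{dbname}"


# Values assumed from the Docker command in the Installation section
conn_str = build_pg_conn_str("postgres", "postgres", "localhost", 5432, "postgres")

# setdefault keeps a value that is already defined in the environment
os.environ.setdefault("PG_CONN_STR", conn_str)
```

With both variables set, the pipeline code looks like this:
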

```python
from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.types import DuplicatePolicy

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import (
    PgvectorKeywordRetriever,
)

# Create a RAG query pipeline
prompt_template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """

document_store = PgvectorDocumentStore(
    language="english",  # this parameter influences text parsing for keyword retrieval
    recreate_table=True,
)

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]

# DuplicatePolicy.SKIP param is optional, but useful to run the script multiple times without throwing errors
document_store.write_documents(documents=documents, policy=DuplicatePolicy.SKIP)

retriever = PgvectorKeywordRetriever(document_store=document_store)
rag_pipeline = Pipeline()
rag_pipeline.add_component(name="retriever", instance=retriever)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "languages spoken around the world today"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    },
)
print(result["answer_builder"])
```
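
### With `top_k` and `filters`

The optional `top_k` and `filters` run parameters described in the Overview can be passed together with the `query`. The sketch below assumes the Haystack 2.x filter dictionary syntax. The metadata field `meta.category` and the helper `retrieve_filtered` are hypothetical and only serve as illustration; the filter only matches documents that were written with such a metadata field.

```python
from typing import Any

# Hypothetical metadata filter: keep only documents whose meta["category"] is "animals"
filters = {"field": "meta.category", "operator": "==", "value": "animals"}


def retrieve_filtered(retriever: Any, query: str, top_k: int = 3) -> list:
    """Run a keyword retriever with a result limit and a metadata filter."""
    result = retriever.run(query=query, filters=filters, top_k=top_k)
    # The retrieved documents are returned under the "documents" key
    return result["documents"]
```

With a `PgvectorKeywordRetriever` instance such as the `retriever` created earlier, a call like `retrieve_filtered(retriever, "elephants", top_k=1)` would return at most one matching document.
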