---
title: "TopPSampler"
id: toppsampler
slug: "/toppsampler"
description: "Uses nucleus sampling to filter documents."
---

# TopPSampler

Uses nucleus sampling to filter documents.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | After a [Ranker](../rankers.mdx) |
| **Mandatory init variables** | `top_p`: A float between 0 and 1 representing the cumulative probability threshold for document selection |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [Samplers](/reference/samplers-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/samplers/top_p.py |

</div>

## Overview

Top-P (nucleus) sampling is a method that selects a subset of documents based on their cumulative probabilities. Instead of choosing a fixed number of documents, it keeps the highest-scoring documents whose cumulative probability reaches a specified threshold. Put simply, `TopPSampler` provides a way to efficiently select the most relevant documents based on their similarity to a given query.

The practical goal of `TopPSampler` is to return the list of top-scoring documents whose cumulative probability exceeds the `top_p` value. For example, when `top_p` is set to a high value, more documents are returned, which can result in more varied outputs. The value must be between 0 and 1. By default, the component reads the similarity scores from each document's `score` field.

The component's `run()` method takes in a list of scored documents, normalizes their scores into probabilities, and then filters the documents based on the cumulative probability of these scores.
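To make the selection logic concrete, here is a minimal sketch of the idea in plain Python. This is an illustration, not Haystack's implementation: the function name `top_p_select` is made up for this example, and it assumes scores are softmax-normalized into probabilities before the cumulative cutoff is applied.

```python
import math

def top_p_select(scores, top_p):
    """Keep the highest-scoring items whose cumulative probability reaches top_p."""
    # Softmax over the raw scores -> probabilities that sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk through items from highest to lowest probability,
    # stopping once the cumulative mass reaches top_p
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    selected, cumulative = [], 0.0
    for i in order:
        selected.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return selected

# With the scores from the standalone example below and top_p=0.95,
# one document already carries over 95% of the probability mass:
print(top_p_select([-10.6, -8.9, -4.6], top_p=0.95))  # -> [2]
```

Raising `top_p` (for example to 0.99) admits more of the lower-scoring documents, which is why a higher threshold yields more varied outputs.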
## Usage

### On its own

```python
from haystack import Document
from haystack.components.samplers import TopPSampler

sampler = TopPSampler(top_p=0.99, score_field="similarity_score")
docs = [
    Document(content="Berlin", meta={"similarity_score": -10.6}),
    Document(content="Belgrade", meta={"similarity_score": -8.9}),
    Document(content="Sarajevo", meta={"similarity_score": -4.6}),
]
output = sampler.run(documents=docs)
docs = output["documents"]
print(docs)
```

### In a pipeline

To best understand how you can use a `TopPSampler` and which components to pair it with, explore the following example.

```python
# import necessary dependencies
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.rankers import SentenceTransformersSimilarityRanker
from haystack.components.routers.file_type_router import FileTypeRouter
from haystack.components.samplers import TopPSampler
from haystack.components.websearch import SerperDevWebSearch
from haystack.utils import Secret
from haystack.dataclasses import ChatMessage

# initialize the components
web_search = SerperDevWebSearch(api_key=Secret.from_token("<your-api-key>"), top_k=10)

lcf = LinkContentFetcher()
html_converter = HTMLToDocument()
router = FileTypeRouter(["text/html", "application/pdf", "application/octet-stream"])

# ChatPromptBuilder uses a different template format with ChatMessage
template = [
    ChatMessage.from_user(
        "Given these paragraphs below: \n {% for doc in documents %}{{ doc.content }}{% endfor %}\n\nAnswer the question: {{ query }}",
    ),
]
# set required_variables to avoid warnings in multi-branch pipelines
prompt_builder = ChatPromptBuilder(
    template=template,
    required_variables=["documents", "query"],
)

# The Ranker plays an important role: it assigns scores to the top 10 found
# documents based on our query. The TopPSampler needs these scores to work.
similarity_ranker = SentenceTransformersSimilarityRanker(top_k=10)
splitter = DocumentSplitter()
# Setting top_p to 0.95 helps identify the documents most relevant to our query.
top_p_sampler = TopPSampler(top_p=0.95)

llm = OpenAIChatGenerator(api_key=Secret.from_token("<your-api-key>"))

# create the pipeline and add the components to it
pipe = Pipeline()
pipe.add_component("search", web_search)
pipe.add_component("fetcher", lcf)
pipe.add_component("router", router)
pipe.add_component("converter", html_converter)
pipe.add_component("splitter", splitter)
pipe.add_component("ranker", similarity_ranker)
pipe.add_component("sampler", top_p_sampler)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)

# Connect the pipeline components in the order you need them. If a component has
# more than one input or output, indicate which output connects to which input
# using the format ("component_name.output_name", "component_name.input_name").
pipe.connect("search.links", "fetcher.urls")
pipe.connect("fetcher.streams", "router.sources")
pipe.connect("router.text/html", "converter.sources")
pipe.connect("converter.documents", "splitter.documents")
pipe.connect("splitter.documents", "ranker.documents")
pipe.connect("ranker.documents", "sampler.documents")
pipe.connect("sampler.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "llm.messages")

# run the pipeline
question = "Why are cats afraid of cucumbers?"
query_dict = {"query": question}

result = pipe.run(
    data={"search": query_dict, "prompt_builder": query_dict, "ranker": query_dict},
)
print(result)
```