---
title: "TopPSampler"
id: toppsampler
slug: "/toppsampler"
description: "Uses nucleus sampling to filter documents."
---

# TopPSampler

Uses nucleus sampling to filter documents.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | After a [Ranker](../rankers.mdx) |
| **Mandatory init variables** | `top_p`: A float between 0 and 1 representing the cumulative probability threshold for document selection |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [Samplers](/reference/samplers-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/samplers/top_p.py |

</div>

## Overview

Top-P (nucleus) sampling is a method that selects a subset of documents based on their cumulative probabilities. Instead of choosing a fixed number of documents, it keeps the highest-scoring documents whose cumulative probability reaches a specified threshold. Put simply, `TopPSampler` provides a way to efficiently select the most relevant documents based on their similarity to a given query.

The practical goal of `TopPSampler` is to return the list of top-scoring documents whose cumulative probability exceeds the `top_p` value. For example, when `top_p` is set to a high value, more documents are returned, which can result in more varied outputs. The value must be between 0 and 1. By default, the component reads the similarity scores from each document's `score` field.

The component's `run()` method takes in a list of scored documents, normalizes their scores into probabilities, and then filters the documents based on the cumulative probability of these scores.
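To make the selection logic concrete, here is a minimal sketch of the idea in plain Python. This is an illustration, not Haystack's implementation: the function name `top_p_select` is made up for this example, and it assumes scores are softmax-normalized into probabilities before the cumulative cutoff is applied.

```python
import math

def top_p_select(scores, top_p):
    """Keep the highest-scoring items whose cumulative probability reaches top_p."""
    # Softmax over the raw scores -> probabilities that sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk through items from highest to lowest probability,
    # stopping once the cumulative mass reaches top_p
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    selected, cumulative = [], 0.0
    for i in order:
        selected.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return selected

# With the scores from the standalone example below and top_p=0.95,
# one document already carries over 95% of the probability mass:
print(top_p_select([-10.6, -8.9, -4.6], top_p=0.95))  # -> [2]
```

Raising `top_p` (for example to 0.99) admits more of the lower-scoring documents, which is why a higher threshold yields more varied outputs.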
## Usage

### On its own

```python
from haystack import Document
from haystack.components.samplers import TopPSampler

sampler = TopPSampler(top_p=0.99, score_field="similarity_score")
docs = [
    Document(content="Berlin", meta={"similarity_score": -10.6}),
    Document(content="Belgrade", meta={"similarity_score": -8.9}),
    Document(content="Sarajevo", meta={"similarity_score": -4.6}),
]
output = sampler.run(documents=docs)
docs = output["documents"]
print(docs)
```

### In a pipeline

To best understand how you can use a `TopPSampler` and which components to pair it with, explore the following example.

```python
# import necessary dependencies
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.rankers import SentenceTransformersSimilarityRanker
from haystack.components.routers.file_type_router import FileTypeRouter
from haystack.components.samplers import TopPSampler
from haystack.components.websearch import SerperDevWebSearch
from haystack.utils import Secret
from haystack.dataclasses import ChatMessage

# initialize the components
web_search = SerperDevWebSearch(api_key=Secret.from_token("<your-api-key>"), top_k=10)

lcf = LinkContentFetcher()
html_converter = HTMLToDocument()
router = FileTypeRouter(["text/html", "application/pdf", "application/octet-stream"])

# ChatPromptBuilder uses a different template format with ChatMessage
template = [
    ChatMessage.from_user(
        "Given these paragraphs below: \n {% for doc in documents %}{{ doc.content }}{% endfor %}\n\nAnswer the question: {{ query }}",
    ),
]
# set required_variables to avoid warnings in multi-branch pipelines
prompt_builder = ChatPromptBuilder(
    template=template,
    required_variables=["documents", "query"],
)

# The Ranker plays an important role: it assigns scores to the top 10 found
# documents based on our query. The TopPSampler needs these scores to work.
similarity_ranker = SentenceTransformersSimilarityRanker(top_k=10)
splitter = DocumentSplitter()
# Setting top_p to 0.95 helps identify the documents most relevant to our query.
top_p_sampler = TopPSampler(top_p=0.95)

llm = OpenAIChatGenerator(api_key=Secret.from_token("<your-api-key>"))

# create the pipeline and add the components to it
pipe = Pipeline()
pipe.add_component("search", web_search)
pipe.add_component("fetcher", lcf)
pipe.add_component("router", router)
pipe.add_component("converter", html_converter)
pipe.add_component("splitter", splitter)
pipe.add_component("ranker", similarity_ranker)
pipe.add_component("sampler", top_p_sampler)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)

# Connect the pipeline components in the order you need them. If a component has
# more than one input or output, indicate which output connects to which input
# using the format ("component_name.output_name", "component_name.input_name").
pipe.connect("search.links", "fetcher.urls")
pipe.connect("fetcher.streams", "router.sources")
pipe.connect("router.text/html", "converter.sources")
pipe.connect("converter.documents", "splitter.documents")
pipe.connect("splitter.documents", "ranker.documents")
pipe.connect("ranker.documents", "sampler.documents")
pipe.connect("sampler.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "llm.messages")

# run the pipeline
question = "Why are cats afraid of cucumbers?"
query_dict = {"query": question}

result = pipe.run(
    data={"search": query_dict, "prompt_builder": query_dict, "ranker": query_dict},
)
print(result)
```