---
title: "Summarizers"
id: experimental-summarizers-api
description: "Components that summarize texts into concise versions."
slug: "/experimental-summarizers-api"
---

<a id="haystack_experimental.components.summarizers.llm_summarizer"></a>

## Module haystack\_experimental.components.summarizers.llm\_summarizer

<a id="haystack_experimental.components.summarizers.llm_summarizer.LLMSummarizer"></a>

### LLMSummarizer

Summarizes text using a language model.

It's inspired by the OpenAI Cookbook example on summarizing long documents: https://cookbook.openai.com/examples/summarizing_long_documents

Example
```python
from haystack_experimental.components.summarizers.llm_summarizer import LLMSummarizer
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack import Document

text = ("Machine learning is a subset of artificial intelligence that provides systems "
        "the ability to automatically learn and improve from experience without being "
        "explicitly programmed. The process of learning begins with observations or data. "
        "Supervised learning algorithms build a mathematical model of sample data, known as "
        "training data, in order to make predictions or decisions. Unsupervised learning "
        "algorithms take a set of data that contains only inputs and find structure in the data. "
        "Reinforcement learning is an area of machine learning where an agent learns to behave "
        "in an environment by performing actions and seeing the results. Deep learning uses "
        "artificial neural networks to model complex patterns in data. Neural networks consist "
        "of layers of connected nodes, each performing a simple computation.")

doc = Document(content=text)
chat_generator = OpenAIChatGenerator(model="gpt-4")
summarizer = LLMSummarizer(chat_generator=chat_generator)
summarizer.run(documents=[doc])
```

<a id="haystack_experimental.components.summarizers.llm_summarizer.LLMSummarizer.__init__"></a>

#### LLMSummarizer.\_\_init\_\_

```python
def __init__(chat_generator: ChatGenerator,
             system_prompt: str | None = "Rewrite this text in summarized form.",
             summary_detail: float = 0,
             minimum_chunk_size: int | None = 500,
             chunk_delimiter: str = ".",
             summarize_recursively: bool = False,
             split_overlap: int = 0)
```

Initialize the Summarizer component.

**Arguments**:

- `chat_generator`: A ChatGenerator instance to use for summarization.
- `system_prompt`: The prompt instructing the LLM to summarize text. Defaults to:
"Rewrite this text in summarized form."
- `summary_detail`: The level of detail for the summary (0-1), defaults to 0.
This parameter controls the trade-off between conciseness and completeness by adjusting how many
chunks the text is divided into. At detail=0, the text is processed as a single chunk (or very few
chunks), producing the most concise summary. At detail=1, the text is split into the maximum number
of chunks allowed by minimum_chunk_size, enabling more granular analysis and detailed summaries.
The formula uses linear interpolation: num_chunks = 1 + detail * (max_chunks - 1), where max_chunks
is determined by dividing the document length by minimum_chunk_size.
- `minimum_chunk_size`: The minimum token count per chunk, defaults to 500.
- `chunk_delimiter`: The character used to determine separator priority.
"." uses sentence-based splitting, "\n" uses paragraph-based splitting. Defaults to ".".
- `summarize_recursively`: Whether to use previous summaries as context, defaults to False.
- `split_overlap`: Number of tokens to overlap between consecutive chunks, defaults to 0.

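The chunk-count interpolation described above can be sketched in plain Python. The helper below is illustrative (its name and the token-count argument are not part of the component's API); it only mirrors the documented formula:

```python
def estimate_num_chunks(document_tokens: int, detail: float,
                        minimum_chunk_size: int = 500) -> int:
    """Linear interpolation between 1 chunk (detail=0) and max_chunks (detail=1)."""
    if not 0 <= detail <= 1:
        raise ValueError("detail must be between 0 and 1")
    # max_chunks comes from dividing the document length by minimum_chunk_size.
    max_chunks = max(1, document_tokens // minimum_chunk_size)
    return int(1 + detail * (max_chunks - 1))

# For a 5000-token document with the default minimum_chunk_size of 500:
print(estimate_num_chunks(5000, 0.0))  # 1  (single chunk, most concise)
print(estimate_num_chunks(5000, 1.0))  # 10 (max_chunks, most detailed)
print(estimate_num_chunks(5000, 0.5))  # 5  (roughly halfway)
```

Documents shorter than `minimum_chunk_size` always yield a single chunk, regardless of the detail value.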
<a id="haystack_experimental.components.summarizers.llm_summarizer.LLMSummarizer.warm_up"></a>

#### LLMSummarizer.warm\_up

```python
def warm_up()
```

Warm up the chat generator and document splitter components.

<a id="haystack_experimental.components.summarizers.llm_summarizer.LLMSummarizer.to_dict"></a>

#### LLMSummarizer.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="haystack_experimental.components.summarizers.llm_summarizer.LLMSummarizer.from_dict"></a>

#### LLMSummarizer.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMSummarizer"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

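The `to_dict`/`from_dict` pair enables a serialization round-trip. The toy class below is a hypothetical sketch of the pattern, not the component's actual implementation, and the `{"type": ..., "init_parameters": ...}` layout is an assumption about the serialized shape:

```python
class ToySummarizer:
    """Minimal stand-in showing the to_dict/from_dict round-trip pattern."""

    def __init__(self, summary_detail: float = 0.0, minimum_chunk_size: int = 500):
        self.summary_detail = summary_detail
        self.minimum_chunk_size = minimum_chunk_size

    def to_dict(self) -> dict:
        # Assumed layout: a type marker plus the init parameters.
        return {
            "type": "ToySummarizer",
            "init_parameters": {
                "summary_detail": self.summary_detail,
                "minimum_chunk_size": self.minimum_chunk_size,
            },
        }

    @classmethod
    def from_dict(cls, data: dict) -> "ToySummarizer":
        # Rebuild the instance from the stored init parameters.
        return cls(**data["init_parameters"])

restored = ToySummarizer.from_dict(ToySummarizer(0.5, 800).to_dict())
```

Round-tripping through a dictionary is what allows pipelines containing the component to be saved and reloaded.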
<a id="haystack_experimental.components.summarizers.llm_summarizer.LLMSummarizer.num_tokens"></a>

#### LLMSummarizer.num\_tokens

```python
def num_tokens(text: str) -> int
```

Estimates the token count for a given text.

Uses the RecursiveDocumentSplitter's tokenization logic for consistency.

**Arguments**:

- `text`: The text to tokenize.

**Returns**:

The estimated token count.

<a id="haystack_experimental.components.summarizers.llm_summarizer.LLMSummarizer.summarize"></a>

#### LLMSummarizer.summarize

```python
def summarize(text: str,
              detail: float,
              minimum_chunk_size: int,
              summarize_recursively: bool = False) -> str
```

Summarizes text by splitting it into optimally sized chunks and processing each with an LLM.

**Arguments**:

- `text`: Text to summarize.
- `detail`: Detail level (0-1) where 0 is most concise and 1 is most detailed.
- `minimum_chunk_size`: Minimum token count per chunk.
- `summarize_recursively`: Whether to use previous summaries as context.

**Raises**:

- `ValueError`: If detail is not between 0 and 1.

**Returns**:

The textual content summarized by the LLM.

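The effect of `summarize_recursively` can be illustrated with a small sketch. This is not the actual implementation: `call_llm` is a hypothetical stand-in for the chat generator, and the prompt wiring is only indicative of the chunk-by-chunk loop:

```python
def call_llm(system_prompt: str, text: str) -> str:
    """Placeholder for the chat generator; returns a fake summary."""
    return f"summary({text})"

def summarize_chunks(chunks: list[str], summarize_recursively: bool = False) -> str:
    summaries: list[str] = []
    for chunk in chunks:
        if summarize_recursively and summaries:
            # Feed earlier summaries back in so the model keeps global context.
            context = "\n".join(summaries)
            user_text = f"Previous summaries:\n{context}\n\nText to summarize:\n{chunk}"
        else:
            # Independent mode: each chunk is summarized in isolation.
            user_text = chunk
        summaries.append(call_llm("Rewrite this text in summarized form.", user_text))
    return "\n\n".join(summaries)
```

With `summarize_recursively=False` the chunk summaries are independent and could run in parallel; with `True` each call sees the accumulated summaries, trading speed for coherence across chunks.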
<a id="haystack_experimental.components.summarizers.llm_summarizer.LLMSummarizer.run"></a>

#### LLMSummarizer.run

```python
@component.output_types(summary=list[Document])
def run(*,
        documents: list[Document],
        detail: float | None = None,
        minimum_chunk_size: int | None = None,
        summarize_recursively: bool | None = None,
        system_prompt: str | None = None) -> dict[str, list[Document]]
```

Run the summarizer on a list of documents.

**Arguments**:

- `documents`: List of documents to summarize.
- `detail`: The level of detail for the summary (0-1). If given, overrides the component's default (0).
- `minimum_chunk_size`: The minimum token count per chunk. If given, overrides the component's default (500).
- `system_prompt`: If given, overrides the prompt set at init time or the default one.
- `summarize_recursively`: Whether to use previous summaries as context. If given, overrides the component's default (False).

**Raises**:

- `RuntimeError`: If the component wasn't warmed up.