---
title: "Readers"
id: readers-api
description: "Takes a query and a set of Documents as input and returns ExtractedAnswers by selecting a text span within the Documents."
slug: "/readers-api"
---


## extractive

### ExtractiveReader

Locates and extracts answers to a given query from Documents.

The ExtractiveReader component performs extractive question answering.
It scores every candidate answer span independently of all other answer spans.
This avoids a common problem in other implementations, which normalize each document's answers separately
and thereby make scores hard to compare across documents.

Example usage:

```python
from haystack import Document
from haystack.components.readers import ExtractiveReader

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

reader = ExtractiveReader()

question = "What is a popular programming language?"
result = reader.run(query=question, documents=docs)
assert "Python" in result["answers"][0].data
```

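The effect of independent scoring versus per-document normalization can be illustrated with a small, self-contained sketch. It does not use the ExtractiveReader implementation: the logits are made up, and `per_document_softmax` and `independent_scores` (a sigmoid over scaled logits) are hypothetical helpers chosen only to show why normalizing each document separately hides differences between documents.

```python
import math


def per_document_softmax(logits: list[float]) -> list[float]:
    """Normalize the span logits of ONE document so that they sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def independent_scores(logits: list[float], factor: float = 0.1) -> list[float]:
    """Score each span on its own with a sigmoid over the scaled logit (assumed form)."""
    return [1.0 / (1.0 + math.exp(-factor * x)) for x in logits]


# Made-up span logits: document A has a strong candidate, document B only weak ones.
doc_a = [12.0, 3.0]
doc_b = [1.0, -8.0]

softmax_a, softmax_b = per_document_softmax(doc_a), per_document_softmax(doc_b)
sigmoid_a, sigmoid_b = independent_scores(doc_a), independent_scores(doc_b)

# Per-document normalization: both top spans receive the same score.
print(softmax_a[0], softmax_b[0])
# Independent scoring: the strong candidate still scores higher than the weak one.
print(sigmoid_a[0], sigmoid_b[0])
```

In this toy setup the softmax only sees the gap between spans inside one document, so the weak top span in document B looks as confident as the strong one in document A. Scoring each span on its own keeps the two comparable.
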
#### __init__

```python
__init__(
    model: Path | str = "deepset/roberta-base-squad2-distilled",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    top_k: int = 20,
    score_threshold: float | None = None,
    max_seq_length: int = 384,
    stride: int = 128,
    max_batch_size: int | None = None,
    answers_per_seq: int | None = None,
    no_answer: bool = True,
    calibration_factor: float = 0.1,
    overlap_threshold: float | None = 0.01,
    model_kwargs: dict[str, Any] | None = None,
) -> None
```

Creates an instance of ExtractiveReader.

**Parameters:**

- **model** (<code>Path | str</code>) – A Hugging Face transformers question answering model.
  Can be either a path to a folder containing the model files or a model identifier on the Hugging Face Hub.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`, the default device is automatically selected.
- **token** (<code>Secret | None</code>) – The API token used to download private models from Hugging Face.
- **top_k** (<code>int</code>) – Number of answers to return per query. It is required even if `score_threshold` is set.
  An additional answer with no text is returned if `no_answer` is set to `True` (default).
- **score_threshold** (<code>float | None</code>) – Returns only answers with a probability score above this threshold.
- **max_seq_length** (<code>int</code>) – Maximum number of tokens per sequence. If a sequence exceeds it, the sequence is split.
- **stride** (<code>int</code>) – Number of tokens that overlap between consecutive sequences when a sequence is split because it exceeds `max_seq_length`.
- **max_batch_size** (<code>int | None</code>) – Maximum number of samples that are fed through the model at the same time.
- **answers_per_seq** (<code>int | None</code>) – Number of answer candidates to consider per sequence.
  This is relevant when a Document was split into multiple sequences because of `max_seq_length`.
- **no_answer** (<code>bool</code>) – Whether to return an additional `no answer` with an empty text and a score representing the
  probability that the other `top_k` answers are incorrect.
- **calibration_factor** (<code>float</code>) – Factor used for calibrating probabilities.
- **overlap_threshold** (<code>float | None</code>) – If set, removes duplicate answers whose overlap is larger than the
  supplied threshold. For example, for the answers "in the river in Maine" and "the river", one of the two
  is removed because the second answer has a 100% (1.0) overlap with the first answer.
  For the answers "the river in" and "in Maine", the maximum overlap is only 25%, so
  both answers are kept if this value is set to 0.25 or higher.
  If `None`, all answers are kept.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments passed to `AutoModelForQuestionAnswering.from_pretrained`
  when loading the model specified in `model`. For details on what kwargs you can pass,
  see the model's documentation.

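The interaction between `max_seq_length` and `stride` can be sketched in plain Python. This is a minimal illustration, not the reader's tokenizer logic: `split_into_windows` is a hypothetical helper that works on a ready-made token list, and it ignores the query and special tokens, which in a real tokenizer presumably also count toward the limit.

```python
def split_into_windows(tokens: list[str], max_seq_length: int, stride: int) -> list[list[str]]:
    """Split tokens into windows of at most max_seq_length that share `stride` tokens."""
    if not 0 <= stride < max_seq_length:
        raise ValueError("stride must be non-negative and smaller than max_seq_length")
    if len(tokens) <= max_seq_length:
        return [tokens]  # short enough, no split needed
    step = max_seq_length - stride  # how far each new window advances
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_seq_length])
        if start + max_seq_length >= len(tokens):
            break  # this window already reaches the end of the sequence
    return windows


tokens = [f"t{i}" for i in range(10)]
windows = split_into_windows(tokens, max_seq_length=4, stride=2)
for window in windows:
    print(window)
```

Each window repeats the last `stride` tokens of the previous one, so an answer that straddles a split point still appears whole in at least one window, provided it is no longer than `stride + 1` tokens.
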
#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> ExtractiveReader
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>ExtractiveReader</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component by loading the model and tokenizer.

#### deduplicate_by_overlap

```python
deduplicate_by_overlap(
    answers: list[ExtractedAnswer], overlap_threshold: float | None
) -> list[ExtractedAnswer]
```

De-duplicates overlapping extracted answers.

Answers from the same document are de-duplicated based on how much their spans overlap.

**Parameters:**

- **answers** (<code>list\[ExtractedAnswer\]</code>) – List of answers to be deduplicated.
- **overlap_threshold** (<code>float | None</code>) – If set, removes duplicate answers whose overlap is larger than the
  supplied threshold. For example, for the answers "in the river in Maine" and "the river", one of the two
  is removed because the second answer has a 100% (1.0) overlap with the first answer.
  For the answers "the river in" and "in Maine", the maximum overlap is only 25%, so
  both answers are kept if this value is set to 0.25 or higher.
  If `None`, all answers are kept.

**Returns:**

- <code>list\[ExtractedAnswer\]</code> – List of deduplicated answers.

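The overlap percentages in the `overlap_threshold` example can be reproduced with a short sketch. It assumes that overlap is measured on character offsets and expressed as a fraction of each span's own length, with the larger of the two fractions compared against the threshold. `overlap_fractions`, `is_duplicate`, and `span_of` are hypothetical helpers, not part of the ExtractiveReader API.

```python
from typing import Optional

Span = tuple[int, int]  # (start, end) character offsets, end exclusive


def overlap_fractions(span_a: Span, span_b: Span) -> tuple[float, float]:
    """Return the shared length as a fraction of each (non-empty) span's length."""
    shared = max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))
    return shared / (span_a[1] - span_a[0]), shared / (span_b[1] - span_b[0])


def is_duplicate(span_a: Span, span_b: Span, overlap_threshold: Optional[float]) -> bool:
    """Treat two spans as duplicates if either fraction is larger than the threshold."""
    if overlap_threshold is None:
        return False  # no threshold: every answer is kept
    return max(overlap_fractions(span_a, span_b)) > overlap_threshold


def span_of(text: str, answer: str) -> Span:
    """Locate the first occurrence of `answer` in `text`."""
    start = text.find(answer)
    return start, start + len(answer)


text = "in the river in Maine"
whole = span_of(text, "in the river in Maine")
inner = span_of(text, "the river")
left = span_of(text, "the river in")
right = span_of(text, "in Maine")

print(max(overlap_fractions(whole, inner)))  # "the river" lies fully inside the longer answer
print(max(overlap_fractions(left, right)))   # the two answers share only the word "in"
```

Under these assumptions the first pair has a maximum overlap of 1.0 and the second pair a maximum overlap of 0.25, matching the figures in the parameter description.
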
#### run

```python
run(
    query: str,
    documents: list[Document],
    top_k: int | None = None,
    score_threshold: float | None = None,
    max_seq_length: int | None = None,
    stride: int | None = None,
    max_batch_size: int | None = None,
    answers_per_seq: int | None = None,
    no_answer: bool | None = None,
    overlap_threshold: float | None = None,
) -> dict[str, Any]
```

Locates and extracts answers from the given Documents using the given query.
Optional parameters left as `None` fall back to the values set at initialization.

**Parameters:**

- **query** (<code>str</code>) – Query string.
- **documents** (<code>list\[Document\]</code>) – List of Documents in which you want to search for an answer to the query.
- **top_k** (<code>int | None</code>) – The maximum number of answers to return.
  An additional answer is returned if `no_answer` is set to `True` (default).
- **score_threshold** (<code>float | None</code>) – Returns only answers with a score above this threshold.
- **max_seq_length** (<code>int | None</code>) – Maximum number of tokens per sequence. If a sequence exceeds it, the sequence is split.
- **stride** (<code>int | None</code>) – Number of tokens that overlap between consecutive sequences when a sequence is split because it exceeds `max_seq_length`.
- **max_batch_size** (<code>int | None</code>) – Maximum number of samples that are fed through the model at the same time.
- **answers_per_seq** (<code>int | None</code>) – Number of answer candidates to consider per sequence.
  This is relevant when a Document was split into multiple sequences because of `max_seq_length`.
- **no_answer** (<code>bool | None</code>) – Whether to return an additional `no answer` with an empty text and a score representing the
  probability that the other `top_k` answers are incorrect.
- **overlap_threshold** (<code>float | None</code>) – If set, removes duplicate answers whose overlap is larger than the
  supplied threshold. For example, for the answers "in the river in Maine" and "the river", one of the two
  is removed because the second answer has a 100% (1.0) overlap with the first answer.
  For the answers "the river in" and "in Maine", the maximum overlap is only 25%, so
  both answers are kept if this value is set to 0.25 or higher.
  If `None`, all answers are kept.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with the key `answers`, which holds a list of ExtractedAnswers sorted by score in descending order.
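
How `top_k`, `no_answer`, and `score_threshold` might combine when the final answer list is assembled can be sketched as follows. This is an illustration under stated assumptions, not the component's code: `select_answers` is a hypothetical helper, answers are plain `(text, score)` tuples, the scores are treated as independent probabilities so that the `no answer` score is the product of `1 - score` over the kept answers, and the threshold is assumed to apply after the `no answer` entry is added.

```python
import math
from typing import Optional

Answer = tuple[Optional[str], float]  # (text, score); text is None for "no answer"


def select_answers(
    candidates: list[Answer],
    top_k: int,
    score_threshold: Optional[float] = None,
    no_answer: bool = True,
) -> list[Answer]:
    """Keep the top_k answers, optionally add a "no answer" entry, then filter by score."""
    ranked = sorted(candidates, key=lambda a: a[1], reverse=True)[:top_k]
    if no_answer:
        # Probability that every kept answer is wrong, assuming independent scores.
        no_answer_score = math.prod(1.0 - score for _, score in ranked)
        ranked.append((None, no_answer_score))
        ranked.sort(key=lambda a: a[1], reverse=True)
    if score_threshold is not None:
        ranked = [a for a in ranked if a[1] > score_threshold]
    return ranked


candidates = [("language", 0.1), ("Python", 0.8), ("programming", 0.5)]

all_answers = select_answers(candidates, top_k=2)
confident = select_answers(candidates, top_k=2, score_threshold=0.3)

for text, score in all_answers:
    print(text, round(score, 3))
```

With `top_k=2` the sketch returns three entries, the two best answers plus the `no answer` entry. Adding `score_threshold=0.3` drops the low-scoring `no answer` entry again.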