---
title: "Readers"
id: readers-api
description: "Takes a query and a set of Documents as input and returns ExtractedAnswers by selecting a text span within the Documents."
slug: "/readers-api"
---

<a id="extractive"></a>

## Module extractive

<a id="extractive.ExtractiveReader"></a>

### ExtractiveReader

Locates and extracts answers to a given query from Documents.

The ExtractiveReader component performs extractive question answering.
It assigns a score to every possible answer span independently of other answer spans.
This avoids a common problem in other implementations, which normalize each document's answers
independently and thereby make scores harder to compare across documents.
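
To see why per-document normalization makes cross-document comparison harder, consider the following minimal sketch. It is not the reader's scoring code: the logits are made-up numbers, and `softmax` and `sigmoid` are only stand-ins for a per-document normalization and an independent per-span score.

```python
import math


def softmax(logits):
    """Normalize logits so they sum to 1 within a single document."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logit):
    """Score a single span without looking at any other span."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical span logits: doc_a contains a strong candidate, doc_b does not.
doc_a = [5.0, 1.0]
doc_b = [-3.0, -4.0]

# Per-document normalization: the best span of doc_b still looks convincing,
# because it only competes with the other weak spans in doc_b.
per_doc_a = max(softmax(doc_a))
per_doc_b = max(softmax(doc_b))

# Independent scoring: each span is judged on its own logit,
# so the weak candidate in doc_b keeps a low score.
independent_a = max(sigmoid(x) for x in doc_a)
independent_b = max(sigmoid(x) for x in doc_b)

print(f"per-document: doc_a={per_doc_a:.3f}, doc_b={per_doc_b:.3f}")
print(f"independent:  doc_a={independent_a:.3f}, doc_b={independent_b:.3f}")
```

With these numbers, the per-document scores of both documents are above 0.7, while the independent score of `doc_b` stays below 0.05, so only the independent scores can be ranked meaningfully across documents.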

Example usage:
```python
from haystack import Document
from haystack.components.readers import ExtractiveReader

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

reader = ExtractiveReader()
reader.warm_up()

question = "What is a popular programming language?"
result = reader.run(query=question, documents=docs)
assert "Python" in result["answers"][0].data
```

<a id="extractive.ExtractiveReader.__init__"></a>

#### ExtractiveReader.\_\_init\_\_

```python
def __init__(model: Path | str = "deepset/roberta-base-squad2-distilled",
             device: ComponentDevice | None = None,
             token: Secret | None = Secret.from_env_var(
                 ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
             top_k: int = 20,
             score_threshold: float | None = None,
             max_seq_length: int = 384,
             stride: int = 128,
             max_batch_size: int | None = None,
             answers_per_seq: int | None = None,
             no_answer: bool = True,
             calibration_factor: float = 0.1,
             overlap_threshold: float | None = 0.01,
             model_kwargs: dict[str, Any] | None = None) -> None
```

Creates an instance of ExtractiveReader.

**Arguments**:

- `model`: A Hugging Face transformers question answering model.
Can either be a path to a folder containing the model files or an identifier for the Hugging Face hub.
- `device`: The device on which the model is loaded. If `None`, the default device is automatically selected.
- `token`: The API token used to download private models from Hugging Face.
- `top_k`: Number of answers to return per query. It is required even if `score_threshold` is set.
An additional answer with no text is returned if `no_answer` is set to `True` (default).
- `score_threshold`: Returns only answers with a probability score above this threshold.
- `max_seq_length`: Maximum number of tokens. If a sequence exceeds it, the sequence is split.
- `stride`: Number of tokens that overlap when a sequence is split because it exceeds `max_seq_length`.
- `max_batch_size`: Maximum number of samples that are fed through the model at the same time.
- `answers_per_seq`: Number of answer candidates to consider per sequence.
This is relevant when a Document was split into multiple sequences because of `max_seq_length`.
- `no_answer`: Whether to return an additional `no answer` with an empty text and a score representing the
probability that the other `top_k` answers are incorrect.
- `calibration_factor`: Factor used for calibrating probabilities.
- `overlap_threshold`: If set, removes duplicate answers whose overlap is larger than the
supplied threshold. For example, for the answers "in the river in Maine" and "the river", one
of the two is removed because the second answer has a 100% (1.0) overlap with the first answer.
However, the answers "the river in" and "in Maine" have a maximum overlap of only 25% (0.25), so
both are kept if the threshold is set to 0.25 or higher.
If `None` is provided, all answers are kept.
- `model_kwargs`: Additional keyword arguments passed to `AutoModelForQuestionAnswering.from_pretrained`
when loading the model specified in `model`. For details on what kwargs you can pass,
see the model's documentation.
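
The interaction between `max_seq_length` and `stride` can be pictured with a short, library-independent sketch. The helper `split_with_stride` is hypothetical and works on a plain list of tokens; it ignores the query and the special tokens that a real tokenizer also has to fit into each sequence.

```python
def split_with_stride(tokens, max_seq_length, stride):
    """Split tokens into windows of at most max_seq_length tokens.

    Consecutive windows share `stride` tokens, so an answer that straddles
    a window boundary still appears whole in at least one window.
    """
    if not 0 <= stride < max_seq_length:
        raise ValueError("stride must be >= 0 and smaller than max_seq_length")
    if len(tokens) <= max_seq_length:
        return [list(tokens)]  # short enough, no split needed

    step = max_seq_length - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_seq_length]))
        if start + max_seq_length >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


# Ten placeholder tokens, windows of four tokens, two tokens of overlap.
windows = split_with_stride(list(range(10)), max_seq_length=4, stride=2)
for window in windows:
    print(window)
```

In this toy setup the ten tokens are split into four windows, and each window repeats the last two tokens of the previous one. `answers_per_seq` then limits how many candidates are considered from each such window.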

<a id="extractive.ExtractiveReader.to_dict"></a>

#### ExtractiveReader.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="extractive.ExtractiveReader.from_dict"></a>

#### ExtractiveReader.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "ExtractiveReader"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="extractive.ExtractiveReader.warm_up"></a>

#### ExtractiveReader.warm\_up

```python
def warm_up()
```

Initializes the component.

<a id="extractive.ExtractiveReader.deduplicate_by_overlap"></a>

#### ExtractiveReader.deduplicate\_by\_overlap

```python
def deduplicate_by_overlap(
        answers: list[ExtractedAnswer],
        overlap_threshold: float | None) -> list[ExtractedAnswer]
```

De-duplicates overlapping Extractive Answers.

De-duplicates overlapping Extractive Answers from the same document based on how much the spans of the
answers overlap.

**Arguments**:

- `answers`: List of answers to be deduplicated.
- `overlap_threshold`: If set, removes duplicate answers whose overlap is larger than the
supplied threshold. For example, for the answers "in the river in Maine" and "the river", one
of the two is removed because the second answer has a 100% (1.0) overlap with the first answer.
However, the answers "the river in" and "in Maine" have a maximum overlap of only 25% (0.25), so
both are kept if the threshold is set to 0.25 or higher.
If `None` is provided, all answers are kept.

**Returns**:

List of deduplicated answers.
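
The two examples in the `overlap_threshold` description can be reproduced with a small sketch. It assumes that overlap is measured in characters and divided by the length of each answer span, which matches the 100% and 25% figures given there. The `Span` class and both helper functions are hypothetical and not part of the component's API.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Character offsets of an answer inside its document (end is exclusive)."""
    start: int
    end: int


def max_overlap_fraction(a, b):
    """Overlap length relative to each span, returning the larger fraction.

    Assumes both spans are non-empty and come from the same document.
    """
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start))
    return max(overlap / (a.end - a.start), overlap / (b.end - b.start))


def deduplicate(spans, overlap_threshold):
    """Keep spans in the given order, dropping any that overlap a kept span
    by more than overlap_threshold. None disables de-duplication."""
    if overlap_threshold is None:
        return list(spans)
    kept = []
    for candidate in spans:
        if all(max_overlap_fraction(candidate, k) <= overlap_threshold for k in kept):
            kept.append(candidate)
    return kept


text = "in the river in Maine"
whole = Span(0, 21)         # "in the river in Maine"
the_river = Span(3, 12)     # "the river"
the_river_in = Span(3, 15)  # "the river in"
in_maine = Span(13, 21)     # "in Maine"

print(max_overlap_fraction(whole, the_river))        # full containment
print(max_overlap_fraction(the_river_in, in_maine))  # only "in" is shared
print(deduplicate([the_river_in, in_maine], overlap_threshold=0.01))
```

Under these assumptions, "the river" overlaps "in the river in Maine" completely, while "the river in" and "in Maine" share only the two characters of "in", which is a quarter of the shorter span. With the default threshold of 0.01, the second span of that pair is dropped.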

<a id="extractive.ExtractiveReader.run"></a>

#### ExtractiveReader.run

```python
@component.output_types(answers=list[ExtractedAnswer])
def run(query: str,
        documents: list[Document],
        top_k: int | None = None,
        score_threshold: float | None = None,
        max_seq_length: int | None = None,
        stride: int | None = None,
        max_batch_size: int | None = None,
        answers_per_seq: int | None = None,
        no_answer: bool | None = None,
        overlap_threshold: float | None = None)
```

Locates and extracts answers from the given Documents using the given query.

**Arguments**:

- `query`: Query string.
- `documents`: List of Documents in which you want to search for an answer to the query.
- `top_k`: The maximum number of answers to return.
An additional answer is returned if `no_answer` is set to `True` (default).
- `score_threshold`: Returns only answers with a score above this threshold.
- `max_seq_length`: Maximum number of tokens. If a sequence exceeds it, the sequence is split.
- `stride`: Number of tokens that overlap when a sequence is split because it exceeds `max_seq_length`.
- `max_batch_size`: Maximum number of samples that are fed through the model at the same time.
- `answers_per_seq`: Number of answer candidates to consider per sequence.
This is relevant when a Document was split into multiple sequences because of `max_seq_length`.
- `no_answer`: Whether to return an additional `no answer` with an empty text and a score representing the
probability that the other `top_k` answers are incorrect.
- `overlap_threshold`: If set, removes duplicate answers whose overlap is larger than the
supplied threshold. For example, for the answers "in the river in Maine" and "the river", one
of the two is removed because the second answer has a 100% (1.0) overlap with the first answer.
However, the answers "the river in" and "in Maine" have a maximum overlap of only 25% (0.25), so
both are kept if the threshold is set to 0.25 or higher.
If `None` is provided, all answers are kept.

**Returns**:

A dictionary whose `answers` key holds the list of answers, sorted by answer score in descending order.
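
How `top_k`, `no_answer`, and `score_threshold` interact can be sketched in plain Python. The function below is an illustration, not the component's implementation: `select_answers` is a hypothetical helper, answers are simplified to `(text, score)` tuples, and the order of the steps is an assumption. The no-answer score is computed as the product of `1 - score` over the selected answers, which is one way to express the probability that all of them are incorrect when the scores are treated as independent.

```python
import math


def select_answers(candidates, top_k, no_answer=True, score_threshold=None):
    """Pick the best answers from a list of (text, score) tuples."""
    # Sort by score, best first, and keep at most top_k candidates.
    selected = sorted(candidates, key=lambda a: a[1], reverse=True)[:top_k]

    if no_answer:
        # Probability that every selected answer is wrong, assuming independence.
        no_answer_score = math.prod(1.0 - score for _, score in selected)
        selected.append(("", no_answer_score))
        selected.sort(key=lambda a: a[1], reverse=True)

    if score_threshold is not None:
        # "Above the threshold" is read as strictly greater here (an assumption).
        selected = [a for a in selected if a[1] > score_threshold]
    return selected


candidates = [("Python", 0.5), ("programming language", 0.5), ("popular", 0.1)]

answers = select_answers(candidates, top_k=2)
filtered = select_answers(candidates, top_k=2, score_threshold=0.3)

print(answers)
print(filtered)
```

With `top_k=2` and `no_answer` enabled, the first call returns three entries: the two best candidates plus the empty no-answer entry. The second call drops the no-answer entry because its score falls below the threshold.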