---
title: "Readers"
id: readers-api
description: "Takes a query and a set of Documents as input and returns ExtractedAnswers by selecting a text span within the Documents."
slug: "/readers-api"
---

<a id="extractive"></a>

## Module extractive

<a id="extractive.ExtractiveReader"></a>

### ExtractiveReader

Locates and extracts answers to a given query from Documents.

The ExtractiveReader component performs extractive question answering.
It assigns a score to every possible answer span independently of other answer spans.
This avoids a common problem in other implementations, which normalize each document's answers independently
and thereby make scores harder to compare across documents.

Example usage:
```python
from haystack import Document
from haystack.components.readers import ExtractiveReader

# Two candidate documents; only the first is in the language of the query
docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

# Load the default question answering model before running the component
reader = ExtractiveReader()
reader.warm_up()

question = "What is a popular programming language?"
result = reader.run(query=question, documents=docs)
# Answers are sorted by score, so the best answer comes first
assert "Python" in result["answers"][0].data
```

<a id="extractive.ExtractiveReader.__init__"></a>

#### ExtractiveReader.\_\_init\_\_

```python
def __init__(model: Path | str = "deepset/roberta-base-squad2-distilled",
             device: ComponentDevice | None = None,
             token: Secret | None = Secret.from_env_var(
                 ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
             top_k: int = 20,
             score_threshold: float | None = None,
             max_seq_length: int = 384,
             stride: int = 128,
             max_batch_size: int | None = None,
             answers_per_seq: int | None = None,
             no_answer: bool = True,
             calibration_factor: float = 0.1,
             overlap_threshold: float | None = 0.01,
             model_kwargs: dict[str, Any] | None = None) -> None
```

Creates an instance of ExtractiveReader.

**Arguments**:

- `model`: A Hugging Face transformers question answering model.
Can either be a path to a folder containing the model files or an identifier for the Hugging Face hub.
- `device`: The device on which the model is loaded. If `None`, the default device is automatically selected.
- `token`: The API token used to download private models from Hugging Face.
- `top_k`: Number of answers to return per query. It is required even if `score_threshold` is set.
An additional answer with no text is returned if `no_answer` is set to `True` (default).
- `score_threshold`: Returns only answers with a probability score above this threshold.
- `max_seq_length`: Maximum number of tokens. If a sequence exceeds it, the sequence is split.
- `stride`: Number of tokens that overlap when a sequence is split because it exceeds `max_seq_length`.
- `max_batch_size`: Maximum number of samples that are fed through the model at the same time.
- `answers_per_seq`: Number of answer candidates to consider per sequence.
This is relevant when a Document was split into multiple sequences because of `max_seq_length`.
- `no_answer`: Whether to return an additional `no answer` with an empty text and a score representing the
probability that the other `top_k` answers are incorrect.
- `calibration_factor`: Factor used for calibrating probabilities.
- `overlap_threshold`: If set, removes duplicate answers whose overlap is larger than the
supplied threshold. For example, for the answers "in the river in Maine" and "the river", one of them is
removed because the second answer has a 100% (1.0) overlap with the first answer.
However, the answers "the river in" and "in Maine" have a maximum overlap of only 25%, so
both are kept if this variable is set to 0.25 or higher.
If `None` is provided, all answers are kept.
- `model_kwargs`: Additional keyword arguments passed to `AutoModelForQuestionAnswering.from_pretrained`
when loading the model specified in `model`. For details on what kwargs you can pass,
see the model's documentation.

<a id="extractive.ExtractiveReader.to_dict"></a>

#### ExtractiveReader.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="extractive.ExtractiveReader.from_dict"></a>

#### ExtractiveReader.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "ExtractiveReader"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="extractive.ExtractiveReader.warm_up"></a>

#### ExtractiveReader.warm\_up

```python
def warm_up()
```

Initializes the component.

<a id="extractive.ExtractiveReader.deduplicate_by_overlap"></a>

#### ExtractiveReader.deduplicate\_by\_overlap

```python
def deduplicate_by_overlap(
        answers: list[ExtractedAnswer],
        overlap_threshold: float | None) -> list[ExtractedAnswer]
```

De-duplicates overlapping Extractive Answers.

De-duplicates overlapping Extractive Answers from the same document based on how much the spans of the
answers overlap.

**Arguments**:

- `answers`: List of answers to be deduplicated.
- `overlap_threshold`: If set, removes duplicate answers whose overlap is larger than the
supplied threshold. For example, for the answers "in the river in Maine" and "the river", one of them is
removed because the second answer has a 100% (1.0) overlap with the first answer.
However, the answers "the river in" and "in Maine" have a maximum overlap of only 25%, so
both are kept if this variable is set to 0.25 or higher.
If `None` is provided, all answers are kept.

**Returns**:

List of deduplicated answers.

<a id="extractive.ExtractiveReader.run"></a>

#### ExtractiveReader.run

```python
@component.output_types(answers=list[ExtractedAnswer])
def run(query: str,
        documents: list[Document],
        top_k: int | None = None,
        score_threshold: float | None = None,
        max_seq_length: int | None = None,
        stride: int | None = None,
        max_batch_size: int | None = None,
        answers_per_seq: int | None = None,
        no_answer: bool | None = None,
        overlap_threshold: float | None = None)
```

Locates and extracts answers from the given Documents using the given query.

**Arguments**:

- `query`: Query string.
- `documents`: List of Documents in which you want to search for an answer to the query.
- `top_k`: The maximum number of answers to return.
An additional answer is returned if `no_answer` is set to `True` (default).
- `score_threshold`: Returns only answers with a score above this threshold.
- `max_seq_length`: Maximum number of tokens. If a sequence exceeds it, the sequence is split.
- `stride`: Number of tokens that overlap when a sequence is split because it exceeds `max_seq_length`.
- `max_batch_size`: Maximum number of samples that are fed through the model at the same time.
- `answers_per_seq`: Number of answer candidates to consider per sequence.
This is relevant when a Document was split into multiple sequences because of `max_seq_length`.
- `no_answer`: Whether to return an additional `no answer` with an empty text and its score.
- `overlap_threshold`: If set, removes duplicate answers whose overlap is larger than the
supplied threshold.
For example, for the answers "in the river in Maine" and "the river", one of them is
removed because the second answer has a 100% (1.0) overlap with the first answer.
However, the answers "the river in" and "in Maine" have a maximum overlap of only 25%, so
both are kept if this variable is set to 0.25 or higher.
If `None` is provided, all answers are kept.

**Returns**:

List of answers sorted by answer score in descending order.
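
The `overlap_threshold` rule ("remove answers whose overlap is larger than the threshold") can be illustrated with a small standalone sketch. This is not the ExtractiveReader implementation: `Span`, `overlap_fraction`, and `dedupe` are hypothetical names introduced here, and the sketch assumes that overlap is measured in characters relative to each span's own length and that higher-scoring answers are considered first. Under that rule, a 25% overlap survives a threshold of 0.25 but not one of 0.24.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in for an answer: character offsets [start, end) in one document."""
    text: str
    start: int
    end: int
    score: float


def overlap_fraction(a: Span, b: Span) -> float:
    """Return the larger share of either span that is covered by the other."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return max(overlap / (a.end - a.start), overlap / (b.end - b.start))


def dedupe(spans: list, overlap_threshold: Optional[float]) -> list:
    """Keep spans in descending score order, skipping any span whose
    overlap with an already kept span is larger than the threshold."""
    if overlap_threshold is None:
        return list(spans)  # no threshold: keep every answer
    kept = []
    for candidate in sorted(spans, key=lambda s: s.score, reverse=True):
        if all(overlap_fraction(candidate, k) <= overlap_threshold for k in kept):
            kept.append(candidate)
    return kept


# The examples from the parameter description, as offsets into one document
doc = "in the river in Maine"
full = Span(doc[0:21], 0, 21, 0.9)       # "in the river in Maine"
the_river = Span(doc[3:12], 3, 12, 0.8)  # "the river"
river_in = Span(doc[3:15], 3, 15, 0.7)   # "the river in"
in_maine = Span(doc[13:21], 13, 21, 0.6) # "in Maine"

both_kept = dedupe([river_in, in_maine], overlap_threshold=0.25)
one_kept = dedupe([river_in, in_maine], overlap_threshold=0.24)
```

Here `both_kept` contains both spans, while `one_kept` contains only the higher-scoring `river_in`; passing `None` as the threshold returns all spans unchanged.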