readers_api.md
---
title: "Readers"
id: readers-api
description: "Takes a query and a set of Documents as input and returns ExtractedAnswers by selecting a text span within the Documents."
slug: "/readers-api"
---


## extractive

### ExtractiveReader

Locates and extracts answers to a given query from Documents.

The ExtractiveReader component performs extractive question answering.
It assigns a score to every possible answer span independently of other answer spans.
This avoids a common issue in other implementations, which normalize each document's answers independently
and thereby make scores harder to compare across documents.

Example usage:

```python
from haystack import Document
from haystack.components.readers import ExtractiveReader

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

reader = ExtractiveReader()

question = "What is a popular programming language?"
result = reader.run(query=question, documents=docs)
assert "Python" in result["answers"][0].data
```

#### __init__

```python
__init__(
    model: Path | str = "deepset/roberta-base-squad2-distilled",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    top_k: int = 20,
    score_threshold: float | None = None,
    max_seq_length: int = 384,
    stride: int = 128,
    max_batch_size: int | None = None,
    answers_per_seq: int | None = None,
    no_answer: bool = True,
    calibration_factor: float = 0.1,
    overlap_threshold: float | None = 0.01,
    model_kwargs: dict[str, Any] | None = None,
) -> None
```

Creates an instance of ExtractiveReader.

**Parameters:**

- **model** (<code>Path | str</code>) – A Hugging Face transformers question answering model.
  Can either be a path to a folder containing the model files or an identifier for the Hugging Face hub.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`, the default device is automatically selected.
- **token** (<code>Secret | None</code>) – The API token used to download private models from Hugging Face.
- **top_k** (<code>int</code>) – Number of answers to return per query. It is required even if score_threshold is set.
  An additional answer with no text is returned if no_answer is set to True (default).
- **score_threshold** (<code>float | None</code>) – Returns only answers with the probability score above this threshold.
- **max_seq_length** (<code>int</code>) – Maximum number of tokens. If a sequence exceeds it, the sequence is split.
- **stride** (<code>int</code>) – Number of tokens that overlap when a sequence is split because it exceeds max_seq_length.
- **max_batch_size** (<code>int | None</code>) – Maximum number of samples that are fed through the model at the same time.
- **answers_per_seq** (<code>int | None</code>) – Number of answer candidates to consider per sequence.
  This is relevant when a Document was split into multiple sequences because of max_seq_length.
- **no_answer** (<code>bool</code>) – Whether to return an additional `no answer` with an empty text and a score representing the
  probability that the other top_k answers are incorrect.
- **calibration_factor** (<code>float</code>) – Factor used for calibrating probabilities.
- **overlap_threshold** (<code>float | None</code>) – If set, duplicate answers are removed when their overlap is larger than the
  supplied threshold. For example, for the answers "in the river in Maine" and "the river", one of the two
  is removed because the second answer has a 100% (1.0) overlap with the first answer.
  However, for the answers "the river in" and "in Maine", the maximum overlap is only 25% (0.25), so
  both of these answers can be kept if this value is set to 0.25 or higher.
  If None is provided, all answers are kept.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments passed to `AutoModelForQuestionAnswering.from_pretrained`
  when loading the model specified in `model`. For details on what kwargs you can pass,
  see the model's documentation.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> ExtractiveReader
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>ExtractiveReader</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### deduplicate_by_overlap

```python
deduplicate_by_overlap(
    answers: list[ExtractedAnswer], overlap_threshold: float | None
) -> list[ExtractedAnswer]
```

De-duplicates overlapping Extractive Answers.

De-duplicates overlapping Extractive Answers from the same document based on how much the spans of the
answers overlap.

**Parameters:**

- **answers** (<code>list\[ExtractedAnswer\]</code>) – List of answers to be deduplicated.
- **overlap_threshold** (<code>float | None</code>) – If set, duplicate answers are removed when their overlap is larger than the
  supplied threshold. For example, for the answers "in the river in Maine" and "the river", one of the two
  is removed because the second answer has a 100% (1.0) overlap with the first answer.
  However, for the answers "the river in" and "in Maine", the maximum overlap is only 25% (0.25), so
  both of these answers can be kept if this value is set to 0.25 or higher.
  If None is provided, all answers are kept.

**Returns:**

- <code>list\[ExtractedAnswer\]</code> – List of deduplicated answers.

#### run

```python
run(
    query: str,
    documents: list[Document],
    top_k: int | None = None,
    score_threshold: float | None = None,
    max_seq_length: int | None = None,
    stride: int | None = None,
    max_batch_size: int | None = None,
    answers_per_seq: int | None = None,
    no_answer: bool | None = None,
    overlap_threshold: float | None = None,
) -> dict[str, Any]
```

Locates and extracts answers from the given Documents using the given query.

**Parameters:**

- **query** (<code>str</code>) – Query string.
- **documents** (<code>list\[Document\]</code>) – List of Documents in which you want to search for an answer to the query.
- **top_k** (<code>int | None</code>) – The maximum number of answers to return.
  An additional answer is returned if no_answer is set to True (default).
- **score_threshold** (<code>float | None</code>) – Returns only answers with the score above this threshold.
- **max_seq_length** (<code>int | None</code>) – Maximum number of tokens. If a sequence exceeds it, the sequence is split.
- **stride** (<code>int | None</code>) – Number of tokens that overlap when a sequence is split because it exceeds max_seq_length.
- **max_batch_size** (<code>int | None</code>) – Maximum number of samples that are fed through the model at the same time.
- **answers_per_seq** (<code>int | None</code>) – Number of answer candidates to consider per sequence.
  This is relevant when a Document was split into multiple sequences because of max_seq_length.
- **no_answer** (<code>bool | None</code>) – Whether to return no answer scores.
- **overlap_threshold** (<code>float | None</code>) – If set, duplicate answers are removed when their overlap is larger than the
  supplied threshold. For example, for the answers "in the river in Maine" and "the river", one of the two
  is removed because the second answer has a 100% (1.0) overlap with the first answer.
  However, for the answers "the river in" and "in Maine", the maximum overlap is only 25% (0.25), so
  both of these answers can be kept if this value is set to 0.25 or higher.
  If None is provided, all answers are kept.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with an `answers` key that holds the list of answers sorted by answer score in descending order.
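
The `overlap_threshold` rule used by `deduplicate_by_overlap` and `run` can be illustrated with a small standalone sketch. It is not the library's implementation: `Span`, `span_of`, `max_overlap_fraction`, and `dedupe` are hypothetical names, and the sketch assumes that overlap is measured in characters, that the larger of the two per-answer fractions is compared against the threshold, and that answers are kept greedily in input order.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an answer's character offsets in one document."""

    text: str
    start: int  # inclusive character offset
    end: int  # exclusive character offset


def span_of(context: str, text: str) -> Span:
    """Build a Span from the first occurrence of `text` in `context`."""
    start = context.index(text)
    return Span(text, start, start + len(text))


def max_overlap_fraction(a: Span, b: Span) -> float:
    """Shared characters divided by each span's length; return the larger fraction."""
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start))
    if overlap == 0:
        return 0.0
    return max(overlap / (a.end - a.start), overlap / (b.end - b.start))


def dedupe(spans: list[Span], threshold: float | None) -> list[Span]:
    """Drop any span that overlaps an already kept span by more than `threshold`."""
    if threshold is None:
        return list(spans)  # None disables deduplication
    kept: list[Span] = []
    for candidate in spans:
        if all(max_overlap_fraction(candidate, k) <= threshold for k in kept):
            kept.append(candidate)
    return kept


context = "in the river in Maine"
full = span_of(context, "in the river in Maine")
the_river = span_of(context, "the river")
the_river_in = span_of(context, "the river in")
in_maine = span_of(context, "in Maine")

print(max_overlap_fraction(full, the_river))  # 1.0: "the river" lies fully inside
print(max_overlap_fraction(the_river_in, in_maine))  # 0.25: 2 shared chars / 8
print([s.text for s in dedupe([the_river_in, in_maine], 0.01)])  # one answer left
print([s.text for s in dedupe([the_river_in, in_maine], 0.5)])  # both kept
```

In this sketch the first answer in the list always wins, so passing answers sorted by descending score would keep the higher-scoring answer of each overlapping pair.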