---
title: "Presidio"
id: integrations-presidio
description: "Presidio integration for Haystack"
slug: "/integrations-presidio"
---


## haystack_integrations.components.extractors.presidio.presidio_entity_extractor

### PresidioEntityExtractor

Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer.

See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details.

Accepts a list of Documents and returns new Documents with detected PII entities stored
in each Document's metadata under the key `"entities"`. Each entry in the list contains
the entity type, start/end character offsets, and the confidence score.

Original Documents are not mutated. Documents without text content are passed through unchanged.

The analyzer engine is loaded on the first call to `run()`,
or by calling `warm_up()` explicitly beforehand.

### Usage example

```python
from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor()
result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")])
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
#  {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
```

#### SPACY_DEFAULT_MODELS

```python
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
```

Mapping from ISO 639-1 language code to the largest available spaCy model for that language.

Used to automatically select an NLP model when `models` is not specified.
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.

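
The fallback from `models` to `SPACY_DEFAULT_MODELS` can be pictured with a small, library-free sketch. The mapping excerpt, the `resolve_models` helper, and the `ValueError` for unmapped languages are all illustrative assumptions, not the integration's actual contents or internals. Only the `lang_code`/`model_name` entry shape comes from the `models` parameter documented for `__init__`.

```python
from __future__ import annotations

# Hypothetical excerpt: the real SPACY_DEFAULT_MODELS contents may differ.
ASSUMED_DEFAULT_MODELS: dict[str, str] = {
    "en": "en_core_web_lg",
    "de": "de_core_news_lg",
    "fr": "fr_core_news_lg",
}


def resolve_models(
    language: str, models: list[dict[str, str]] | None = None
) -> list[dict[str, str]]:
    """Return the spaCy model configuration to load (illustrative only)."""
    if models is not None:
        # An explicit override wins over the built-in mapping.
        return models
    if language not in ASSUMED_DEFAULT_MODELS:
        # Assumed behavior: the real component may handle this case differently.
        raise ValueError(f"No default model for {language!r}; pass `models` explicitly.")
    return [{"lang_code": language, "model_name": ASSUMED_DEFAULT_MODELS[language]}]


print(resolve_models("de"))
print(resolve_models("fr", models=[{"lang_code": "fr", "model_name": "fr_core_news_md"}]))
```

In this sketch the override is returned untouched, which mirrors the documented intent that `models` is only needed for a specific model variant or an unmapped language.
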

#### __init__

```python
__init__(
    *,
    language: str = "en",
    entities: list[str] | None = None,
    score_threshold: float = 0.35,
    models: list[dict[str, str]] | None = None
) -> None
```

Initializes the PresidioEntityExtractor.

**Parameters:**

- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
  For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
  spaCy model is loaded automatically at warm-up time, so there is no need to set `models`.
  For unsupported languages, use the `models` parameter to configure a custom model.
  See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
  If `None`, all supported entity types are detected.
  See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
  See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
  Each entry must contain `"lang_code"` and `"model_name"` keys,
  e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
  Use this only when you need a specific model variant or a language not covered by the
  built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
  based on `language`.

#### warm_up

```python
warm_up() -> None
```

Initializes the Presidio analyzer engine.

This method loads the underlying NLP models. In a Haystack Pipeline,
this is called automatically before the first run.

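
The load-on-first-use behavior (engine created by `warm_up()`, or lazily by the first `run()`) follows a common lazy-initialization pattern. The sketch below is a generic, library-free illustration. `LazyComponent`, `_load_engine`, and `load_count` are hypothetical names, and treating repeated `warm_up()` calls as no-ops is an assumption rather than documented behavior of this integration.

```python
from __future__ import annotations


class LazyComponent:
    """Hypothetical stand-in for a component with an expensive engine."""

    def __init__(self) -> None:
        self._engine: object | None = None
        self.load_count = 0  # counts how often the expensive load happens

    def _load_engine(self) -> object:
        # Placeholder for loading NLP models; here it only creates a dummy object.
        self.load_count += 1
        return object()

    def warm_up(self) -> None:
        # Load only if nothing has been loaded yet (assumed idempotent).
        if self._engine is None:
            self._engine = self._load_engine()

    def run(self, texts: list[str]) -> dict[str, list[str]]:
        # Calling warm_up() here gives the "loaded on first run" behavior.
        self.warm_up()
        return {"texts": list(texts)}


component = LazyComponent()
component.warm_up()      # explicit warm-up pays the loading cost up front
component.run(["hello"])
component.run(["again"])
print(component.load_count)
```

Under these assumptions, warming up explicitly moves the model-loading cost out of the first request without changing the results.
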

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Detects PII entities in the provided Documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to analyze for PII entities.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with key `documents` containing Documents with detected entities
  stored in metadata under the key `"entities"`.

## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner

### PresidioDocumentCleaner

Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/).

Accepts a list of Documents, detects personally identifiable information (PII) in their
text content, and returns new Documents with PII replaced by entity type placeholders
(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated.

Documents without text content are passed through unchanged.

The analyzer and anonymizer engines are loaded on the first call to `run()`,
or by calling `warm_up()` explicitly beforehand.

### Usage example

```python
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")])
print(result["documents"][0].content)
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
```

#### SPACY_DEFAULT_MODELS

```python
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
```

Mapping from ISO 639-1 language code to the largest available spaCy model for that language.

Used to automatically select an NLP model when `models` is not specified.
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.

#### __init__

```python
__init__(
    *,
    language: str = "en",
    entities: list[str] | None = None,
    score_threshold: float = 0.35,
    models: list[dict[str, str]] | None = None
) -> None
```

Initializes the PresidioDocumentCleaner.

**Parameters:**

- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
  For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
  spaCy model is loaded automatically at warm-up time, so there is no need to set `models`.
  For unsupported languages, use the `models` parameter to configure a custom model.
  See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
  If `None`, all supported entity types are used.
  See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
  See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
  Each entry must contain `"lang_code"` and `"model_name"` keys,
  e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
  Use this only when you need a specific model variant or a language not covered by the
  built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
  based on `language`.

#### warm_up

```python
warm_up() -> None
```

Initializes the Presidio analyzer and anonymizer engines.

This method loads the underlying NLP models. In a Haystack Pipeline,
this is called automatically before the first run.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Anonymizes PII in the provided Documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents whose text content will be anonymized.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with key `documents` containing the cleaned Documents.

## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner

### PresidioTextCleaner

Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/).

Accepts a list of strings, detects personally identifiable information (PII), and returns
a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`).
Useful for sanitizing user queries before they are sent to an LLM.

The analyzer and anonymizer engines are loaded on the first call to `run()`,
or by calling `warm_up()` explicitly beforehand.

### Usage example

```python
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner

cleaner = PresidioTextCleaner()
result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"])
print(result["texts"][0])
# Hi, I am <PERSON>, call me at <PHONE_NUMBER>
```

#### SPACY_DEFAULT_MODELS

```python
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
```

Mapping from ISO 639-1 language code to the largest available spaCy model for that language.

Used to automatically select an NLP model when `models` is not specified.
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.

#### __init__

```python
__init__(
    *,
    language: str = "en",
    entities: list[str] | None = None,
    score_threshold: float = 0.35,
    models: list[dict[str, str]] | None = None
) -> None
```

Initializes the PresidioTextCleaner.

**Parameters:**

- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
  For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
  spaCy model is loaded automatically at warm-up time, so there is no need to set `models`.
  For unsupported languages, use the `models` parameter to configure a custom model.
  See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`).
  If `None`, all supported entity types are used.
  See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
  See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
  Each entry must contain `"lang_code"` and `"model_name"` keys,
  e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
  Use this only when you need a specific model variant or a language not covered by the
  built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
  based on `language`.

#### warm_up

```python
warm_up() -> None
```

Initializes the Presidio analyzer and anonymizer engines.

This method loads the underlying NLP models. In a Haystack Pipeline,
this is called automatically before the first run.

#### run

```python
run(texts: list[str]) -> dict[str, list[str]]
```

Anonymizes PII in the provided strings.

**Parameters:**

- **texts** (<code>list\[str\]</code>) – List of strings to anonymize.

**Returns:**

- <code>dict\[str, list\[str\]\]</code> – A dictionary with key `texts` containing the cleaned strings.

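
The entity dictionaries reported by `PresidioEntityExtractor` carry character offsets, which is enough to picture the placeholder substitution that the cleaners describe. The sketch below is a library-free illustration, not Presidio's anonymizer. `replace_with_placeholders` is a hypothetical helper and it assumes the detected spans do not overlap.

```python
from __future__ import annotations


def replace_with_placeholders(text: str, entities: list[dict]) -> str:
    """Replace each detected span with `<ENTITY_TYPE>` (illustrative only)."""
    # Work from the last span to the first so earlier offsets stay valid.
    # Assumes spans do not overlap.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{entity['entity_type']}>"
        text = text[: entity["start"]] + placeholder + text[entity["end"] :]
    return text


text = "Contact Alice at alice@example.com"
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]
print(replace_with_placeholders(text, entities))
# Contact <PERSON> at <EMAIL_ADDRESS>
```

Replacing from right to left avoids recomputing offsets after each substitution, since a placeholder rarely has the same length as the text it replaces.
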