---
title: "Presidio"
id: integrations-presidio
description: "Presidio integration for Haystack"
slug: "/integrations-presidio"
---


## haystack_integrations.components.extractors.presidio.presidio_entity_extractor

### PresidioEntityExtractor

Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer.

See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details.

Accepts a list of Documents and returns new Documents with detected PII entities stored
in each Document's metadata under the key `"entities"`. Each entry in the list contains
the entity type, start/end character offsets, and the confidence score.

Original Documents are not mutated. Documents without text content are passed through unchanged.

The analyzer engine is loaded on the first call to `run()`,
or by calling `warm_up()` explicitly beforehand.

### Usage example

```python
from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor()
result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")])
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
#  {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
```
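
The `start` and `end` values are character offsets into the Document's content, so the matched text can be recovered by slicing. The following is a minimal sketch, assuming the offsets are half-open (`end` is exclusive), which is consistent with the example output above. `entity_texts` is a hypothetical helper for illustration and is not part of the integration.

```python
def entity_texts(content: str, entities: list[dict]) -> list[tuple[str, str]]:
    """Pair each entity type with the substring its offsets point to."""
    return [(e["entity_type"], content[e["start"]:e["end"]]) for e in entities]


content = "Contact Alice at alice@example.com"
# Entity list copied from the usage example output above.
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]
pairs = entity_texts(content, entities)
print(pairs)
# [('PERSON', 'Alice'), ('EMAIL_ADDRESS', 'alice@example.com')]
```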

#### SPACY_DEFAULT_MODELS

```python
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
```

Mapping from ISO 639-1 language code to the largest available spaCy model for that language.

Used to automatically select an NLP model when `models` is not specified.
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.
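
The selection rule described here can be sketched in plain Python. This is an illustration under stated assumptions, not the component's actual code: `EXAMPLE_DEFAULT_MODELS` is a hypothetical four-language subset (the real mapping's contents are not listed on this page), `resolve_models` is a made-up helper name, and raising `ValueError` for an unmapped language is an assumed behavior.

```python
from __future__ import annotations

# Illustrative subset only; the real SPACY_DEFAULT_MODELS contents are not shown here.
EXAMPLE_DEFAULT_MODELS: dict[str, str] = {
    "en": "en_core_web_lg",
    "de": "de_core_news_lg",
    "fr": "fr_core_news_lg",
    "es": "es_core_news_lg",
}


def resolve_models(
    language: str, models: list[dict[str, str]] | None = None
) -> list[dict[str, str]]:
    """Return explicit models if given, otherwise look the language up in the mapping."""
    if models is not None:
        return models  # an explicit override always wins
    if language not in EXAMPLE_DEFAULT_MODELS:
        # Assumed failure mode: ask the caller to supply `models` explicitly.
        raise ValueError(f"No default spaCy model for {language!r}; pass `models`.")
    return [{"lang_code": language, "model_name": EXAMPLE_DEFAULT_MODELS[language]}]


print(resolve_models("de"))
# [{'lang_code': 'de', 'model_name': 'de_core_news_lg'}]
```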

#### __init__

```python
__init__(
    *,
    language: str = "en",
    entities: list[str] | None = None,
    score_threshold: float = 0.35,
    models: list[dict[str, str]] | None = None
) -> None
```

Initializes the PresidioEntityExtractor.

**Parameters:**

- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
  For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
  spaCy model is loaded automatically at warm-up time, so there is no need to set `models`.
  For unsupported languages, use the `models` parameter to configure a custom model.
  See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
  If `None`, all supported entity types are detected.
  See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
  See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
  Each entry must contain `"lang_code"` and `"model_name"` keys,
  e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
  Use this only when you need a specific model variant or a language not covered by the
  built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
  based on `language`.
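
To make the combined effect of `entities` and `score_threshold` concrete, here is a rough sketch of the filtering they describe. In practice Presidio applies these filters internally. `filter_detections` is a hypothetical helper, the `URL` detection is made-up sample data, and treating the threshold as inclusive is an assumption.

```python
from __future__ import annotations


def filter_detections(
    detections: list[dict],
    entities: list[str] | None = None,
    score_threshold: float = 0.35,
) -> list[dict]:
    """Keep detections of the requested types whose score meets the threshold."""
    return [
        d
        for d in detections
        if d["score"] >= score_threshold  # assumed inclusive comparison
        and (entities is None or d["entity_type"] in entities)
    ]


detections = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
    {"entity_type": "URL", "start": 23, "end": 34, "score": 0.3},  # made-up low-score hit
]
print([d["entity_type"] for d in filter_detections(detections)])
# ['PERSON', 'EMAIL_ADDRESS']
print([d["entity_type"] for d in filter_detections(detections, entities=["PERSON"])])
# ['PERSON']
```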

#### warm_up

```python
warm_up() -> None
```

Initializes the Presidio analyzer engine.

This method loads the underlying NLP models. In a Haystack Pipeline,
this is called automatically before the first run.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Detects PII entities in the provided Documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to analyze for PII entities.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with key `documents` containing Documents with detected entities
  stored in metadata under the key `"entities"`.
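
The two guarantees stated for this component (originals are not mutated, and Documents without text pass through unchanged) can be pictured with a small stand-in. This is only a sketch: `Doc` is a simplified substitute for `haystack.Document`, `attach_entities` and `fake_detect` are hypothetical names, and treating an empty string like `None` is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace


@dataclass
class Doc:
    """Minimal stand-in for haystack.Document (content and meta only)."""

    content: str | None = None
    meta: dict = field(default_factory=dict)


def attach_entities(docs: list[Doc], detect) -> list[Doc]:
    """Return new Docs with detections under meta['entities']; inputs stay untouched."""
    out = []
    for doc in docs:
        if not doc.content:  # None or empty (assumption): pass through as-is
            out.append(doc)
            continue
        new_meta = {**doc.meta, "entities": detect(doc.content)}
        out.append(replace(doc, meta=new_meta))  # copy with a fresh meta dict
    return out


def fake_detect(text: str) -> list[dict]:
    """Placeholder detector returning one hard-coded entity."""
    return [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}]


original = Doc(content="Contact Alice", meta={"source": "crm"})
empty = Doc(content=None)
result = attach_entities([original, empty], fake_detect)
print("entities" in original.meta, "entities" in result[0].meta, result[1] is empty)
# False True True
```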

## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner

### PresidioDocumentCleaner

Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/).

Accepts a list of Documents, detects personally identifiable information (PII) in their
text content, and returns new Documents with PII replaced by entity type placeholders
(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated.

Documents without text content are passed through unchanged.

The analyzer and anonymizer engines are loaded on the first call to `run()`,
or by calling `warm_up()` explicitly beforehand.

### Usage example

```python
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")])
print(result["documents"][0].content)
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
```
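
Conceptually, placeholder replacement amounts to substituting each detected span with its entity type in angle brackets. The sketch below shows that idea in plain Python, assuming the spans do not overlap. The real work is done by Presidio's anonymizer, which also resolves overlapping detections. `replace_with_placeholders` is a hypothetical helper and the spans are hand-written for illustration.

```python
def replace_with_placeholders(text: str, entities: list[dict]) -> str:
    """Replace each detected span with <ENTITY_TYPE>, working right to left."""
    # Going from the last span to the first keeps the earlier offsets valid.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"<{e['entity_type']}>" + text[e["end"] :]
    return text


text = "My name is John and my email is john@example.com"
# Hand-written spans; real spans would come from the analyzer.
spans = [
    {"entity_type": "PERSON", "start": 11, "end": 15},
    {"entity_type": "EMAIL_ADDRESS", "start": 32, "end": 48},
]
cleaned = replace_with_placeholders(text, spans)
print(cleaned)
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
```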

#### SPACY_DEFAULT_MODELS

```python
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
```

Mapping from ISO 639-1 language code to the largest available spaCy model for that language.

Used to automatically select an NLP model when `models` is not specified.
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.

#### __init__

```python
__init__(
    *,
    language: str = "en",
    entities: list[str] | None = None,
    score_threshold: float = 0.35,
    models: list[dict[str, str]] | None = None
) -> None
```

Initializes the PresidioDocumentCleaner.

**Parameters:**

- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
  For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
  spaCy model is loaded automatically at warm-up time, so there is no need to set `models`.
  For unsupported languages, use the `models` parameter to configure a custom model.
  See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
  If `None`, all supported entity types are used.
  See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
  See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
  Each entry must contain `"lang_code"` and `"model_name"` keys,
  e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
  Use this only when you need a specific model variant or a language not covered by the
  built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
  based on `language`.
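
The `models` entries use the same `lang_code` / `model_name` shape that Presidio's `NlpEngineProvider` accepts in its configuration dictionary. The sketch below validates the entries and wraps them in such a configuration. Whether the component builds its configuration exactly this way is an assumption, and `build_nlp_configuration` is a hypothetical helper name.

```python
def build_nlp_configuration(models: list[dict[str, str]]) -> dict:
    """Validate `models` entries and wrap them in a spaCy engine configuration."""
    required = {"lang_code", "model_name"}
    for entry in models:
        missing = required - set(entry)
        if missing:
            raise ValueError(f"Model entry {entry!r} is missing keys: {sorted(missing)}")
    # Shape of the dictionary Presidio's NlpEngineProvider takes as nlp_configuration.
    return {"nlp_engine_name": "spacy", "models": list(models)}


config = build_nlp_configuration([{"lang_code": "fr", "model_name": "fr_core_news_md"}])
print(config)
# {'nlp_engine_name': 'spacy', 'models': [{'lang_code': 'fr', 'model_name': 'fr_core_news_md'}]}
```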

#### warm_up

```python
warm_up() -> None
```

Initializes the Presidio analyzer and anonymizer engines.

This method loads the underlying NLP models. In a Haystack Pipeline,
this is called automatically before the first run.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Anonymizes PII in the provided Documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents whose text content will be anonymized.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with key `documents` containing the cleaned Documents.

## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner

### PresidioTextCleaner

Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/).

Accepts a list of strings, detects personally identifiable information (PII), and returns
a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`).
Useful for sanitizing user queries before they are sent to an LLM.

The analyzer and anonymizer engines are loaded on the first call to `run()`,
or by calling `warm_up()` explicitly beforehand.

### Usage example

```python
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner

cleaner = PresidioTextCleaner()
result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"])
print(result["texts"][0])
# Hi, I am <PERSON>, call me at <PHONE_NUMBER>
```

#### SPACY_DEFAULT_MODELS

```python
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
```

Mapping from ISO 639-1 language code to the largest available spaCy model for that language.

Used to automatically select an NLP model when `models` is not specified.
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.

#### __init__

```python
__init__(
    *,
    language: str = "en",
    entities: list[str] | None = None,
    score_threshold: float = 0.35,
    models: list[dict[str, str]] | None = None
) -> None
```

Initializes the PresidioTextCleaner.

**Parameters:**

- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
  For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
  spaCy model is loaded automatically at warm-up time, so there is no need to set `models`.
  For unsupported languages, use the `models` parameter to configure a custom model.
  See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`).
  If `None`, all supported entity types are used.
  See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
  See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
  Each entry must contain `"lang_code"` and `"model_name"` keys,
  e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
  Use this only when you need a specific model variant or a language not covered by the
  built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
  based on `language`.

#### warm_up

```python
warm_up() -> None
```

Initializes the Presidio analyzer and anonymizer engines.

This method loads the underlying NLP models. In a Haystack Pipeline,
this is called automatically before the first run.
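
The load-on-first-use behavior described for `warm_up()` and `run()` follows a common lazy-initialization pattern, sketched below with a stand-in class. `LazyComponent` is hypothetical, `str.upper` stands in for the real engines, and making repeated `warm_up()` calls a no-op is an assumption about the component rather than documented behavior.

```python
class LazyComponent:
    """Stand-in component that defers an expensive load until first use."""

    def __init__(self) -> None:
        self._engine = None  # nothing heavy is loaded at construction time
        self.load_count = 0  # counts how often the "engine" was actually loaded

    def warm_up(self) -> None:
        if self._engine is not None:
            return  # already loaded; repeated calls do nothing
        self.load_count += 1
        self._engine = str.upper  # placeholder for real analyzer/anonymizer engines

    def run(self, texts: list[str]) -> dict[str, list[str]]:
        self.warm_up()  # lazy load on the first call
        return {"texts": [self._engine(t) for t in texts]}


component = LazyComponent()
component.run(texts=["a"])
component.warm_up()
component.run(texts=["b"])
print(component.load_count)
# 1
```

Calling `warm_up()` explicitly at startup moves the model-loading cost out of the first request, which can matter for latency-sensitive services.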

#### run

```python
run(texts: list[str]) -> dict[str, list[str]]
```

Anonymizes PII in the provided strings.

**Parameters:**

- **texts** (<code>list\[str\]</code>) – List of strings to anonymize.

**Returns:**

- <code>dict\[str, list\[str\]\]</code> – A dictionary with key `texts` containing the cleaned strings.