---
title: "Extractors"
id: extractors-api
description: "Components to extract specific elements from textual data."
slug: "/extractors-api"
---

<a id="image/llm_document_content_extractor"></a>

## Module image/llm\_document\_content\_extractor

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor"></a>

### LLMDocumentContentExtractor

Extracts textual content from image-based documents using a vision-enabled LLM (Large Language Model).

This component converts each input document into an image using the DocumentToImageContent component,
uses a prompt to instruct the LLM on how to extract content, and uses a ChatGenerator to extract structured
textual content based on the provided prompt.

The prompt must not contain variables; it should only include instructions for the LLM. Image data and the prompt
are passed together to the LLM as a chat message.

Documents for which the LLM fails to extract content are returned in a separate `failed_documents` list. These
failed documents will have a `content_extraction_error` entry in their metadata. This metadata can be used for
debugging or for reprocessing the documents later.

### Usage example

```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

chat_generator = OpenAIChatGenerator()
extractor = LLMDocumentContentExtractor(chat_generator=chat_generator)
documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]
updated_documents = extractor.run(documents=documents)["documents"]
print(updated_documents)
# [Document(content='Extracted text from image.jpg',
#           meta={'file_path': 'image.jpg'}),
#  ...]
```

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.__init__"></a>

#### LLMDocumentContentExtractor.\_\_init\_\_

```python
def __init__(*,
             chat_generator: ChatGenerator,
             prompt: str = DEFAULT_PROMPT_TEMPLATE,
             file_path_meta_field: str = "file_path",
             root_path: str | None = None,
             detail: Literal["auto", "high", "low"] | None = None,
             size: tuple[int, int] | None = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initialize the LLMDocumentContentExtractor component.

**Arguments**:

- `chat_generator`: A ChatGenerator instance representing the LLM used to extract text. This generator must
support vision-based input and return a plain text response.
- `prompt`: Instructional text provided to the LLM. It must not contain Jinja variables.
The prompt should only contain instructions on how to extract the content of the image-based document.
- `file_path_meta_field`: The metadata field in the Document that contains the file path to the image or PDF.
- `root_path`: The root directory path where document files are located. If provided, file paths in
document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- `detail`: Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
This will be passed to chat_generator when processing the images.
- `size`: If provided, resizes the image to fit within the specified dimensions (width, height) while
maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
when working with models that have resolution constraints or when transmitting images to remote services.
- `raise_on_failure`: If True, exceptions from the LLM are raised. If False, failed documents are logged
and returned.
- `max_workers`: Maximum number of threads used to parallelize LLM calls across documents using a
ThreadPoolExecutor.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.warm_up"></a>

#### LLMDocumentContentExtractor.warm\_up

```python
def warm_up()
```

Warm up the ChatGenerator if it has a warm_up method.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.to_dict"></a>

#### LLMDocumentContentExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.from_dict"></a>

#### LLMDocumentContentExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMDocumentContentExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.run"></a>

#### LLMDocumentContentExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Run content extraction on a list of image-based documents using a vision-capable LLM.

Each document is passed to the LLM along with a predefined prompt. The response is used to update the document's
content. If the extraction fails, the document is returned in the `failed_documents` list with metadata
describing the failure.

**Arguments**:

- `documents`: A list of image-based documents to process. Each must have a valid file path in its metadata.
**Returns**:

A dictionary with:
- "documents": Successfully processed documents, updated with extracted content.
- "failed_documents": Documents that failed processing, annotated with failure metadata.

<a id="llm_metadata_extractor"></a>

## Module llm\_metadata\_extractor

<a id="llm_metadata_extractor.LLMMetadataExtractor"></a>

### LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).

The metadata is extracted by providing a prompt to an LLM that generates the metadata.

This component expects as input a list of documents and a prompt. The prompt should have a variable called
`document` that will point to a single document in the list of documents. So to access the content of the document,
you can use `{{ document.content }}` in the prompt.

The component will run the LLM on each document in the list and extract metadata from the document. The metadata
will be added to the document's metadata field. If the LLM fails to extract metadata from a document, the document
will be added to the `failed_documents` list. The failed documents will have the keys `metadata_extraction_error` and
`metadata_extraction_response` in their metadata. These documents can be re-run with another extractor to
extract metadata by using the `metadata_extraction_response` and `metadata_extraction_error` in the prompt.

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in steps 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library")
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_completion_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "entities": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "entity": {"type": "string"},
                                    "entity_type": {"type": "string"}
                                },
                                "required": ["entity", "entity_type"],
                                "additionalProperties": False
                            }
                        }
                    },
                    "required": ["entities"],
                    "additionalProperties": False
                }
            }
        },
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.warm_up()

extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
             meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
                    {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
             meta: {'entities': [
                    {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
                    {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
             ]})
   ]
   'failed_documents': []
   }
>>
```

<a id="llm_metadata_extractor.LLMMetadataExtractor.__init__"></a>

#### LLMMetadataExtractor.\_\_init\_\_

```python
def __init__(prompt: str,
             chat_generator: ChatGenerator,
             expected_keys: list[str] | None = None,
             page_range: list[str | int] | None = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initializes the LLMMetadataExtractor.

**Arguments**:

- `prompt`: The prompt to be used for the LLM.
- `chat_generator`: a ChatGenerator instance which represents the LLM. In order for the component to work,
the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, you
should pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- `expected_keys`: The keys expected in the JSON output from the LLM.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
metadata from the first and third pages of each document. It also accepts printable range strings, e.g.:
['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
If None, metadata will be extracted from the entire document for each document in the documents list.
This parameter is optional and can be overridden in the `run` method.
- `raise_on_failure`: Whether to raise an error on failure during the execution of the Generator or
validation of the JSON output.
- `max_workers`: The maximum number of workers to use in the thread pool executor.

<a id="llm_metadata_extractor.LLMMetadataExtractor.warm_up"></a>

#### LLMMetadataExtractor.warm\_up

```python
def warm_up()
```

Warm up the LLM provider component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.to_dict"></a>

#### LLMMetadataExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="llm_metadata_extractor.LLMMetadataExtractor.from_dict"></a>

#### LLMMetadataExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMMetadataExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.run"></a>

#### LLMMetadataExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document], page_range: list[str | int] | None = None)
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, the metadata will be extracted from the specified range of pages. This component
will split the documents into pages and extract metadata from the specified range of pages.
The metadata will be
extracted from the entire document if `page_range` is not provided.

The original documents will be returned updated with the extracted metadata.

**Arguments**:

- `documents`: List of documents to extract metadata from.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
metadata from the first and third pages of each document. It also accepts printable range strings, e.g.:
['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
If None, metadata will be extracted from the entire document for each document in the documents list.

**Returns**:

A dictionary with the keys:
- "documents": A list of documents that were successfully updated with the extracted metadata.
- "failed_documents": A list of documents that failed to extract metadata. These documents will have
"metadata_extraction_error" and "metadata_extraction_response" in their metadata. These documents can be
re-run with the extractor to extract metadata.

<a id="named_entity_extractor"></a>

## Module named\_entity\_extractor

<a id="named_entity_extractor.NamedEntityExtractorBackend"></a>

### NamedEntityExtractorBackend

NLP backend to use for Named Entity Recognition.

<a id="named_entity_extractor.NamedEntityExtractorBackend.HUGGING_FACE"></a>

#### HUGGING\_FACE

Uses a Hugging Face model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.SPACY"></a>

#### SPACY

Uses a spaCy model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.from_str"></a>

#### NamedEntityExtractorBackend.from\_str

```python
@staticmethod
def from_str(string: str) -> "NamedEntityExtractorBackend"
```

Convert a string to a NamedEntityExtractorBackend enum.
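As a rough illustration of what such a conversion does, here is a minimal pure-Python sketch of the string-to-enum pattern. The case normalization and error message are assumptions for the sketch, not the actual Haystack implementation; the backend names mirror the `"hugging_face"` value used in the usage example below.

```python
from enum import Enum


class Backend(Enum):
    HUGGING_FACE = "hugging_face"
    SPACY = "spacy"

    @staticmethod
    def from_str(string: str) -> "Backend":
        # Normalize the string and map it onto an enum member,
        # failing loudly on unknown backend names.
        try:
            return Backend(string.lower())
        except ValueError as err:
            raise ValueError(f"Unknown NER backend '{string}'") from err


print(Backend.from_str("spacy"))  # Backend.SPACY
```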
<a id="named_entity_extractor.NamedEntityAnnotation"></a>

### NamedEntityAnnotation

Describes a single NER annotation.

**Arguments**:

- `entity`: Entity label.
- `start`: Start index of the entity in the document.
- `end`: End index of the entity in the document.
- `score`: Score calculated by the model.

<a id="named_entity_extractor.NamedEntityExtractor"></a>

### NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The
former can be used with any sequence classification model from the
[Hugging Face model hub](https://huggingface.co/models), while the
latter can be used with any [spaCy model](https://spacy.io/models)
that contains an NER component. Annotations are stored as metadata
in the documents.

Usage example:
```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```

<a id="named_entity_extractor.NamedEntityExtractor.__init__"></a>

#### NamedEntityExtractor.\_\_init\_\_

```python
def __init__(
    *,
    backend: str | NamedEntityExtractorBackend,
    model: str,
    pipeline_kwargs: dict[str, Any] | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"],
                                               strict=False)
) -> None
```

Create a Named Entity extractor
component.

**Arguments**:

- `backend`: Backend to use for NER.
- `model`: Name of the model or a path to the model on
the local disk. Dependent on the backend.
- `pipeline_kwargs`: Keyword arguments passed to the pipeline. The
pipeline can override these arguments. Dependent on the backend.
- `device`: The device on which the model is loaded. If `None`,
the default device is automatically selected. If a
device/device map is specified in `pipeline_kwargs`,
it overrides this parameter (only applicable to the
HuggingFace backend).
- `token`: The API token to download private models from Hugging Face.

<a id="named_entity_extractor.NamedEntityExtractor.warm_up"></a>

#### NamedEntityExtractor.warm\_up

```python
def warm_up()
```

Initialize the component.

**Raises**:

- `ComponentError`: If the backend fails to initialize successfully.

<a id="named_entity_extractor.NamedEntityExtractor.run"></a>

#### NamedEntityExtractor.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Arguments**:

- `documents`: Documents to process.
- `batch_size`: Batch size used for processing the documents.

**Raises**:

- `ComponentError`: If the backend fails to process a document.

**Returns**:

Processed documents.

<a id="named_entity_extractor.NamedEntityExtractor.to_dict"></a>

#### NamedEntityExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.
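The `to_dict`/`from_dict` pairs documented throughout this page follow a common round-trip pattern: serialize the constructor parameters, then rebuild the component from them. A minimal pure-Python sketch of that pattern follows; the `"type"` and `"init_parameters"` field names are illustrative assumptions for the sketch, not Haystack's guaranteed serialization schema.

```python
class ToyExtractor:
    """Stand-in component used only to illustrate the round-trip pattern."""

    def __init__(self, pattern: str):
        self.pattern = pattern

    def to_dict(self) -> dict:
        # Record the import path plus everything needed to call __init__ again.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"pattern": self.pattern},
        }

    @classmethod
    def from_dict(cls, data: dict) -> "ToyExtractor":
        return cls(**data["init_parameters"])


original = ToyExtractor(pattern=r'<issue url="(.+)">')
restored = ToyExtractor.from_dict(original.to_dict())
assert restored.pattern == original.pattern
```

Because the dictionary contains only constructor inputs, it can be written to YAML or JSON and used to recreate the component later, which is how serialized pipelines are typically persisted.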
<a id="named_entity_extractor.NamedEntityExtractor.from_dict"></a>

#### NamedEntityExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "NamedEntityExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="named_entity_extractor.NamedEntityExtractor.initialized"></a>

#### NamedEntityExtractor.initialized

```python
@property
def initialized() -> bool
```

Returns whether the extractor is ready to annotate text.

<a id="named_entity_extractor.NamedEntityExtractor.get_stored_annotations"></a>

#### NamedEntityExtractor.get\_stored\_annotations

```python
@classmethod
def get_stored_annotations(
        cls, document: Document) -> list[NamedEntityAnnotation] | None
```

Returns the document's named entity annotations stored in its metadata, if any.

**Arguments**:

- `document`: Document whose annotations are to be fetched.

**Returns**:

The stored annotations.

<a id="regex_text_extractor"></a>

## Module regex\_text\_extractor

<a id="regex_text_extractor.RegexTextExtractor"></a>

### RegexTextExtractor

Extracts text from chat message or string input using a regex pattern.

RegexTextExtractor parses input text or ChatMessages using a provided regular expression pattern.
It can be configured to search through all messages or only the last message in a list of ChatMessages.
### Usage example

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

# Using with a string
parser = RegexTextExtractor(regex_pattern='<issue url="(.+)">')
result = parser.run(text_or_messages='<issue url="github.com/hahahaha">hahahah</issue>')
# result: {"captured_text": "github.com/hahahaha"}

# Using with ChatMessages
messages = [ChatMessage.from_user('<issue url="github.com/hahahaha">hahahah</issue>')]
result = parser.run(text_or_messages=messages)
# result: {"captured_text": "github.com/hahahaha"}
```

<a id="regex_text_extractor.RegexTextExtractor.__init__"></a>

#### RegexTextExtractor.\_\_init\_\_

```python
def __init__(regex_pattern: str)
```

Creates an instance of the RegexTextExtractor component.

**Arguments**:

- `regex_pattern`: The regular expression pattern used to extract text.
The pattern should include a capture group to extract the desired text.
Example: `'<issue url="(.+)">'` captures `'github.com/hahahaha'` from `'<issue url="github.com/hahahaha">'`.

<a id="regex_text_extractor.RegexTextExtractor.to_dict"></a>

#### RegexTextExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="regex_text_extractor.RegexTextExtractor.from_dict"></a>

#### RegexTextExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "RegexTextExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.
<a id="regex_text_extractor.RegexTextExtractor.run"></a>

#### RegexTextExtractor.run

```python
@component.output_types(captured_text=str)
def run(text_or_messages: str | list[ChatMessage]) -> dict[str, str]
```

Extracts text from input using the configured regex pattern.

**Arguments**:

- `text_or_messages`: Either a string or a list of ChatMessage objects to search through.

**Raises**:

- `ValueError`: If a list is received and its last element is not a ChatMessage instance.

**Returns**:

- `{"captured_text": "matched text"}` if a match is found
- `{"captured_text": ""}` if no match is found
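The return contract above can be reproduced with the standard `re` module. The following is a minimal sketch of the capture behavior, assuming `re.search` semantics with the first capture group as the extracted text; the component's internals may differ.

```python
import re


def capture(pattern: str, text: str) -> dict:
    # Mirror the documented contract: first capture group on a match,
    # empty string when nothing matches.
    match = re.search(pattern, text)
    return {"captured_text": match.group(1) if match else ""}


print(capture(r'<issue url="(.+)">', '<issue url="github.com/hahahaha">hahahah</issue>'))
# {'captured_text': 'github.com/hahahaha'}
print(capture(r'<issue url="(.+)">', "no markup here"))
# {'captured_text': ''}
```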