---
title: "Extractors"
id: extractors-api
description: "Components to extract specific elements from textual data."
slug: "/extractors-api"
---

## image/llm_document_content_extractor

### LLMDocumentContentExtractor

Extracts textual content and, optionally, metadata from image-based documents using a vision-enabled LLM.

One prompt and one LLM call per document. The component converts each document to an image via
DocumentToImageContent and sends it to the ChatGenerator. The prompt must not contain Jinja variables.

Response handling:

- If the LLM returns a **plain string** (non-JSON or not a JSON object), it is written to the document's content.
- If the LLM returns a **JSON object with only the key** `document_content`, that value is written to content.
- If the LLM returns a **JSON object with multiple keys**, the value of `document_content` (if present) is
  written to content and all other keys are merged into the document's metadata.

The ChatGenerator can be configured to return JSON (e.g. `response_format={"type": "json_object"}`
in `generation_kwargs`).

Documents that fail extraction are returned in `failed_documents` with a `content_extraction_error` key in their metadata.

### Usage example

```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

prompt = """
Extract the content from the provided image and format it as markdown.
Return only the extracted content as a JSON object with the key 'document_content'.
Do not wrap the output in markdown fences or code blocks; return only raw JSON.

Extract metadata about the image, such as its source and date of creation, if you can.
Return this metadata as additional key-value pairs in the same JSON object.
"""

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "document_content": {"type": "string"},
                        "author": {"type": "string"},
                        "date": {"type": "string"},
                        "document_type": {"type": "string"},
                        "title": {"type": "string"},
                    },
                    "additionalProperties": False,
                },
            },
        }
    }
)

extractor = LLMDocumentContentExtractor(
    chat_generator=chat_generator,
    file_path_meta_field="file_path",
    raise_on_failure=False,
)

documents = [
    Document(content="", meta={"file_path": "test/test_files/images/image_metadata.png"}),
    Document(content="", meta={"file_path": "test/test_files/images/apple.jpg", "page_number": 1}),
]
result = extractor.run(documents=documents)
updated_documents = result["documents"]
```
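Documents that could not be processed can be inspected through their failure metadata. A minimal sketch, continuing the example above (`result` and the `file_path` metadata field come from that example):

```python
# Continuing the example above: report which files failed and why.
# `content_extraction_error` is added to metadata by the component on failure.
for doc in result["failed_documents"]:
    print(doc.meta["file_path"], "->", doc.meta["content_extraction_error"])
```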
#### __init__

```python
__init__(
    *,
    chat_generator: ChatGenerator,
    prompt: str = DEFAULT_PROMPT_TEMPLATE,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3
) -> None
```

Initialize the LLMDocumentContentExtractor component.

**Parameters:**

- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator that supports vision input. Optionally configured for JSON
  output (e.g. `response_format={"type": "json_object"}` in `generation_kwargs`).
- **prompt** (<code>str</code>) – Prompt for extraction. Must not contain Jinja variables.
- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata are resolved relative to this path. If None, file paths are treated as absolute.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within (width, height) while keeping the aspect ratio.
- **raise_on_failure** (<code>bool</code>) – If True, exceptions from the LLM are raised. If False, failed documents are returned.
- **max_workers** (<code>int</code>) – Maximum number of threads for parallel LLM calls.

#### warm_up

```python
warm_up() -> None
```

Warm up the ChatGenerator if it has a warm_up method.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMDocumentContentExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMDocumentContentExtractor</code> – An instance of the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Run extraction on image-based documents. One LLM call per document.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of image-based documents to process. Each must have a valid file path in its metadata.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with "documents" (successfully processed) and "failed_documents" (with failure metadata).

## llm_metadata_extractor

### LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).

The metadata is extracted by providing a prompt to an LLM that generates it.

This component expects a list of documents and a prompt as input. The prompt must contain a variable called
`document`, which points to a single document in the list; use `{{ document.content }}` in the prompt to access
the document's content.

The component runs the LLM on each document in the list and adds the extracted metadata to the document's
metadata field. Documents for which extraction fails are added to the `failed_documents` list, with the keys
`metadata_extraction_error` and `metadata_extraction_response` in their metadata. These documents can be re-run
with another extractor that uses `metadata_extraction_response` and `metadata_extraction_error` in its prompt.
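Because failed documents carry both keys, a follow-up extractor can show the LLM what went wrong. A minimal sketch (the retry prompt wording is illustrative; `extractor`, `chat_generator`, and `docs` come from the usage example below):

```python
# Illustrative retry: expose the previous error to the LLM via the document's
# metadata (Jinja resolves document.meta.metadata_extraction_error).
RETRY_PROMPT = """
Extract metadata from the text below and return it as a JSON object.
A previous attempt failed with this error: {{ document.meta.metadata_extraction_error }}
Return only valid JSON this time.

text: {{ document.content }}
"""

retry_extractor = LLMMetadataExtractor(prompt=RETRY_PROMPT, chat_generator=chat_generator)
failed = extractor.run(documents=docs)["failed_documents"]
retried = retry_extractor.run(documents=failed)
```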
### Usage example

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON object like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library"),
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_completion_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "entities": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "entity": {"type": "string"},
                                    "entity_type": {"type": "string"},
                                },
                                "required": ["entity", "entity_type"],
                                "additionalProperties": False,
                            },
                        }
                    },
                    "required": ["entities"],
                    "additionalProperties": False,
                },
            },
        },
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.run(documents=docs)
# >> {'documents': [
#      Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
#        meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
#               {'entity': 'Haystack', 'entity_type': 'product'}]}),
#      Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
#        meta: {'entities': [
#               {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
#               {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
#        ]})
#    ],
#    'failed_documents': []
# }
```

#### __init__

```python
__init__(
    prompt: str,
    chat_generator: ChatGenerator,
    expected_keys: list[str] | None = None,
    page_range: list[str | int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3,
) -> None
```

Initializes the LLMMetadataExtractor.

**Parameters:**

- **prompt** (<code>str</code>) – The prompt to be used for the LLM.
- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator instance that represents the LLM. For the component to work,
  the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator,
  pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- **expected_keys** (<code>list\[str\] | None</code>) – The keys expected in the JSON output from the LLM.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, `page_range=['1', '3']` extracts
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.
  `['1-3', '5', '8', '10-12']` extracts metadata from pages 1, 2, 3, 5, 8, 10, 11, and 12.
  If None, metadata is extracted from the entire document, for each document in the list.
  This parameter is optional and can be overridden in the `run` method.
- **raise_on_failure** (<code>bool</code>) – Whether to raise an error on failure during the execution of the Generator or
  the validation of the JSON output.
- **max_workers** (<code>int</code>) – The maximum number of workers to use in the thread pool executor. This limits the
  number of requests allowed to run concurrently when using the `run_async` method.

#### warm_up

```python
warm_up() -> None
```

Warm up the LLM provider component.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMMetadataExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMMetadataExtractor</code> – An instance of the component.

#### run

```python
run(
    documents: list[Document], page_range: list[str | int] | None = None
) -> dict[str, Any]
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, the component splits each document into pages and extracts metadata only from the
specified range of pages; otherwise, metadata is extracted from the entire document.

The original documents are returned updated with the extracted metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of documents to extract metadata from.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, `page_range=['1', '3']` extracts
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.
  `['1-3', '5', '8', '10-12']` extracts metadata from pages 1, 2, 3, 5, 8, 10, 11, and 12.
  If None, metadata is extracted from the entire document, for each document in the list.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the keys:
  - "documents": A list of documents that were successfully updated with the extracted metadata.
  - "failed_documents": A list of documents for which metadata extraction failed. These documents have
    "metadata_extraction_error" and "metadata_extraction_response" in their metadata and can be re-run with
    the extractor to extract metadata.
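A minimal `page_range` sketch, assuming page breaks are marked with form-feed characters (`\f`), as Haystack converters and splitters produce, and reusing `extractor` and `Document` from the usage example above:

```python
# Hypothetical two-page document; "\f" marks the page break.
report = Document(content="Q1 revenue grew 12% at Acme Corp.\fForward-looking statements follow.")

# Extract metadata from the first page only.
result = extractor.run(documents=[report], page_range=["1"])
```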
#### run_async

```python
run_async(
    documents: list[Document], page_range: list[str | int] | None = None
) -> dict[str, Any]
```

Asynchronously extract metadata from documents using a Large Language Model.

If `page_range` is provided, the component splits each document into pages and extracts metadata only from the
specified range of pages; otherwise, metadata is extracted from the entire document.

The original documents are returned updated with the extracted metadata.

This is the asynchronous version of the `run` method. It has the same parameters and return values
but can be used with `await` in async code.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of documents to extract metadata from.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, `page_range=['1', '3']` extracts
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.
  `['1-3', '5', '8', '10-12']` extracts metadata from pages 1, 2, 3, 5, 8, 10, 11, and 12.
  If None, metadata is extracted from the entire document, for each document in the list.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the keys:
  - "documents": A list of documents that were successfully updated with the extracted metadata.
  - "failed_documents": A list of documents for which metadata extraction failed. These documents have
    "metadata_extraction_error" and "metadata_extraction_response" in their metadata and can be re-run with
    the extractor to extract metadata.

## named_entity_extractor

### NamedEntityExtractorBackend

Bases: <code>Enum</code>

NLP backend to use for Named Entity Recognition.

#### from_str

```python
from_str(string: str) -> NamedEntityExtractorBackend
```

Convert a string to a NamedEntityExtractorBackend enum.

### NamedEntityAnnotation

Describes a single NER annotation.

**Parameters:**

- **entity** (<code>str</code>) – Entity label.
- **start** (<code>int</code>) – Start index of the entity in the document.
- **end** (<code>int</code>) – End index of the entity in the document.
- **score** (<code>float | None</code>) – Score calculated by the model.

### NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The former can be used with any sequence
classification model from the [Hugging Face model hub](https://huggingface.co/models), while the latter can be
used with any [spaCy model](https://spacy.io/models) that contains an NER component. Annotations are stored as
metadata in the documents.

Usage example:

<!-- test-ignore -->

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```
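The spaCy backend is selected the same way. A minimal sketch, assuming the `en_core_web_sm` model (which ships with an NER component) is installed:

```python
# Same API with the spaCy backend; install the model first with:
#   python -m spacy download en_core_web_sm
extractor = NamedEntityExtractor(backend="spacy", model="en_core_web_sm")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
```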
#### __init__

```python
__init__(
    *,
    backend: str | NamedEntityExtractorBackend,
    model: str,
    pipeline_kwargs: dict[str, Any] | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    )
) -> None
```

Create a Named Entity extractor component.

**Parameters:**

- **backend** (<code>str | NamedEntityExtractorBackend</code>) – Backend to use for NER.
- **model** (<code>str</code>) – Name of the model or a path to the model on the local disk. Dependent on the backend.
- **pipeline_kwargs** (<code>dict\[str, Any\] | None</code>) – Keyword arguments passed to the pipeline. The
  pipeline can override these arguments. Dependent on the backend.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`,
  the default device is automatically selected. If a device/device map is specified in `pipeline_kwargs`,
  it overrides this parameter (only applicable to the HuggingFace backend).
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.

#### warm_up

```python
warm_up() -> None
```

Initialize the component.

**Raises:**

- <code>ComponentError</code> – If the backend fails to initialize successfully.

#### run

```python
run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to process.
- **batch_size** (<code>int</code>) – Batch size used for processing the documents.

**Returns:**

- <code>dict\[str, Any\]</code> – Processed documents.

**Raises:**

- <code>ComponentError</code> – If the backend fails to process a document.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> NamedEntityExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>NamedEntityExtractor</code> – Deserialized component.

#### initialized

```python
initialized: bool
```

Returns whether the extractor is ready to annotate text.

#### get_stored_annotations

```python
get_stored_annotations(
    document: Document,
) -> list[NamedEntityAnnotation] | None
```

Returns the document's named entity annotations stored in its metadata, if any.

**Parameters:**

- **document** (<code>Document</code>) – Document whose annotations are to be fetched.

**Returns:**

- <code>list\[NamedEntityAnnotation\] | None</code> – The stored annotations.
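Since `start` and `end` index into the document's content, the entity surface text can be recovered by slicing. A minimal sketch, continuing the usage example above:

```python
# Continuing the usage example: print each entity label with its surface text.
for doc in results:
    for ann in NamedEntityExtractor.get_stored_annotations(doc) or []:
        print(ann.entity, "->", doc.content[ann.start : ann.end])
```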
## regex_text_extractor

### RegexTextExtractor

Extracts text from chat messages or string input using a regex pattern.

RegexTextExtractor parses input text or ChatMessages using a provided regular expression pattern.
It can be configured to search through all messages or only the last message in a list of ChatMessages.

### Usage example

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

# Using with a string
parser = RegexTextExtractor(regex_pattern='<issue url="(.+)">')
result = parser.run(text_or_messages='<issue url="github.com/hahahaha">hahahah</issue>')
# result: {"captured_text": "github.com/hahahaha"}

# Using with ChatMessages
messages = [ChatMessage.from_user('<issue url="github.com/hahahaha">hahahah</issue>')]
result = parser.run(text_or_messages=messages)
# result: {"captured_text": "github.com/hahahaha"}
```

#### __init__

```python
__init__(regex_pattern: str) -> None
```

Creates an instance of the RegexTextExtractor component.

**Parameters:**

- **regex_pattern** (<code>str</code>) – The regular expression pattern used to extract text.
  The pattern should include a capture group to extract the desired text.
  Example: `'<issue url="(.+)">'` captures `'github.com/hahahaha'` from `'<issue url="github.com/hahahaha">'`.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> RegexTextExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>RegexTextExtractor</code> – The deserialized component.

#### run

```python
run(text_or_messages: str | list[ChatMessage]) -> dict[str, str]
```

Extracts text from the input using the configured regex pattern.

**Parameters:**

- **text_or_messages** (<code>str | list\[ChatMessage\]</code>) – Either a string or a list of ChatMessage objects to search through.

**Returns:**

- <code>dict\[str, str\]</code> – - `{"captured_text": "matched text"}` if a match is found
  - `{"captured_text": ""}` if no match is found

**Raises:**

- <code>TypeError</code> – If the input is a list and its last element is not a ChatMessage instance.
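A common use is pulling a tagged answer out of an LLM reply inside a pipeline. A minimal sketch, assuming an OpenAI API key is configured; the `<answer>` tag convention is illustrative, not part of the component:

```python
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.extractors import RegexTextExtractor
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

# The prompt asks the LLM to wrap its answer in <answer> tags; the extractor's
# capture group then pulls the answer back out of the reply.
template = [ChatMessage.from_user("Answer inside <answer></answer> tags: {{ question }}")]

pipe = Pipeline()
pipe.add_component("builder", ChatPromptBuilder(template=template))
pipe.add_component("llm", OpenAIChatGenerator())
pipe.add_component("extractor", RegexTextExtractor(regex_pattern=r"<answer>(.+?)</answer>"))
pipe.connect("builder.prompt", "llm.messages")
pipe.connect("llm.replies", "extractor.text_or_messages")

out = pipe.run(data={"builder": {"question": "What is the capital of France?"}})
print(out["extractor"]["captured_text"])
```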