---
title: "Extractors"
id: extractors-api
description: "Components to extract specific elements from textual data."
slug: "/extractors-api"
---

## image/llm_document_content_extractor

### LLMDocumentContentExtractor

Extracts textual content and optionally metadata from image-based documents using a vision-enabled LLM.

One prompt and one LLM call per document. The component converts each document to an image via
DocumentToImageContent and sends it to the ChatGenerator. The prompt must not contain Jinja variables.

Response handling:

- If the LLM returns a **plain string** (non-JSON or not a JSON object), it is written to the document's content.
- If the LLM returns a **JSON object with only the key** `document_content`, that value is written to content.
- If the LLM returns a **JSON object with multiple keys**, the value of `document_content` (if present) is
  written to content and all other keys are merged into the document's metadata.

The ChatGenerator can be configured to return JSON (e.g. `response_format={"type": "json_object"}`
in `generation_kwargs`).

Documents that fail extraction are returned in `failed_documents` with `content_extraction_error` in metadata.
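As an illustration of the multi-key case, here is a hypothetical reply and the resulting document fields (sketch only; every key other than `document_content` is invented for this example):

```python
# Hypothetical JSON reply from the vision LLM for one document:
reply = {
    "document_content": "# Quarterly report\n\nRevenue grew 12% ...",
    "author": "ACME Corp",
    "date": "2024-05-01",
}

# "document_content" becomes the document's content; the remaining keys are merged
# into the document's metadata, so afterwards roughly:
#   doc.content == reply["document_content"]
#   doc.meta["author"] == "ACME Corp"
#   doc.meta["date"] == "2024-05-01"
```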
### Usage example

```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

prompt = """
Extract the content from the provided image.
Format everything as markdown. Return only the extracted content as a JSON object with the key 'document_content'.
No markdown, no code fence, only raw JSON.

Extract metadata about the image like source of the image, date of creation, etc. if you can.
Return this metadata as additional key-value pairs in the same JSON object.
"""

# Configure the ChatGenerator to return a JSON object whose schema mirrors the keys
# requested in the prompt above.
chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "document_content": {"type": "string"},
                        "author": {"type": "string"},
                        "date": {"type": "string"},
                        "document_type": {"type": "string"},
                        "title": {"type": "string"},
                    },
                    "additionalProperties": False,
                },
            },
        }
    }
)

extractor = LLMDocumentContentExtractor(
    chat_generator=chat_generator,
    prompt=prompt,
)

documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]
result = extractor.run(documents=documents)
updated_documents = result["documents"]
```

#### __init__

```python
__init__(
    *,
    chat_generator: ChatGenerator,
    prompt: str = DEFAULT_PROMPT_TEMPLATE,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3
) -> None
```

Initialize the LLMDocumentContentExtractor component.

**Parameters:**

- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator that supports vision input. Optionally configured for JSON
  (e.g. `response_format={"type": "json_object"}` in `generation_kwargs`).
- **prompt** (<code>str</code>) – Prompt for extraction. Must not contain Jinja variables.
- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within (width, height) while keeping aspect ratio.
- **raise_on_failure** (<code>bool</code>) – If True, exceptions from the LLM are raised. If False, failed documents are returned.
- **max_workers** (<code>int</code>) – Maximum number of threads for parallel LLM calls.

#### warm_up

```python
warm_up() -> None
```

Warm up the ChatGenerator if it has a warm_up method.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMDocumentContentExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMDocumentContentExtractor</code> – An instance of the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Run extraction on image-based documents. One LLM call per document.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of image-based documents to process. Each must have a valid file path in its metadata.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with "documents" (successfully processed) and "failed_documents" (with failure metadata).
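Continuing the usage example above, failed documents can be inspected through the documented `content_extraction_error` metadata key (a minimal sketch):

```python
result = extractor.run(documents=documents)
updated_documents = result["documents"]

# Documents that could not be processed are returned separately; the error is
# recorded under the "content_extraction_error" metadata key.
for failed in result["failed_documents"]:
    print(failed.meta.get("file_path"), failed.meta.get("content_extraction_error"))
```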
## llm_metadata_extractor

### LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).

The metadata is extracted by providing a prompt to an LLM that generates the metadata.

This component expects as input a list of documents and a prompt. The prompt should have a variable called
`document` that will point to a single document in the list of documents. So to access the content of the document,
you can use `{{ document.content }}` in the prompt.

The component will run the LLM on each document in the list and extract metadata from the document. The metadata
will be added to the document's metadata field. If the LLM fails to extract metadata from a document, the document
will be added to the `failed_documents` list. The failed documents will have the keys `metadata_extraction_error` and
`metadata_extraction_response` in their metadata. These documents can be re-run with another extractor to
extract metadata by using the `metadata_extraction_response` and `metadata_extraction_error` in the prompt.

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library")
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_completion_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "entities": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "entity": {"type": "string"},
                                    "entity_type": {"type": "string"}
                                },
                                "required": ["entity", "entity_type"],
                                "additionalProperties": False
                            }
                        }
                    },
                    "required": ["entities"],
                    "additionalProperties": False
                }
            }
        },
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.run(documents=docs)
# >> {'documents': [
#      Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
#        meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
#               {'entity': 'Haystack', 'entity_type': 'product'}]}),
#      Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
#        meta: {'entities': [
#               {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
#               {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
#        ]})
#      ],
#     'failed_documents': []
#    }
```

#### __init__

```python
__init__(
    prompt: str,
    chat_generator: ChatGenerator,
    expected_keys: list[str] | None = None,
    page_range: list[str | int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3,
) -> None
```

Initializes the LLMMetadataExtractor.

**Parameters:**

- **prompt** (<code>str</code>) – The prompt to be used for the LLM.
- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator instance which represents the LLM. In order for the component to work,
  the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, you
  should pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- **expected_keys** (<code>list\[str\] | None</code>) – The keys expected in the JSON output from the LLM.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.:
  ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
  If None, metadata will be extracted from the entire document for each document in the documents list.
  This parameter is optional and can be overridden in the `run` method.
- **raise_on_failure** (<code>bool</code>) – Whether to raise an error on failure during the execution of the Generator or
  validation of the JSON output.
- **max_workers** (<code>int</code>) – The maximum number of workers to use in the thread pool executor.

#### warm_up

```python
warm_up() -> None
```

Warm up the LLM provider component.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMMetadataExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMMetadataExtractor</code> – An instance of the component.

#### run

```python
run(
    documents: list[Document], page_range: list[str | int] | None = None
) -> dict[str, Any]
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, the metadata will be extracted from the specified range of pages. This component
will split the documents into pages and extract metadata from the specified range of pages. The metadata will be
extracted from the entire document if `page_range` is not provided.

The original documents will be returned updated with the extracted metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of documents to extract metadata from.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
  metadata from the first and third pages of each document. It also accepts printable range
  strings, e.g.: ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
  If None, metadata will be extracted from the entire document for each document in the
  documents list.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the keys:
  - "documents": A list of documents that were successfully updated with the extracted metadata.
  - "failed_documents": A list of documents that failed to extract metadata. These documents will have
    "metadata_extraction_error" and "metadata_extraction_response" in their metadata. These documents can be
    re-run with the extractor to extract metadata.
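Continuing the usage example above, `page_range` can also be supplied at run time to restrict extraction to specific pages (a minimal sketch; it accepts page numbers and printable range strings as described above):

```python
# Extract metadata only from pages 1-3 and 5 of each document; other pages are ignored.
result = extractor.run(documents=docs, page_range=["1-3", "5"])
```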
## named_entity_extractor

### NamedEntityExtractorBackend

Bases: <code>Enum</code>

NLP backend to use for Named Entity Recognition.

#### from_str

```python
from_str(string: str) -> NamedEntityExtractorBackend
```

Convert a string to a NamedEntityExtractorBackend enum.
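The backend can be given either as a plain string or as the enum; a minimal sketch using `from_str` (the `"hugging_face"` value and model name match the NamedEntityExtractor usage example further below):

```python
from haystack.components.extractors.named_entity_extractor import (
    NamedEntityExtractor,
    NamedEntityExtractorBackend,
)

# Resolve the backend from a string, for example when it comes from a config file.
backend = NamedEntityExtractorBackend.from_str("hugging_face")
extractor = NamedEntityExtractor(backend=backend, model="dslim/bert-base-NER")
```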
### NamedEntityAnnotation

Describes a single NER annotation.

**Parameters:**

- **entity** (<code>str</code>) – Entity label.
- **start** (<code>int</code>) – Start index of the entity in the document.
- **end** (<code>int</code>) – End index of the entity in the document.
- **score** (<code>float | None</code>) – Score calculated by the model.
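Since `start` and `end` are character offsets into the document's content, the annotated surface text can be recovered by slicing. A minimal sketch using the NamedEntityExtractor described below (the labels and scores depend on the chosen model):

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()

doc = Document(content="My name is Clara and I live in Berkeley, California.")
annotated = extractor.run(documents=[doc])["documents"][0]

# Each annotation exposes the entity label, character offsets, and an optional score.
for ann in NamedEntityExtractor.get_stored_annotations(annotated) or []:
    print(ann.entity, annotated.content[ann.start : ann.end], ann.score)
```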
### NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The
former can be used with any sequence classification model from the
[Hugging Face model hub](https://huggingface.co/models), while the
latter can be used with any [spaCy model](https://spacy.io/models)
that contains an NER component. Annotations are stored as metadata
in the documents.

Usage example:

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```

#### __init__

```python
__init__(
    *,
    backend: str | NamedEntityExtractorBackend,
    model: str,
    pipeline_kwargs: dict[str, Any] | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    )
) -> None
```

Create a Named Entity extractor component.

**Parameters:**

- **backend** (<code>str | NamedEntityExtractorBackend</code>) – Backend to use for NER.
- **model** (<code>str</code>) – Name of the model or a path to the model on
  the local disk. Dependent on the backend.
- **pipeline_kwargs** (<code>dict\[str, Any\] | None</code>) – Keyword arguments passed to the pipeline. The
  pipeline can override these arguments. Dependent on the backend.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`,
  the default device is automatically selected. If a
  device/device map is specified in `pipeline_kwargs`,
  it overrides this parameter (only applicable to the
  HuggingFace backend).
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.

#### warm_up

```python
warm_up() -> None
```

Initialize the component.

**Raises:**

- <code>ComponentError</code> – If the backend fails to initialize successfully.

#### run

```python
run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to process.
- **batch_size** (<code>int</code>) – Batch size used for processing the documents.
**Returns:**

- <code>dict\[str, Any\]</code> – Processed documents.

**Raises:**

- <code>ComponentError</code> – If the backend fails to process a document.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> NamedEntityExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>NamedEntityExtractor</code> – Deserialized component.

#### initialized

```python
initialized: bool
```

Returns whether the extractor is ready to annotate text.

#### get_stored_annotations

```python
get_stored_annotations(
    document: Document,
) -> list[NamedEntityAnnotation] | None
```

Returns the document's named entity annotations stored in its metadata, if any.

**Parameters:**

- **document** (<code>Document</code>) – Document whose annotations are to be fetched.

**Returns:**

- <code>list\[NamedEntityAnnotation\] | None</code> – The stored annotations.

## regex_text_extractor

### RegexTextExtractor

Extracts text from chat messages or string input using a regex pattern.

RegexTextExtractor parses input text or ChatMessages using a provided regular expression pattern.
It can be configured to search through all messages or only the last message in a list of ChatMessages.

### Usage example

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

# Using with a string
parser = RegexTextExtractor(regex_pattern='<issue url="(.+)">')
result = parser.run(text_or_messages='<issue url="github.com/hahahaha">hahahah</issue>')
# result: {"captured_text": "github.com/hahahaha"}

# Using with ChatMessages
messages = [ChatMessage.from_user('<issue url="github.com/hahahaha">hahahah</issue>')]
result = parser.run(text_or_messages=messages)
# result: {"captured_text": "github.com/hahahaha"}
```

#### __init__

```python
__init__(regex_pattern: str) -> None
```

Creates an instance of the RegexTextExtractor component.

**Parameters:**

- **regex_pattern** (<code>str</code>) – The regular expression pattern used to extract text.
  The pattern should include a capture group to extract the desired text.
  Example: `'<issue url="(.+)">'` captures `'github.com/hahahaha'` from `'<issue url="github.com/hahahaha">'`.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> RegexTextExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>RegexTextExtractor</code> – The deserialized component.
#### run

```python
run(text_or_messages: str | list[ChatMessage]) -> dict[str, str]
```

Extracts text from the input using the configured regex pattern.

**Parameters:**

- **text_or_messages** (<code>str | list\[ChatMessage\]</code>) – Either a string or a list of ChatMessage objects to search through.

**Returns:**

- <code>dict\[str, str\]</code> – `{"captured_text": "matched text"}` if a match is found,
  or `{"captured_text": ""}` if no match is found.

**Raises:**

- <code>TypeError</code> – If the input is a list and its last element is not a ChatMessage instance.
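When the pattern does not match, the component returns an empty string rather than raising. A minimal sketch of the documented no-match behavior:

```python
from haystack.components.extractors import RegexTextExtractor

extractor = RegexTextExtractor(regex_pattern=r'<issue url="(.+)">')

# No match for the capture group: run returns an empty "captured_text".
result = extractor.run(text_or_messages="plain text without the expected tag")
assert result == {"captured_text": ""}
```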