---
title: "Extractors"
id: extractors-api
description: "Components to extract specific elements from textual data."
slug: "/extractors-api"
---

## image/llm_document_content_extractor

### LLMDocumentContentExtractor

Extracts textual content and optionally metadata from image-based documents using a vision-enabled LLM.

The component makes one LLM call per document, using a single prompt. It converts each document to an image via
DocumentToImageContent and sends it to the ChatGenerator. The prompt must not contain Jinja variables.

Response handling:

- If the LLM returns a **plain string** (non-JSON or not a JSON object), it is written to the document's content.
- If the LLM returns a **JSON object with only the key** `document_content`, that value is written to content.
- If the LLM returns a **JSON object with multiple keys**, the value of `document_content` (if present) is
  written to content and all other keys are merged into the document's metadata.

The ChatGenerator can be configured to return JSON (e.g. `response_format={"type": "json_object"}`
in `generation_kwargs`).

Documents that fail extraction are returned in `failed_documents` with `content_extraction_error` in their metadata.

### Usage example

```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

prompt = """
Extract the content from the provided image.
Format everything as markdown. Return only the extracted content as a JSON object with the key 'document_content'.
No markdown, no code fence, only raw JSON.

Extract metadata about the image like source of the image, date of creation, etc. if you can.
Return this metadata as additional key-value pairs in the same JSON object.
"""

# The JSON response format is configured on the ChatGenerator, not on the extractor.
chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "document_content": {"type": "string"},
                        "author": {"type": "string"},
                        "date": {"type": "string"},
                        "document_type": {"type": "string"},
                        "title": {"type": "string"},
                    },
                    "additionalProperties": False,
                },
            },
        }
    }
)
extractor = LLMDocumentContentExtractor(chat_generator=chat_generator, prompt=prompt)
documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]
result = extractor.run(documents=documents)
updated_documents = result["documents"]
```

#### __init__

```python
__init__(
    *,
    chat_generator: ChatGenerator,
    prompt: str = DEFAULT_PROMPT_TEMPLATE,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3
)
```

Initialize the LLMDocumentContentExtractor component.

**Parameters:**

- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator that supports vision input. Optionally configured for JSON
  (e.g. `response_format={"type": "json_object"}` in `generation_kwargs`).
- **prompt** (<code>str</code>) – Prompt for extraction. Must not contain Jinja variables.
- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata are resolved relative to this path. If None, file paths are treated as absolute paths.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within (width, height) while keeping the aspect ratio.
- **raise_on_failure** (<code>bool</code>) – If True, exceptions from the LLM are raised. If False, failed documents are returned.
- **max_workers** (<code>int</code>) – Maximum number of threads for parallel LLM calls.

#### warm_up

```python
warm_up()
```

Warm up the ChatGenerator if it has a warm_up method.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMDocumentContentExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMDocumentContentExtractor</code> – An instance of the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Run extraction on image-based documents. One LLM call per document.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of image-based documents to process. Each must have a valid file path in its metadata.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with "documents" (successfully processed) and "failed_documents" (with failure metadata).

## llm_metadata_extractor

### LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).
The metadata is extracted by providing a prompt to an LLM that generates it.

This component expects a list of documents and a prompt as input. The prompt should have a variable called
`document` that points to a single document in the list, so you can use `{{ document.content }}` in the prompt
to access the document's content.

The component runs the LLM on each document in the list and adds the extracted metadata to the document's
metadata field. If the LLM fails to extract metadata from a document, that document is added to the
`failed_documents` list with the keys `metadata_extraction_error` and `metadata_extraction_response` in its
metadata. These documents can be re-run with another extractor that uses `metadata_extraction_response` and
`metadata_extraction_error` in its prompt.

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library")
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_completion_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "entities": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "entity": {"type": "string"},
                                    "entity_type": {"type": "string"}
                                },
                                "required": ["entity", "entity_type"],
                                "additionalProperties": False
                            }
                        }
                    },
                    "required": ["entities"],
                    "additionalProperties": False
                }
            }
        },
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
    meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
           {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
    meta: {'entities': [
           {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
           {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
        ]})
    ]
    'failed_documents': []
   }
>>
```

#### __init__

```python
__init__(
    prompt: str,
    chat_generator: ChatGenerator,
    expected_keys: list[str] | None = None,
    page_range: list[str | int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3,
)
```

Initializes the LLMMetadataExtractor.

**Parameters:**

- **prompt** (<code>str</code>) – The prompt to be used for the LLM.
- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator instance which represents the LLM. For the component to work,
  the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, you
  should pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- **expected_keys** (<code>list\[str\] | None</code>) – The keys expected in the JSON output from the LLM.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.
  ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
  If None, metadata will be extracted from the entire document for each document in the documents list.
  This parameter is optional and can be overridden in the `run` method.
- **raise_on_failure** (<code>bool</code>) – Whether to raise an error on failure during the execution of the Generator or
  validation of the JSON output.
- **max_workers** (<code>int</code>) – The maximum number of workers to use in the thread pool executor.

#### warm_up

```python
warm_up()
```

Warm up the LLM provider component.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMMetadataExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMMetadataExtractor</code> – An instance of the component.

#### run

```python
run(documents: list[Document], page_range: list[str | int] | None = None)
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, the component splits the documents into pages and extracts metadata only from the
specified range of pages. If `page_range` is not provided, metadata is extracted from the entire document.

The original documents are returned updated with the extracted metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of documents to extract metadata from.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.
  ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
  If None, metadata will be extracted from the entire document for each document in the documents list.

**Returns:**

- – A dictionary with the keys:
  - "documents": A list of documents that were successfully updated with the extracted metadata.
  - "failed_documents": A list of documents that failed to extract metadata. These documents will have
    "metadata_extraction_error" and "metadata_extraction_response" in their metadata. These documents can be
    re-run with the extractor to extract metadata.

## named_entity_extractor

### NamedEntityExtractorBackend

Bases: <code>Enum</code>

NLP backend to use for Named Entity Recognition.
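The string-to-enum conversion that `from_str` (below) performs can be pictured with a minimal, self-contained sketch. The `Backend` class and its member names here are hypothetical stand-ins for the real `NamedEntityExtractorBackend`; only the two documented backends (Hugging Face and spaCy) are assumed:

```python
from enum import Enum


class Backend(Enum):
    # Hypothetical stand-in for NamedEntityExtractorBackend, covering the
    # two backends the component documents.
    HUGGING_FACE = "hugging_face"
    SPACY = "spacy"

    @classmethod
    def from_str(cls, string: str) -> "Backend":
        # Look the member up by its string value; reject unknown backends.
        try:
            return cls(string)
        except ValueError as err:
            raise ValueError(f"Unknown NER backend '{string}'") from err


print(Backend.from_str("spacy"))  # Backend.SPACY
```

The same pattern lets a config file carry a plain string while the component works with a typed enum internally.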
#### from_str

```python
from_str(string: str) -> NamedEntityExtractorBackend
```

Convert a string to a NamedEntityExtractorBackend enum.

### NamedEntityAnnotation

Describes a single NER annotation.

**Parameters:**

- **entity** (<code>str</code>) – Entity label.
- **start** (<code>int</code>) – Start index of the entity in the document.
- **end** (<code>int</code>) – End index of the entity in the document.
- **score** (<code>float | None</code>) – Score calculated by the model.

### NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The
former can be used with any sequence classification model from the
[Hugging Face model hub](https://huggingface.co/models), while the
latter can be used with any [spaCy model](https://spacy.io/models)
that contains an NER component. Annotations are stored as metadata
in the documents.

Usage example:

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()  # initialize the backend before running outside a pipeline
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```

#### __init__

```python
__init__(
    *,
    backend: str | NamedEntityExtractorBackend,
    model: str,
    pipeline_kwargs: dict[str, Any] | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    )
) -> None
```

Create a Named Entity extractor component.

**Parameters:**

- **backend** (<code>str | NamedEntityExtractorBackend</code>) – Backend to use for NER.
- **model** (<code>str</code>) – Name of the model or a path to the model on
  the local disk. Dependent on the backend.
- **pipeline_kwargs** (<code>dict\[str, Any\] | None</code>) – Keyword arguments passed to the pipeline. The
  pipeline can override these arguments. Dependent on the backend.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`,
  the default device is automatically selected. If a
  device/device map is specified in `pipeline_kwargs`,
  it overrides this parameter (only applicable to the
  HuggingFace backend).
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.

#### warm_up

```python
warm_up()
```

Initialize the component.

**Raises:**

- <code>ComponentError</code> – If the backend fails to initialize successfully.

#### run

```python
run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to process.
- **batch_size** (<code>int</code>) – Batch size used for processing the documents.

**Returns:**

- <code>dict\[str, Any\]</code> – Processed documents.

**Raises:**

- <code>ComponentError</code> – If the backend fails to process a document.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> NamedEntityExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>NamedEntityExtractor</code> – Deserialized component.

#### initialized

```python
initialized: bool
```

Returns whether the extractor is ready to annotate text.

#### get_stored_annotations

```python
get_stored_annotations(
    document: Document,
) -> list[NamedEntityAnnotation] | None
```

Returns the document's named entity annotations stored in its metadata, if any.

**Parameters:**

- **document** (<code>Document</code>) – Document whose annotations are to be fetched.

**Returns:**

- <code>list\[NamedEntityAnnotation\] | None</code> – The stored annotations.
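Annotations carry offsets rather than the matched text itself, so the entity's surface form is recovered by slicing the document content. A minimal sketch, using a plain dataclass as a stand-in for NamedEntityAnnotation and assuming `start`/`end` are character indices into the content (which the parameter descriptions above suggest):

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    # Stand-in for NamedEntityAnnotation: a label plus character offsets.
    entity: str
    start: int
    end: int


content = "My name is Clara and I live in Berkeley, California."
ann = Annotation(entity="PER", start=11, end=16)

# Slice the document content with the annotation's offsets to get the surface form.
surface = content[ann.start:ann.end]
print(surface)  # Clara
```

Storing offsets instead of copied substrings keeps the annotations compact and lets downstream code highlight entities in the original text.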
## regex_text_extractor

### RegexTextExtractor

Extracts text from chat message or string input using a regex pattern.

RegexTextExtractor parses input text or ChatMessages using a provided regular expression pattern.
It can be configured to search through all messages or only the last message in a list of ChatMessages.

### Usage example

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

# Using with a string
extractor = RegexTextExtractor(regex_pattern='<issue url="(.+)">')
result = extractor.run(text_or_messages='<issue url="github.com/hahahaha">hahahah</issue>')
# result: {"captured_text": "github.com/hahahaha"}

# Using with ChatMessages
messages = [ChatMessage.from_user('<issue url="github.com/hahahaha">hahahah</issue>')]
result = extractor.run(text_or_messages=messages)
# result: {"captured_text": "github.com/hahahaha"}
```

#### __init__

```python
__init__(regex_pattern: str)
```

Creates an instance of the RegexTextExtractor component.

**Parameters:**

- **regex_pattern** (<code>str</code>) – The regular expression pattern used to extract text.
  The pattern should include a capture group to extract the desired text.
  Example: `'<issue url="(.+)">'` captures `'github.com/hahahaha'` from `'<issue url="github.com/hahahaha">'`.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> RegexTextExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>RegexTextExtractor</code> – The deserialized component.

#### run

```python
run(text_or_messages: str | list[ChatMessage]) -> dict[str, str]
```

Extracts text from input using the configured regex pattern.

**Parameters:**

- **text_or_messages** (<code>str | list\[ChatMessage\]</code>) – Either a string or a list of ChatMessage objects to search through.

**Returns:**

- <code>dict\[str, str\]</code> – - `{"captured_text": "matched text"}` if a match is found
  - `{"captured_text": ""}` if no match is found

**Raises:**

- <code>TypeError</code> – If the input is a list and its last element is not a ChatMessage instance.
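The capture behavior described above can be tried out with Python's `re` module before handing a pattern to RegexTextExtractor. A sketch of the assumed semantics (first capture group of the first match, empty string when nothing matches); the `capture` helper is illustrative, not part of the component's API:

```python
import re

pattern = r'<issue url="(.+)">'


def capture(text: str) -> str:
    # Return the first capture group of the first match, or "" on no match,
    # mirroring the documented {"captured_text": ...} output shape.
    match = re.search(pattern, text)
    return match.group(1) if match else ""


print(capture('<issue url="github.com/hahahaha">hahahah</issue>'))  # github.com/hahahaha
print(repr(capture("no issue tag here")))  # ''
```

This is a quick way to verify that a pattern's capture group grabs exactly the text you want before wiring it into a pipeline.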