---
title: "Extractors"
id: extractors-api
description: "Components to extract specific elements from textual data."
slug: "/extractors-api"
---

<a id="image/llm_document_content_extractor"></a>

## Module image/llm\_document\_content\_extractor

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor"></a>

### LLMDocumentContentExtractor

Extracts textual content from image-based documents using a vision-enabled LLM (Large Language Model).

This component converts each input document into an image using the DocumentToImageContent component,
uses a prompt to instruct the LLM on how to extract content, and uses a ChatGenerator to extract structured
textual content based on the provided prompt.

The prompt must not contain variables; it should only include instructions for the LLM. Image data and the prompt
are passed together to the LLM as a chat message.

Documents for which the LLM fails to extract content are returned in a separate `failed_documents` list. These
failed documents will have a `content_extraction_error` entry in their metadata. This metadata can be used for
debugging or for reprocessing the documents later.

### Usage example

```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

chat_generator = OpenAIChatGenerator()
extractor = LLMDocumentContentExtractor(chat_generator=chat_generator)

documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]

updated_documents = extractor.run(documents=documents)["documents"]
print(updated_documents)
# [Document(content='Extracted text from image.jpg',
#  meta={'file_path': 'image.jpg'}),
#  ...]
```

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.__init__"></a>

#### LLMDocumentContentExtractor.\_\_init\_\_

```python
def __init__(*,
             chat_generator: ChatGenerator,
             prompt: str = DEFAULT_PROMPT_TEMPLATE,
             file_path_meta_field: str = "file_path",
             root_path: str | None = None,
             detail: Literal["auto", "high", "low"] | None = None,
             size: tuple[int, int] | None = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initialize the LLMDocumentContentExtractor component.

**Arguments**:

- `chat_generator`: A ChatGenerator instance representing the LLM used to extract text. This generator must
support vision-based input and return a plain text response.
- `prompt`: Instructional text provided to the LLM. It must not contain Jinja variables.
The prompt should only contain instructions on how to extract the content of the image-based document.
- `file_path_meta_field`: The metadata field in the Document that contains the file path to the image or PDF.
- `root_path`: The root directory path where document files are located. If provided, file paths in
document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- `detail`: Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
This will be passed to chat_generator when processing the images.
- `size`: If provided, resizes the image to fit within the specified dimensions (width, height) while
maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
when working with models that have resolution constraints or when transmitting images to remote services.
- `raise_on_failure`: If True, exceptions from the LLM are raised. If False, failed documents are logged
and returned.
- `max_workers`: Maximum number of threads used to parallelize LLM calls across documents using a
ThreadPoolExecutor.
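
The parameters above can be combined to control image handling and error behavior, and the `failed_documents`
output can be inspected afterwards. The following is a minimal, illustrative sketch; the prompt text, directory
layout, file names, and parameter values are assumptions rather than shipped defaults:

```python
from haystack import Document
from haystack.components.extractors.image import LLMDocumentContentExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

extractor = LLMDocumentContentExtractor(
    chat_generator=OpenAIChatGenerator(),
    # Assumed custom instruction; it must not contain Jinja variables.
    prompt="Transcribe all readable text in this document image and return it as plain text.",
    file_path_meta_field="file_path",  # metadata key that holds the image/PDF path
    root_path="data/scans",            # assumed project directory; paths in meta are resolved against it
    detail="low",                      # OpenAI-only image detail hint
    size=(1024, 1024),                 # downscale to fit within 1024x1024, keeping the aspect ratio
    raise_on_failure=False,            # collect failures in `failed_documents` instead of raising
    max_workers=4,                     # parallel LLM calls across documents
)

documents = [Document(content="", meta={"file_path": "invoice_001.png"})]  # hypothetical file
result = extractor.run(documents=documents)

# Successfully processed documents have their `content` filled with the extracted text;
# failed documents keep their original content and record the reason for the failure.
for doc in result["failed_documents"]:
    print(doc.meta["file_path"], "->", doc.meta.get("content_extraction_error"))
```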

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.warm_up"></a>

#### LLMDocumentContentExtractor.warm\_up

```python
def warm_up()
```

Warm up the ChatGenerator if it has a warm_up method.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.to_dict"></a>

#### LLMDocumentContentExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.from_dict"></a>

#### LLMDocumentContentExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMDocumentContentExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.run"></a>

#### LLMDocumentContentExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Run content extraction on a list of image-based documents using a vision-capable LLM.

Each document is passed to the LLM along with a predefined prompt. The response is used to update the document's
content. If the extraction fails, the document is returned in the `failed_documents` list with metadata
describing the failure.

**Arguments**:

- `documents`: A list of image-based documents to process. Each must have a valid file path in its metadata.

**Returns**:

A dictionary with:
- "documents": Successfully processed documents, updated with extracted content.
- "failed_documents": Documents that failed processing, annotated with failure metadata.

<a id="llm_metadata_extractor"></a>

## Module llm\_metadata\_extractor

<a id="llm_metadata_extractor.LLMMetadataExtractor"></a>

### LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).

The metadata is extracted by providing a prompt to an LLM that generates the metadata.

This component expects as input a list of documents and a prompt. The prompt should have a variable called
`document` that will point to a single document in the list of documents. So to access the content of the document,
you can use `{{ document.content }}` in the prompt.

The component will run the LLM on each document in the list and extract metadata from the document. The metadata
will be added to the document's metadata field. If the LLM fails to extract metadata from a document, the document
will be added to the `failed_documents` list. The failed documents will have the keys `metadata_extraction_error`
and `metadata_extraction_response` in their metadata. These documents can be re-run with another extractor to
extract metadata by using the `metadata_extraction_response` and `metadata_extraction_error` in the prompt.

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library")
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_completion_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {"type": "json_object"},
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.warm_up()
extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
    meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
          {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
    meta: {'entities': [
            {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
            {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
        ]})
    ],
    'failed_documents': []
   }
```

<a id="llm_metadata_extractor.LLMMetadataExtractor.__init__"></a>

#### LLMMetadataExtractor.\_\_init\_\_

```python
def __init__(prompt: str,
             chat_generator: ChatGenerator,
             expected_keys: list[str] | None = None,
             page_range: list[str | int] | None = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initializes the LLMMetadataExtractor.

**Arguments**:

- `prompt`: The prompt to be used for the LLM.
- `chat_generator`: A ChatGenerator instance which represents the LLM. In order for the component to work,
the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, you
should pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- `expected_keys`: The keys expected in the JSON output from the LLM.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
metadata from the first and third pages of each document. It also accepts printable range strings, e.g.:
['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
If None, metadata will be extracted from the entire document for each document in the documents list.
This parameter is optional and can be overridden in the `run` method.
- `raise_on_failure`: Whether to raise an error on failure during the execution of the Generator or
validation of the JSON output.
- `max_workers`: The maximum number of workers to use in the thread pool executor.
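
The `page_range` option only applies to documents whose content contains page breaks. A small illustrative
sketch, assuming pages separated by form-feed characters (`\f`), which is how Haystack converters and splitters
commonly mark page boundaries; the prompt and expected key are assumptions:

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

# Hypothetical prompt asking for a JSON object with a single "title" key.
TITLE_PROMPT = 'Return a JSON object like {"title": "..."} describing this text: {{ document.content }}'

extractor = LLMMetadataExtractor(
    prompt=TITLE_PROMPT,
    chat_generator=OpenAIChatGenerator(
        generation_kwargs={"response_format": {"type": "json_object"}}
    ),
    expected_keys=["title"],
    page_range=["1-2"],  # only the first two pages of each document are sent to the LLM
)
extractor.warm_up()

# "\f" marks page boundaries, so only the text before the second page break is used.
doc = Document(content="Page one text\fPage two text\fPage three text")
result = extractor.run(documents=[doc])
```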

<a id="llm_metadata_extractor.LLMMetadataExtractor.warm_up"></a>

#### LLMMetadataExtractor.warm\_up

```python
def warm_up()
```

Warm up the LLM provider component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.to_dict"></a>

#### LLMMetadataExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="llm_metadata_extractor.LLMMetadataExtractor.from_dict"></a>

#### LLMMetadataExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMMetadataExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.run"></a>

#### LLMMetadataExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document], page_range: list[str | int] | None = None)
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, the component splits each document into pages and extracts metadata only from the
specified range of pages. If `page_range` is not provided, metadata is extracted from the entire document.

The original documents are returned updated with the extracted metadata.

**Arguments**:

- `documents`: List of documents to extract metadata from.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
metadata from the first and third pages of each document. It also accepts printable range strings, e.g.:
['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
If None, metadata will be extracted from the entire document for each document in the documents list.

**Returns**:

A dictionary with the keys:
- "documents": A list of documents that were successfully updated with the extracted metadata.
- "failed_documents": A list of documents for which metadata extraction failed. These documents have
"metadata_extraction_error" and "metadata_extraction_response" in their metadata and can be re-run with
another extractor to extract the missing metadata.
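
Because failed documents keep both the raw LLM response and the error message in their metadata, they can be
routed through a second extractor whose prompt shows the model its previous attempt. A minimal sketch that reuses
the `chat_generator`, `extractor`, and `docs` from the usage example above; the corrective prompt wording is an
assumption:

```python
# Hypothetical follow-up prompt that confronts the LLM with its earlier, invalid output.
RETRY_PROMPT = """
The previous attempt to extract metadata from this text failed.
Previous response: {{ document.meta.metadata_extraction_response }}
Error: {{ document.meta.metadata_extraction_error }}
Text: {{ document.content }}
Return only a valid JSON object with an "entities" key.
"""

retry_extractor = LLMMetadataExtractor(
    prompt=RETRY_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
)
retry_extractor.warm_up()

first_pass = extractor.run(documents=docs)
# Only the documents that failed the first pass are sent through the retry extractor.
second_pass = retry_extractor.run(documents=first_pass["failed_documents"])
```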

<a id="named_entity_extractor"></a>

## Module named\_entity\_extractor

<a id="named_entity_extractor.NamedEntityExtractorBackend"></a>

### NamedEntityExtractorBackend

NLP backend to use for Named Entity Recognition.

<a id="named_entity_extractor.NamedEntityExtractorBackend.HUGGING_FACE"></a>

#### HUGGING\_FACE

Uses a Hugging Face model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.SPACY"></a>

#### SPACY

Uses a spaCy model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.from_str"></a>

#### NamedEntityExtractorBackend.from\_str

```python
@staticmethod
def from_str(string: str) -> "NamedEntityExtractorBackend"
```

Convert a string to a NamedEntityExtractorBackend enum.

<a id="named_entity_extractor.NamedEntityAnnotation"></a>

### NamedEntityAnnotation

Describes a single NER annotation.

**Arguments**:

- `entity`: Entity label.
- `start`: Start index of the entity in the document.
- `end`: End index of the entity in the document.
- `score`: Score calculated by the model.

<a id="named_entity_extractor.NamedEntityExtractor"></a>

### NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The
former can be used with any sequence classification model from the
[Hugging Face model hub](https://huggingface.co/models), while the
latter can be used with any [spaCy model](https://spacy.io/models)
that contains an NER component. Annotations are stored as metadata
in the documents.

Usage example:

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```

<a id="named_entity_extractor.NamedEntityExtractor.__init__"></a>

#### NamedEntityExtractor.\_\_init\_\_

```python
def __init__(
    *,
    backend: str | NamedEntityExtractorBackend,
    model: str,
    pipeline_kwargs: dict[str, Any] | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"],
                                               strict=False)
) -> None
```

Create a Named Entity extractor component.

**Arguments**:

- `backend`: Backend to use for NER.
- `model`: Name of the model or a path to the model on
the local disk. Dependent on the backend.
- `pipeline_kwargs`: Keyword arguments passed to the pipeline. The
pipeline can override these arguments. Dependent on the backend.
- `device`: The device on which the model is loaded. If `None`,
the default device is automatically selected. If a
device/device map is specified in `pipeline_kwargs`,
it overrides this parameter (only applicable to the
HuggingFace backend).
- `token`: The API token to download private models from Hugging Face.

<a id="named_entity_extractor.NamedEntityExtractor.warm_up"></a>

#### NamedEntityExtractor.warm\_up

```python
def warm_up()
```

Initialize the component.

**Raises**:

- `ComponentError`: If the backend fails to initialize successfully.
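
The extractor works the same way with the spaCy backend, and each stored annotation carries character offsets
that can be used to recover the annotated text. A minimal sketch, assuming the `en_core_web_sm` spaCy pipeline
is installed locally (for example via `python -m spacy download en_core_web_sm`):

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import (
    NamedEntityExtractor,
    NamedEntityExtractorBackend,
)

extractor = NamedEntityExtractor(
    backend=NamedEntityExtractorBackend.SPACY,  # the string form "spacy" is also accepted
    model="en_core_web_sm",                     # assumed: spaCy pipeline installed locally
)
extractor.warm_up()  # raises ComponentError if the spaCy pipeline cannot be loaded

documents = [Document(content="My name is Clara and I live in Berkeley, California.")]
results = extractor.run(documents=documents)["documents"]

for doc in results:
    for annotation in NamedEntityExtractor.get_stored_annotations(doc) or []:
        # `start` and `end` are character offsets into `doc.content`.
        print(doc.content[annotation.start:annotation.end], "->", annotation.entity)
```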

<a id="named_entity_extractor.NamedEntityExtractor.run"></a>

#### NamedEntityExtractor.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Arguments**:

- `documents`: Documents to process.
- `batch_size`: Batch size used for processing the documents.

**Raises**:

- `ComponentError`: If the backend fails to process a document.

**Returns**:

Processed documents.

<a id="named_entity_extractor.NamedEntityExtractor.to_dict"></a>

#### NamedEntityExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="named_entity_extractor.NamedEntityExtractor.from_dict"></a>

#### NamedEntityExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "NamedEntityExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="named_entity_extractor.NamedEntityExtractor.initialized"></a>

#### NamedEntityExtractor.initialized

```python
@property
def initialized() -> bool
```

Returns whether the extractor is ready to annotate text.

<a id="named_entity_extractor.NamedEntityExtractor.get_stored_annotations"></a>

#### NamedEntityExtractor.get\_stored\_annotations

```python
@classmethod
def get_stored_annotations(
        cls, document: Document) -> list[NamedEntityAnnotation] | None
```

Returns the document's named entity annotations stored in its metadata, if any.

**Arguments**:

- `document`: Document whose annotations are to be fetched.

**Returns**:

The stored annotations.

<a id="regex_text_extractor"></a>

## Module regex\_text\_extractor

<a id="regex_text_extractor.RegexTextExtractor"></a>

### RegexTextExtractor

Extracts text from a chat message or string input using a regex pattern.

RegexTextExtractor parses input text or ChatMessages using a provided regular expression pattern.
It can be configured to search through all messages or only the last message in a list of ChatMessages.

### Usage example

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

# Using with a string
parser = RegexTextExtractor(regex_pattern='<issue url="(.+)">')
result = parser.run(text_or_messages='<issue url="github.com/hahahaha">hahahah</issue>')
# result: {"captured_text": "github.com/hahahaha"}

# Using with ChatMessages
messages = [ChatMessage.from_user('<issue url="github.com/hahahaha">hahahah</issue>')]
result = parser.run(text_or_messages=messages)
# result: {"captured_text": "github.com/hahahaha"}
```

<a id="regex_text_extractor.RegexTextExtractor.__init__"></a>

#### RegexTextExtractor.\_\_init\_\_

```python
def __init__(regex_pattern: str)
```

Creates an instance of the RegexTextExtractor component.

**Arguments**:

- `regex_pattern`: The regular expression pattern used to extract text.
The pattern should include a capture group to extract the desired text.
Example: `'<issue url="(.+)">'` captures `'github.com/hahahaha'` from `'<issue url="github.com/hahahaha">'`.

<a id="regex_text_extractor.RegexTextExtractor.run"></a>

#### RegexTextExtractor.run

```python
@component.output_types(captured_text=str)
def run(text_or_messages: str | list[ChatMessage]) -> dict[str, str]
```

Extracts text from the input using the configured regex pattern.

**Arguments**:

- `text_or_messages`: Either a string or a list of ChatMessage objects to search through.

**Raises**:

- `ValueError`: If the input is a list and its last element is not a ChatMessage instance.

**Returns**:

- `{"captured_text": "matched text"}` if a match is found
- `{"captured_text": ""}` if no match is found
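
Because a non-matching input yields an empty `captured_text` rather than an error, downstream code can simply
branch on the result. A brief illustrative sketch; the pattern and message content are arbitrary:

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

extractor = RegexTextExtractor(regex_pattern=r'<issue url="(.+)">')

messages = [ChatMessage.from_assistant("This reply contains no issue link.")]
captured = extractor.run(text_or_messages=messages)["captured_text"]

if not captured:
    # An empty string means the pattern produced no match.
    print("No issue URL found")
```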