---
title: Extractors
id: extractors-api
description: Extracts predefined entities out of a piece of text.
slug: "/extractors-api"
---

<a id="named_entity_extractor"></a>

# Module named\_entity\_extractor

<a id="named_entity_extractor.NamedEntityExtractorBackend"></a>

## NamedEntityExtractorBackend

NLP backend to use for Named Entity Recognition.

<a id="named_entity_extractor.NamedEntityExtractorBackend.HUGGING_FACE"></a>

#### HUGGING\_FACE

Uses a Hugging Face model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.SPACY"></a>

#### SPACY

Uses a spaCy model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.from_str"></a>

#### NamedEntityExtractorBackend.from\_str

```python
@staticmethod
def from_str(string: str) -> "NamedEntityExtractorBackend"
```

Convert a string to a NamedEntityExtractorBackend enum.

<a id="named_entity_extractor.NamedEntityAnnotation"></a>

## NamedEntityAnnotation

Describes a single NER annotation.

**Arguments**:

- `entity`: Entity label.
- `start`: Start index of the entity in the document.
- `end`: End index of the entity in the document.
- `score`: Score calculated by the model.

<a id="named_entity_extractor.NamedEntityExtractor"></a>

## NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The former can be used with any sequence classification model from the [Hugging Face model hub](https://huggingface.co/models), while the latter can be used with any [spaCy model](https://spacy.io/models) that contains an NER component. Annotations are stored as metadata in the documents.

Usage example:
```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```

<a id="named_entity_extractor.NamedEntityExtractor.__init__"></a>

#### NamedEntityExtractor.\_\_init\_\_

```python
def __init__(
    *,
    backend: Union[str, NamedEntityExtractorBackend],
    model: str,
    pipeline_kwargs: Optional[dict[str, Any]] = None,
    device: Optional[ComponentDevice] = None,
    token: Optional[Secret] = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"], strict=False)
) -> None
```

Create a Named Entity extractor component.

**Arguments**:

- `backend`: Backend to use for NER.
- `model`: Name of the model or a path to the model on the local disk. Dependent on the backend.
- `pipeline_kwargs`: Keyword arguments passed to the pipeline. The pipeline can override these arguments. Dependent on the backend.
- `device`: The device on which the model is loaded. If `None`, the default device is automatically selected. If a device/device map is specified in `pipeline_kwargs`, it overrides this parameter (only applicable to the Hugging Face backend).
- `token`: The API token to download private models from Hugging Face.

<a id="named_entity_extractor.NamedEntityExtractor.warm_up"></a>

#### NamedEntityExtractor.warm\_up

```python
def warm_up()
```

Initialize the component.

**Raises**:

- `ComponentError`: If the backend fails to initialize successfully.

<a id="named_entity_extractor.NamedEntityExtractor.run"></a>

#### NamedEntityExtractor.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Arguments**:

- `documents`: Documents to process.
- `batch_size`: Batch size used for processing the documents.

**Raises**:

- `ComponentError`: If the backend fails to process a document.

**Returns**:

Processed documents.

<a id="named_entity_extractor.NamedEntityExtractor.to_dict"></a>

#### NamedEntityExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="named_entity_extractor.NamedEntityExtractor.from_dict"></a>

#### NamedEntityExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "NamedEntityExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="named_entity_extractor.NamedEntityExtractor.initialized"></a>

#### NamedEntityExtractor.initialized

```python
@property
def initialized() -> bool
```

Returns whether the extractor is ready to annotate text.
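
Each stored annotation carries the score calculated by the model, which makes simple post-filtering straightforward. A minimal sketch of score-based filtering, using a stand-in dataclass rather than the real `NamedEntityAnnotation` class (the field names mirror the ones documented above):

```python
from dataclasses import dataclass


# Stand-in for NamedEntityAnnotation: entity label, character offsets,
# and the model's confidence score. Illustrative only.
@dataclass
class Annotation:
    entity: str
    start: int
    end: int
    score: float


def filter_by_score(annotations: list[Annotation], threshold: float = 0.8) -> list[Annotation]:
    """Keep only annotations the model is sufficiently confident about."""
    return [a for a in annotations if a.score >= threshold]


anns = [
    Annotation("PER", 4, 10, 0.99),
    Annotation("LOC", 31, 39, 0.42),
]
high_confidence = filter_by_score(anns)
print([a.entity for a in high_confidence])  # only the high-confidence entity remains
```

The same pattern applies to the annotations returned by `get_stored_annotations`, since they expose an equivalent `score` attribute.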

<a id="named_entity_extractor.NamedEntityExtractor.get_stored_annotations"></a>

#### NamedEntityExtractor.get\_stored\_annotations

```python
@classmethod
def get_stored_annotations(
    cls, document: Document) -> Optional[list[NamedEntityAnnotation]]
```

Returns the document's named entity annotations stored in its metadata, if any.

**Arguments**:

- `document`: Document whose annotations are to be fetched.

**Returns**:

The stored annotations.

<a id="llm_metadata_extractor"></a>

# Module llm\_metadata\_extractor

<a id="llm_metadata_extractor.LLMMetadataExtractor"></a>

## LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).

The metadata is extracted by providing a prompt to an LLM that generates the metadata.

This component expects as input a list of documents and a prompt. The prompt must have a variable called `document` that points to a single document in the list. To access the content of the document, use `{{ document.content }}` in the prompt.

The component runs the LLM on each document in the list and extracts metadata from it. The metadata is added to the document's metadata field. If the LLM fails to extract metadata from a document, that document is added to the `failed_documents` list. Failed documents have the keys `metadata_extraction_error` and `metadata_extraction_response` in their metadata. These documents can be re-run with another extractor that uses `metadata_extraction_response` and `metadata_extraction_error` in its prompt.

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc.
Now newer digital issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library")
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {"type": "json_object"},
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.warm_up()
extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
    meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
    {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
    meta: {'entities': [
    {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
    {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
    ]})
    ]
    'failed_documents': []
    }
>>
```

<a id="llm_metadata_extractor.LLMMetadataExtractor.__init__"></a>

#### LLMMetadataExtractor.\_\_init\_\_

```python
def __init__(prompt: str,
             chat_generator: ChatGenerator,
             expected_keys: Optional[list[str]] = None,
             page_range: Optional[list[Union[str, int]]] = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initializes the LLMMetadataExtractor.

**Arguments**:

- `prompt`: The prompt to be used for the LLM.
- `chat_generator`: A ChatGenerator instance which represents the LLM. For the component to work, the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- `expected_keys`: The keys expected in the JSON output from the LLM.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract metadata from the first and third pages of each document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12. If None, metadata will be extracted from the entire document for each document in the documents list. This parameter is optional and can be overridden in the `run` method.
- `raise_on_failure`: Whether to raise an error on failure during the execution of the Generator or validation of the JSON output.
- `max_workers`: The maximum number of workers to use in the thread pool executor.
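
The printable range notation accepted by `page_range` can be read as follows. This hypothetical `expand_page_range` helper only illustrates the documented behavior; it is not the library's internal implementation:

```python
from typing import Union


def expand_page_range(page_range: list[Union[str, int]]) -> list[int]:
    """Expand entries like '1-3' or '5' (or plain ints) into a flat list of page numbers."""
    pages: list[int] = []
    for entry in page_range:
        text = str(entry)
        if "-" in text:
            start, end = text.split("-")
            pages.extend(range(int(start), int(end) + 1))  # inclusive range, e.g. '1-3' -> 1, 2, 3
        else:
            pages.append(int(text))
    return pages


print(expand_page_range(["1-3", "5", "8", "10-12"]))
# [1, 2, 3, 5, 8, 10, 11, 12] — matching the example in the docstring above
```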

<a id="llm_metadata_extractor.LLMMetadataExtractor.warm_up"></a>

#### LLMMetadataExtractor.warm\_up

```python
def warm_up()
```

Warm up the LLM provider component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.to_dict"></a>

#### LLMMetadataExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="llm_metadata_extractor.LLMMetadataExtractor.from_dict"></a>

#### LLMMetadataExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMMetadataExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.run"></a>

#### LLMMetadataExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document],
        page_range: Optional[list[Union[str, int]]] = None)
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, this component splits the documents into pages and extracts metadata only from the specified range of pages. If `page_range` is not provided, metadata is extracted from the entire document.

The original documents are returned updated with the extracted metadata.

**Arguments**:

- `documents`: List of documents to extract metadata from.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract metadata from the first and third pages of each document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12. If None, metadata will be extracted from the entire document for each document in the documents list.

**Returns**:

A dictionary with the keys:
- "documents": A list of documents that were successfully updated with the extracted metadata.
- "failed_documents": A list of documents for which metadata extraction failed. These documents will have "metadata_extraction_error" and "metadata_extraction_response" in their metadata. They can be re-run with the extractor to extract metadata.

<a id="image/llm_document_content_extractor"></a>

# Module image/llm\_document\_content\_extractor

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor"></a>

## LLMDocumentContentExtractor

Extracts textual content from image-based documents using a vision-enabled LLM (Large Language Model).

This component converts each input document into an image using the DocumentToImageContent component, uses a prompt to instruct the LLM on how to extract content, and uses a ChatGenerator to extract structured textual content based on the provided prompt.

The prompt must not contain variables; it should only include instructions for the LLM. Image data and the prompt are passed together to the LLM as a chat message.

Documents for which the LLM fails to extract content are returned in a separate `failed_documents` list. These failed documents will have a `content_extraction_error` entry in their metadata. This metadata can be used for debugging or for reprocessing the documents later.

### Usage example
```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

chat_generator = OpenAIChatGenerator()
extractor = LLMDocumentContentExtractor(chat_generator=chat_generator)
documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]
updated_documents = extractor.run(documents=documents)["documents"]
print(updated_documents)
# [Document(content='Extracted text from image.jpg',
#  meta={'file_path': 'image.jpg'}),
#  ...]
```

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.__init__"></a>

#### LLMDocumentContentExtractor.\_\_init\_\_

```python
def __init__(*,
             chat_generator: ChatGenerator,
             prompt: str = DEFAULT_PROMPT_TEMPLATE,
             file_path_meta_field: str = "file_path",
             root_path: Optional[str] = None,
             detail: Optional[Literal["auto", "high", "low"]] = None,
             size: Optional[tuple[int, int]] = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initialize the LLMDocumentContentExtractor component.

**Arguments**:

- `chat_generator`: A ChatGenerator instance representing the LLM used to extract text. This generator must support vision-based input and return a plain text response.
- `prompt`: Instructional text provided to the LLM. It must not contain Jinja variables. The prompt should only contain instructions on how to extract the content of the image-based document.
- `file_path_meta_field`: The metadata field in the Document that contains the file path to the image or PDF.
- `root_path`: The root directory path where document files are located. If provided, file paths in document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- `detail`: Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low". This will be passed to chat_generator when processing the images.
- `size`: If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial when working with models that have resolution constraints or when transmitting images to remote services.
- `raise_on_failure`: If True, exceptions from the LLM are raised. If False, failed documents are logged and returned.
- `max_workers`: Maximum number of threads used to parallelize LLM calls across documents using a ThreadPoolExecutor.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.warm_up"></a>

#### LLMDocumentContentExtractor.warm\_up

```python
def warm_up()
```

Warm up the ChatGenerator if it has a warm_up method.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.to_dict"></a>

#### LLMDocumentContentExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.from_dict"></a>

#### LLMDocumentContentExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMDocumentContentExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.
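
The `to_dict`/`from_dict` pair above follows a serialize-then-restore contract. A schematic sketch of that round trip with a toy component; the dictionary layout here is only an illustration, not Haystack's actual serialization format:

```python
from typing import Any


# Toy component illustrating the to_dict/from_dict round-trip contract.
# The real extractors serialize through Haystack's own helpers.
class ToyExtractor:
    def __init__(self, file_path_meta_field: str = "file_path", max_workers: int = 3):
        self.file_path_meta_field = file_path_meta_field
        self.max_workers = max_workers

    def to_dict(self) -> dict[str, Any]:
        # Capture everything needed to reconstruct the component.
        return {
            "type": "ToyExtractor",
            "init_parameters": {
                "file_path_meta_field": self.file_path_meta_field,
                "max_workers": self.max_workers,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ToyExtractor":
        # Rebuild the component from its serialized init parameters.
        return cls(**data["init_parameters"])


restored = ToyExtractor.from_dict(ToyExtractor(max_workers=5).to_dict())
print(restored.max_workers)  # 5
```

This contract is what lets pipelines containing these components be saved to YAML and loaded back.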

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.run"></a>

#### LLMDocumentContentExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Run content extraction on a list of image-based documents using a vision-capable LLM.

Each document is passed to the LLM along with a predefined prompt. The response is used to update the document's content. If the extraction fails, the document is returned in the `failed_documents` list with metadata describing the failure.

**Arguments**:

- `documents`: A list of image-based documents to process. Each must have a valid file path in its metadata.

**Returns**:

A dictionary with:
- "documents": Successfully processed documents, updated with extracted content.
- "failed_documents": Documents that failed processing, annotated with failure metadata.
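
A common pattern with both LLM extractors is to route `failed_documents` into inspection or a retry loop. A sketch with plain dicts standing in for Document objects (the dict layout is illustrative; only the `content_extraction_error` metadata key comes from the docs above):

```python
def split_by_failure(result: dict) -> tuple[list, list]:
    """Separate successfully processed documents from failed ones, as run() returns them."""
    return result["documents"], result["failed_documents"]


# Simulated run() output: one success, one failure annotated with error metadata.
result = {
    "documents": [
        {"content": "extracted text", "meta": {"file_path": "image.jpg"}},
    ],
    "failed_documents": [
        {"content": "", "meta": {"file_path": "scan.pdf", "content_extraction_error": "timeout"}},
    ],
}

ok, failed = split_by_failure(result)
for doc in failed:
    # Log the failure reason, then e.g. re-run these documents with a different
    # chat_generator or a larger timeout.
    print(doc["meta"]["file_path"], "->", doc["meta"]["content_extraction_error"])
```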