---
title: Extractors
id: extractors-api
description: Extracts predefined entities out of a piece of text.
slug: "/extractors-api"
---

<a id="named_entity_extractor"></a>

# Module named\_entity\_extractor

<a id="named_entity_extractor.NamedEntityExtractorBackend"></a>

## NamedEntityExtractorBackend

NLP backend to use for Named Entity Recognition.

<a id="named_entity_extractor.NamedEntityExtractorBackend.HUGGING_FACE"></a>

#### HUGGING\_FACE

Uses a Hugging Face model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.SPACY"></a>

#### SPACY

Uses a spaCy model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.from_str"></a>

#### NamedEntityExtractorBackend.from\_str

```python
@staticmethod
def from_str(string: str) -> "NamedEntityExtractorBackend"
```

Convert a string to a NamedEntityExtractorBackend enum.

<a id="named_entity_extractor.NamedEntityAnnotation"></a>

## NamedEntityAnnotation

Describes a single NER annotation.

**Arguments**:

- `entity`: Entity label.
- `start`: Start index of the entity in the document.
- `end`: End index of the entity in the document.
- `score`: Score calculated by the model.

<a id="named_entity_extractor.NamedEntityExtractor"></a>

## NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The former can be used with any sequence classification model from the [Hugging Face model hub](https://huggingface.co/models), while the latter can be used with any [spaCy model](https://spacy.io/models) that contains an NER component. Annotations are stored as metadata in the documents.

Usage example:
```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```

<a id="named_entity_extractor.NamedEntityExtractor.__init__"></a>

#### NamedEntityExtractor.\_\_init\_\_

```python
def __init__(
    *,
    backend: Union[str, NamedEntityExtractorBackend],
    model: str,
    pipeline_kwargs: Optional[dict[str, Any]] = None,
    device: Optional[ComponentDevice] = None,
    token: Optional[Secret] = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"], strict=False)
) -> None
```

Create a Named Entity extractor component.

**Arguments**:

- `backend`: Backend to use for NER.
- `model`: Name of the model or a path to the model on the local disk. Dependent on the backend.
- `pipeline_kwargs`: Keyword arguments passed to the pipeline. The pipeline can override these arguments. Dependent on the backend.
- `device`: The device on which the model is loaded. If `None`, the default device is automatically selected. If a device/device map is specified in `pipeline_kwargs`, it overrides this parameter (only applicable to the Hugging Face backend).
- `token`: The API token to download private models from Hugging Face.

<a id="named_entity_extractor.NamedEntityExtractor.warm_up"></a>

#### NamedEntityExtractor.warm\_up

```python
def warm_up()
```

Initialize the component.

**Raises**:

- `ComponentError`: If the backend fails to initialize successfully.

<a id="named_entity_extractor.NamedEntityExtractor.run"></a>

#### NamedEntityExtractor.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Arguments**:

- `documents`: Documents to process.
- `batch_size`: Batch size used for processing the documents.

**Raises**:

- `ComponentError`: If the backend fails to process a document.

**Returns**:

Processed documents.

<a id="named_entity_extractor.NamedEntityExtractor.to_dict"></a>

#### NamedEntityExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="named_entity_extractor.NamedEntityExtractor.from_dict"></a>

#### NamedEntityExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "NamedEntityExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="named_entity_extractor.NamedEntityExtractor.initialized"></a>

#### NamedEntityExtractor.initialized

```python
@property
def initialized() -> bool
```

Returns whether the extractor is ready to annotate text.
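
Each stored annotation carries the score calculated by the model, which makes simple post-filtering straightforward. A minimal sketch of score-based filtering, using a stand-in dataclass rather than the real `NamedEntityAnnotation` class (the field names mirror the ones documented above):

```python
from dataclasses import dataclass


# Stand-in for NamedEntityAnnotation: entity label, character offsets,
# and the model's confidence score. Illustrative only.
@dataclass
class Annotation:
    entity: str
    start: int
    end: int
    score: float


def filter_by_score(annotations: list[Annotation], threshold: float = 0.8) -> list[Annotation]:
    """Keep only annotations the model is sufficiently confident about."""
    return [a for a in annotations if a.score >= threshold]


anns = [
    Annotation("PER", 4, 10, 0.99),
    Annotation("LOC", 31, 39, 0.42),
]
high_confidence = filter_by_score(anns)
print([a.entity for a in high_confidence])  # only the high-confidence entity remains
```

The same pattern applies to the annotations returned by `get_stored_annotations`, since they expose an equivalent `score` attribute.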

<a id="named_entity_extractor.NamedEntityExtractor.get_stored_annotations"></a>

#### NamedEntityExtractor.get\_stored\_annotations

```python
@classmethod
def get_stored_annotations(
    cls, document: Document) -> Optional[list[NamedEntityAnnotation]]
```

Returns the document's named entity annotations stored in its metadata, if any.

**Arguments**:

- `document`: Document whose annotations are to be fetched.

**Returns**:

The stored annotations.

<a id="llm_metadata_extractor"></a>

# Module llm\_metadata\_extractor

<a id="llm_metadata_extractor.LLMMetadataExtractor"></a>

## LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).

The metadata is extracted by providing a prompt to an LLM that generates the metadata.

This component expects as input a list of documents and a prompt. The prompt must have a variable called `document` that points to a single document in the list. To access the content of the document, use `{{ document.content }}` in the prompt.

The component runs the LLM on each document in the list and extracts metadata from it. The metadata is added to the document's metadata field. If the LLM fails to extract metadata from a document, that document is added to the `failed_documents` list. Failed documents have the keys `metadata_extraction_error` and `metadata_extraction_response` in their metadata. These documents can be re-run with another extractor that uses `metadata_extraction_response` and `metadata_extraction_error` in its prompt.

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc.
Now newer digital issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library")
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {"type": "json_object"},
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.warm_up()
extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
    meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
    {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
    meta: {'entities': [
    {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
    {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
    ]})
    ]
    'failed_documents': []
    }
>>
```

<a id="llm_metadata_extractor.LLMMetadataExtractor.__init__"></a>

#### LLMMetadataExtractor.\_\_init\_\_

```python
def __init__(prompt: str,
             chat_generator: ChatGenerator,
             expected_keys: Optional[list[str]] = None,
             page_range: Optional[list[Union[str, int]]] = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initializes the LLMMetadataExtractor.

**Arguments**:

- `prompt`: The prompt to be used for the LLM.
- `chat_generator`: A ChatGenerator instance which represents the LLM. For the component to work, the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- `expected_keys`: The keys expected in the JSON output from the LLM.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract metadata from the first and third pages of each document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12. If None, metadata will be extracted from the entire document for each document in the documents list. This parameter is optional and can be overridden in the `run` method.
- `raise_on_failure`: Whether to raise an error on failure during the execution of the Generator or validation of the JSON output.
- `max_workers`: The maximum number of workers to use in the thread pool executor.
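
The printable range notation accepted by `page_range` can be read as follows. This hypothetical `expand_page_range` helper only illustrates the documented behavior; it is not the library's internal implementation:

```python
from typing import Union


def expand_page_range(page_range: list[Union[str, int]]) -> list[int]:
    """Expand entries like '1-3' or '5' (or plain ints) into a flat list of page numbers."""
    pages: list[int] = []
    for entry in page_range:
        text = str(entry)
        if "-" in text:
            start, end = text.split("-")
            pages.extend(range(int(start), int(end) + 1))  # inclusive range, e.g. '1-3' -> 1, 2, 3
        else:
            pages.append(int(text))
    return pages


print(expand_page_range(["1-3", "5", "8", "10-12"]))
# [1, 2, 3, 5, 8, 10, 11, 12] — matching the example in the docstring above
```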

<a id="llm_metadata_extractor.LLMMetadataExtractor.warm_up"></a>

#### LLMMetadataExtractor.warm\_up

```python
def warm_up()
```

Warm up the LLM provider component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.to_dict"></a>

#### LLMMetadataExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="llm_metadata_extractor.LLMMetadataExtractor.from_dict"></a>

#### LLMMetadataExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMMetadataExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.run"></a>

#### LLMMetadataExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document],
        page_range: Optional[list[Union[str, int]]] = None)
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, this component splits the documents into pages and extracts metadata only from the specified range of pages. If `page_range` is not provided, metadata is extracted from the entire document.

The original documents are returned updated with the extracted metadata.

**Arguments**:

- `documents`: List of documents to extract metadata from.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract metadata from the first and third pages of each document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12. If None, metadata will be extracted from the entire document for each document in the documents list.

**Returns**:

A dictionary with the keys:
- "documents": A list of documents that were successfully updated with the extracted metadata.
- "failed_documents": A list of documents for which metadata extraction failed. These documents will have "metadata_extraction_error" and "metadata_extraction_response" in their metadata. They can be re-run with the extractor to extract metadata.

<a id="image/llm_document_content_extractor"></a>

# Module image/llm\_document\_content\_extractor

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor"></a>

## LLMDocumentContentExtractor

Extracts textual content from image-based documents using a vision-enabled LLM (Large Language Model).

This component converts each input document into an image using the DocumentToImageContent component, uses a prompt to instruct the LLM on how to extract content, and uses a ChatGenerator to extract structured textual content based on the provided prompt.

The prompt must not contain variables; it should only include instructions for the LLM. Image data and the prompt are passed together to the LLM as a chat message.

Documents for which the LLM fails to extract content are returned in a separate `failed_documents` list. These failed documents will have a `content_extraction_error` entry in their metadata. This metadata can be used for debugging or for reprocessing the documents later.

### Usage example
```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

chat_generator = OpenAIChatGenerator()
extractor = LLMDocumentContentExtractor(chat_generator=chat_generator)
documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]
updated_documents = extractor.run(documents=documents)["documents"]
print(updated_documents)
# [Document(content='Extracted text from image.jpg',
#  meta={'file_path': 'image.jpg'}),
#  ...]
```

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.__init__"></a>

#### LLMDocumentContentExtractor.\_\_init\_\_

```python
def __init__(*,
             chat_generator: ChatGenerator,
             prompt: str = DEFAULT_PROMPT_TEMPLATE,
             file_path_meta_field: str = "file_path",
             root_path: Optional[str] = None,
             detail: Optional[Literal["auto", "high", "low"]] = None,
             size: Optional[tuple[int, int]] = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initialize the LLMDocumentContentExtractor component.

**Arguments**:

- `chat_generator`: A ChatGenerator instance representing the LLM used to extract text. This generator must support vision-based input and return a plain text response.
- `prompt`: Instructional text provided to the LLM. It must not contain Jinja variables. The prompt should only contain instructions on how to extract the content of the image-based document.
- `file_path_meta_field`: The metadata field in the Document that contains the file path to the image or PDF.
- `root_path`: The root directory path where document files are located. If provided, file paths in document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- `detail`: Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low". This will be passed to chat_generator when processing the images.
- `size`: If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial when working with models that have resolution constraints or when transmitting images to remote services.
- `raise_on_failure`: If True, exceptions from the LLM are raised. If False, failed documents are logged and returned.
- `max_workers`: Maximum number of threads used to parallelize LLM calls across documents using a ThreadPoolExecutor.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.warm_up"></a>

#### LLMDocumentContentExtractor.warm\_up

```python
def warm_up()
```

Warm up the ChatGenerator if it has a warm_up method.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.to_dict"></a>

#### LLMDocumentContentExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.from_dict"></a>

#### LLMDocumentContentExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMDocumentContentExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.
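
The `to_dict`/`from_dict` pair above follows a serialize-then-restore contract. A schematic sketch of that round trip with a toy component; the dictionary layout here is only an illustration, not Haystack's actual serialization format:

```python
from typing import Any


# Toy component illustrating the to_dict/from_dict round-trip contract.
# The real extractors serialize through Haystack's own helpers.
class ToyExtractor:
    def __init__(self, file_path_meta_field: str = "file_path", max_workers: int = 3):
        self.file_path_meta_field = file_path_meta_field
        self.max_workers = max_workers

    def to_dict(self) -> dict[str, Any]:
        # Capture everything needed to reconstruct the component.
        return {
            "type": "ToyExtractor",
            "init_parameters": {
                "file_path_meta_field": self.file_path_meta_field,
                "max_workers": self.max_workers,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ToyExtractor":
        # Rebuild the component from its serialized init parameters.
        return cls(**data["init_parameters"])


restored = ToyExtractor.from_dict(ToyExtractor(max_workers=5).to_dict())
print(restored.max_workers)  # 5
```

This contract is what lets pipelines containing these components be saved to YAML and loaded back.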

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.run"></a>

#### LLMDocumentContentExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Run content extraction on a list of image-based documents using a vision-capable LLM.

Each document is passed to the LLM along with a predefined prompt. The response is used to update the document's content. If the extraction fails, the document is returned in the `failed_documents` list with metadata describing the failure.

**Arguments**:

- `documents`: A list of image-based documents to process. Each must have a valid file path in its metadata.

**Returns**:

A dictionary with:
- "documents": Successfully processed documents, updated with extracted content.
- "failed_documents": Documents that failed processing, annotated with failure metadata.
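
A common pattern with both LLM extractors is to route `failed_documents` into inspection or a retry loop. A sketch with plain dicts standing in for Document objects (the dict layout is illustrative; only the `content_extraction_error` metadata key comes from the docs above):

```python
def split_by_failure(result: dict) -> tuple[list, list]:
    """Separate successfully processed documents from failed ones, as run() returns them."""
    return result["documents"], result["failed_documents"]


# Simulated run() output: one success, one failure annotated with error metadata.
result = {
    "documents": [
        {"content": "extracted text", "meta": {"file_path": "image.jpg"}},
    ],
    "failed_documents": [
        {"content": "", "meta": {"file_path": "scan.pdf", "content_extraction_error": "timeout"}},
    ],
}

ok, failed = split_by_failure(result)
for doc in failed:
    # Log the failure reason, then e.g. re-run these documents with a different
    # chat_generator or a larger timeout.
    print(doc["meta"]["file_path"], "->", doc["meta"]["content_extraction_error"])
```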