---
title: "Extractors"
id: extractors-api
description: "Components to extract specific elements from textual data."
slug: "/extractors-api"
---

<a id="image/llm_document_content_extractor"></a>

## Module image/llm\_document\_content\_extractor

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor"></a>

### LLMDocumentContentExtractor

Extracts textual content from image-based documents using a vision-enabled LLM (Large Language Model).

This component converts each input document into an image using the DocumentToImageContent component,
uses a prompt to instruct the LLM on how to extract content, and uses a ChatGenerator to extract structured
textual content based on the provided prompt.

The prompt must not contain variables; it should only include instructions for the LLM. Image data and the prompt
are passed together to the LLM as a chat message.

Documents for which the LLM fails to extract content are returned in a separate `failed_documents` list. These
failed documents will have a `content_extraction_error` entry in their metadata. This metadata can be used for
debugging or for reprocessing the documents later.

### Usage example

```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

chat_generator = OpenAIChatGenerator()
extractor = LLMDocumentContentExtractor(chat_generator=chat_generator)
documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]
updated_documents = extractor.run(documents=documents)["documents"]
print(updated_documents)
# [Document(content='Extracted text from image.jpg',
#           meta={'file_path': 'image.jpg'}),
#  ...]
```

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.__init__"></a>

#### LLMDocumentContentExtractor.\_\_init\_\_

```python
def __init__(*,
             chat_generator: ChatGenerator,
             prompt: str = DEFAULT_PROMPT_TEMPLATE,
             file_path_meta_field: str = "file_path",
             root_path: str | None = None,
             detail: Literal["auto", "high", "low"] | None = None,
             size: tuple[int, int] | None = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initialize the LLMDocumentContentExtractor component.

**Arguments**:

- `chat_generator`: A ChatGenerator instance representing the LLM used to extract text. This generator must
support vision-based input and return a plain text response.
- `prompt`: Instructional text provided to the LLM. It must not contain Jinja variables.
The prompt should only contain instructions on how to extract the content of the image-based document.
- `file_path_meta_field`: The metadata field in the Document that contains the file path to the image or PDF.
- `root_path`: The root directory path where document files are located. If provided, file paths in
document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- `detail`: Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
This will be passed to chat_generator when processing the images.
- `size`: If provided, resizes the image to fit within the specified dimensions (width, height) while
maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
when working with models that have resolution constraints or when transmitting images to remote services.
- `raise_on_failure`: If True, exceptions from the LLM are raised. If False, failed documents are logged
and returned.
- `max_workers`: Maximum number of threads used to parallelize LLM calls across documents using a
ThreadPoolExecutor.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.warm_up"></a>

#### LLMDocumentContentExtractor.warm\_up

```python
def warm_up()
```

Warm up the ChatGenerator if it has a warm_up method.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.to_dict"></a>

#### LLMDocumentContentExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.from_dict"></a>

#### LLMDocumentContentExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMDocumentContentExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.run"></a>

#### LLMDocumentContentExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Run content extraction on a list of image-based documents using a vision-capable LLM.

Each document is passed to the LLM along with a predefined prompt. The response is used to update the document's
content. If the extraction fails, the document is returned in the `failed_documents` list with metadata
describing the failure.

**Arguments**:

- `documents`: A list of image-based documents to process. Each must have a valid file path in its metadata.
**Returns**:

A dictionary with:
- "documents": Successfully processed documents, updated with extracted content.
- "failed_documents": Documents that failed processing, annotated with failure metadata.

<a id="llm_metadata_extractor"></a>

## Module llm\_metadata\_extractor

<a id="llm_metadata_extractor.LLMMetadataExtractor"></a>

### LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).

The metadata is extracted by providing a prompt to an LLM that generates the metadata.

This component expects as input a list of documents and a prompt. The prompt should have a variable called
`document` that will point to a single document in the list of documents. So to access the content of the document,
you can use `{{ document.content }}` in the prompt.

The component will run the LLM on each document in the list and extract metadata from the document. The metadata
will be added to the document's metadata field. If the LLM fails to extract metadata from a document, the document
will be added to the `failed_documents` list. The failed documents will have the keys `metadata_extraction_error` and
`metadata_extraction_response` in their metadata. These documents can be re-run with another extractor to
extract metadata by using the `metadata_extraction_response` and `metadata_extraction_error` in the prompt.

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in steps 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library")
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_completion_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "entities": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "entity": {"type": "string"},
                                    "entity_type": {"type": "string"}
                                },
                                "required": ["entity", "entity_type"],
                                "additionalProperties": False
                            }
                        }
                    },
                    "required": ["entities"],
                    "additionalProperties": False
                }
            }
        },
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.warm_up()

extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
             meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
                    {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
             meta: {'entities': [
                    {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
                    {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
             ]})
   ]
   'failed_documents': []
   }
>>
```

<a id="llm_metadata_extractor.LLMMetadataExtractor.__init__"></a>

#### LLMMetadataExtractor.\_\_init\_\_

```python
def __init__(prompt: str,
             chat_generator: ChatGenerator,
             expected_keys: list[str] | None = None,
             page_range: list[str | int] | None = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initializes the LLMMetadataExtractor.

**Arguments**:

- `prompt`: The prompt to be used for the LLM.
- `chat_generator`: a ChatGenerator instance which represents the LLM. In order for the component to work,
the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, you
should pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- `expected_keys`: The keys expected in the JSON output from the LLM.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
metadata from the first and third pages of each document. It also accepts printable range strings, e.g.:
['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
If None, metadata will be extracted from the entire document for each document in the documents list.
This parameter is optional and can be overridden in the `run` method.
- `raise_on_failure`: Whether to raise an error on failure during the execution of the Generator or
validation of the JSON output.
- `max_workers`: The maximum number of workers to use in the thread pool executor.

<a id="llm_metadata_extractor.LLMMetadataExtractor.warm_up"></a>

#### LLMMetadataExtractor.warm\_up

```python
def warm_up()
```

Warm up the LLM provider component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.to_dict"></a>

#### LLMMetadataExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="llm_metadata_extractor.LLMMetadataExtractor.from_dict"></a>

#### LLMMetadataExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMMetadataExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.run"></a>

#### LLMMetadataExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document], page_range: list[str | int] | None = None)
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, the metadata will be extracted from the specified range of pages. This component
will split the documents into pages and extract metadata from the specified range of pages.
The metadata will be
extracted from the entire document if `page_range` is not provided.

The original documents will be returned updated with the extracted metadata.

**Arguments**:

- `documents`: List of documents to extract metadata from.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
metadata from the first and third pages of each document. It also accepts printable range strings, e.g.:
['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
If None, metadata will be extracted from the entire document for each document in the documents list.

**Returns**:

A dictionary with the keys:
- "documents": A list of documents that were successfully updated with the extracted metadata.
- "failed_documents": A list of documents that failed to extract metadata. These documents will have
"metadata_extraction_error" and "metadata_extraction_response" in their metadata. These documents can be
re-run with the extractor to extract metadata.

<a id="named_entity_extractor"></a>

## Module named\_entity\_extractor

<a id="named_entity_extractor.NamedEntityExtractorBackend"></a>

### NamedEntityExtractorBackend

NLP backend to use for Named Entity Recognition.

<a id="named_entity_extractor.NamedEntityExtractorBackend.HUGGING_FACE"></a>

#### HUGGING\_FACE

Uses a Hugging Face model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.SPACY"></a>

#### SPACY

Uses a spaCy model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.from_str"></a>

#### NamedEntityExtractorBackend.from\_str

```python
@staticmethod
def from_str(string: str) -> "NamedEntityExtractorBackend"
```

Convert a string to a NamedEntityExtractorBackend enum.
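As a rough illustration of what such a conversion does, here is a minimal pure-Python sketch of the string-to-enum pattern. The case normalization and error message are assumptions for the sketch, not the actual Haystack implementation; the backend names mirror the `"hugging_face"` value used in the usage example below.

```python
from enum import Enum


class Backend(Enum):
    HUGGING_FACE = "hugging_face"
    SPACY = "spacy"

    @staticmethod
    def from_str(string: str) -> "Backend":
        # Normalize the string and map it onto an enum member,
        # failing loudly on unknown backend names.
        try:
            return Backend(string.lower())
        except ValueError as err:
            raise ValueError(f"Unknown NER backend '{string}'") from err


print(Backend.from_str("spacy"))  # Backend.SPACY
```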
<a id="named_entity_extractor.NamedEntityAnnotation"></a>

### NamedEntityAnnotation

Describes a single NER annotation.

**Arguments**:

- `entity`: Entity label.
- `start`: Start index of the entity in the document.
- `end`: End index of the entity in the document.
- `score`: Score calculated by the model.

<a id="named_entity_extractor.NamedEntityExtractor"></a>

### NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The
former can be used with any sequence classification model from the
[Hugging Face model hub](https://huggingface.co/models), while the
latter can be used with any [spaCy model](https://spacy.io/models)
that contains an NER component. Annotations are stored as metadata
in the documents.

Usage example:
```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```

<a id="named_entity_extractor.NamedEntityExtractor.__init__"></a>

#### NamedEntityExtractor.\_\_init\_\_

```python
def __init__(
    *,
    backend: str | NamedEntityExtractorBackend,
    model: str,
    pipeline_kwargs: dict[str, Any] | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"],
                                               strict=False)
) -> None
```

Create a Named Entity extractor
component.

**Arguments**:

- `backend`: Backend to use for NER.
- `model`: Name of the model or a path to the model on
the local disk. Dependent on the backend.
- `pipeline_kwargs`: Keyword arguments passed to the pipeline. The
pipeline can override these arguments. Dependent on the backend.
- `device`: The device on which the model is loaded. If `None`,
the default device is automatically selected. If a
device/device map is specified in `pipeline_kwargs`,
it overrides this parameter (only applicable to the
HuggingFace backend).
- `token`: The API token to download private models from Hugging Face.

<a id="named_entity_extractor.NamedEntityExtractor.warm_up"></a>

#### NamedEntityExtractor.warm\_up

```python
def warm_up()
```

Initialize the component.

**Raises**:

- `ComponentError`: If the backend fails to initialize successfully.

<a id="named_entity_extractor.NamedEntityExtractor.run"></a>

#### NamedEntityExtractor.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Arguments**:

- `documents`: Documents to process.
- `batch_size`: Batch size used for processing the documents.

**Raises**:

- `ComponentError`: If the backend fails to process a document.

**Returns**:

Processed documents.

<a id="named_entity_extractor.NamedEntityExtractor.to_dict"></a>

#### NamedEntityExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.
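The `to_dict`/`from_dict` pairs documented throughout this page follow a common round-trip pattern: serialize the constructor parameters, then rebuild the component from them. A minimal pure-Python sketch of that pattern follows; the `"type"` and `"init_parameters"` field names are illustrative assumptions for the sketch, not Haystack's guaranteed serialization schema.

```python
class ToyExtractor:
    """Stand-in component used only to illustrate the round-trip pattern."""

    def __init__(self, pattern: str):
        self.pattern = pattern

    def to_dict(self) -> dict:
        # Record the import path plus everything needed to call __init__ again.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"pattern": self.pattern},
        }

    @classmethod
    def from_dict(cls, data: dict) -> "ToyExtractor":
        return cls(**data["init_parameters"])


original = ToyExtractor(pattern=r'<issue url="(.+)">')
restored = ToyExtractor.from_dict(original.to_dict())
assert restored.pattern == original.pattern
```

Because the dictionary contains only constructor inputs, it can be written to YAML or JSON and used to recreate the component later, which is how serialized pipelines are typically persisted.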
<a id="named_entity_extractor.NamedEntityExtractor.from_dict"></a>

#### NamedEntityExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "NamedEntityExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="named_entity_extractor.NamedEntityExtractor.initialized"></a>

#### NamedEntityExtractor.initialized

```python
@property
def initialized() -> bool
```

Returns whether the extractor is ready to annotate text.

<a id="named_entity_extractor.NamedEntityExtractor.get_stored_annotations"></a>

#### NamedEntityExtractor.get\_stored\_annotations

```python
@classmethod
def get_stored_annotations(
        cls, document: Document) -> list[NamedEntityAnnotation] | None
```

Returns the document's named entity annotations stored in its metadata, if any.

**Arguments**:

- `document`: Document whose annotations are to be fetched.

**Returns**:

The stored annotations.

<a id="regex_text_extractor"></a>

## Module regex\_text\_extractor

<a id="regex_text_extractor.RegexTextExtractor"></a>

### RegexTextExtractor

Extracts text from chat message or string input using a regex pattern.

RegexTextExtractor parses input text or ChatMessages using a provided regular expression pattern.
It can be configured to search through all messages or only the last message in a list of ChatMessages.
### Usage example

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

# Using with a string
parser = RegexTextExtractor(regex_pattern='<issue url="(.+)">')
result = parser.run(text_or_messages='<issue url="github.com/hahahaha">hahahah</issue>')
# result: {"captured_text": "github.com/hahahaha"}

# Using with ChatMessages
messages = [ChatMessage.from_user('<issue url="github.com/hahahaha">hahahah</issue>')]
result = parser.run(text_or_messages=messages)
# result: {"captured_text": "github.com/hahahaha"}
```

<a id="regex_text_extractor.RegexTextExtractor.__init__"></a>

#### RegexTextExtractor.\_\_init\_\_

```python
def __init__(regex_pattern: str)
```

Creates an instance of the RegexTextExtractor component.

**Arguments**:

- `regex_pattern`: The regular expression pattern used to extract text.
The pattern should include a capture group to extract the desired text.
Example: `'<issue url="(.+)">'` captures `'github.com/hahahaha'` from `'<issue url="github.com/hahahaha">'`.

<a id="regex_text_extractor.RegexTextExtractor.to_dict"></a>

#### RegexTextExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="regex_text_extractor.RegexTextExtractor.from_dict"></a>

#### RegexTextExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "RegexTextExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.
<a id="regex_text_extractor.RegexTextExtractor.run"></a>

#### RegexTextExtractor.run

```python
@component.output_types(captured_text=str)
def run(text_or_messages: str | list[ChatMessage]) -> dict[str, str]
```

Extracts text from input using the configured regex pattern.

**Arguments**:

- `text_or_messages`: Either a string or a list of ChatMessage objects to search through.

**Raises**:

- `ValueError`: If a list is received and its last element is not a ChatMessage instance.

**Returns**:

- `{"captured_text": "matched text"}` if a match is found
- `{"captured_text": ""}` if no match is found
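The return contract above can be reproduced with the standard `re` module. The following is a minimal sketch of the capture behavior, assuming `re.search` semantics with the first capture group as the extracted text; the component's internals may differ.

```python
import re


def capture(pattern: str, text: str) -> dict:
    # Mirror the documented contract: first capture group on a match,
    # empty string when nothing matches.
    match = re.search(pattern, text)
    return {"captured_text": match.group(1) if match else ""}


print(capture(r'<issue url="(.+)">', '<issue url="github.com/hahahaha">hahahah</issue>'))
# {'captured_text': 'github.com/hahahaha'}
print(capture(r'<issue url="(.+)">', "no markup here"))
# {'captured_text': ''}
```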