---
title: "Extractors"
id: extractors-api
description: "Components to extract specific elements from textual data."
slug: "/extractors-api"
---

## image/llm_document_content_extractor

### LLMDocumentContentExtractor

Extracts textual content and, optionally, metadata from image-based documents using a vision-enabled LLM.

One prompt and one LLM call per document. The component converts each document to an image via
DocumentToImageContent and sends it to the ChatGenerator. The prompt must not contain Jinja variables.

Response handling:

- If the LLM returns a **plain string** (non-JSON or not a JSON object), it is written to the document's content.
- If the LLM returns a **JSON object with only the key** `document_content`, that value is written to content.
- If the LLM returns a **JSON object with multiple keys**, the value of `document_content` (if present) is
  written to content and all other keys are merged into the document's metadata.

The ChatGenerator can be configured to return JSON (e.g. `response_format={"type": "json_object"}`
in `generation_kwargs`).

Documents that fail extraction are returned in `failed_documents` with a `content_extraction_error` key in their metadata.

### Usage example

```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

prompt = """
Extract the content from the provided image and format it as markdown.
Return only the extracted content as a JSON object with the key 'document_content'.
Do not wrap the output in markdown fences or code blocks; return only raw JSON.

Extract metadata about the image, such as its source and date of creation, if you can.
Return this metadata as additional key-value pairs in the same JSON object.
"""

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "document_content": {"type": "string"},
                        "author": {"type": "string"},
                        "date": {"type": "string"},
                        "document_type": {"type": "string"},
                        "title": {"type": "string"},
                    },
                    "additionalProperties": False,
                },
            },
        }
    }
)

extractor = LLMDocumentContentExtractor(
    chat_generator=chat_generator,
    file_path_meta_field="file_path",
    raise_on_failure=False,
)

documents = [
    Document(content="", meta={"file_path": "test/test_files/images/image_metadata.png"}),
    Document(content="", meta={"file_path": "test/test_files/images/apple.jpg", "page_number": 1}),
]
result = extractor.run(documents=documents)
updated_documents = result["documents"]
```
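Documents that could not be processed can be inspected through their failure metadata. A minimal sketch, continuing the example above (`result` and the `file_path` metadata field come from that example):

```python
# Continuing the example above: report which files failed and why.
# `content_extraction_error` is added to metadata by the component on failure.
for doc in result["failed_documents"]:
    print(doc.meta["file_path"], "->", doc.meta["content_extraction_error"])
```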
#### __init__

```python
__init__(
    *,
    chat_generator: ChatGenerator,
    prompt: str = DEFAULT_PROMPT_TEMPLATE,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3
) -> None
```

Initialize the LLMDocumentContentExtractor component.

**Parameters:**

- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator that supports vision input. Optionally configured for JSON
  output (e.g. `response_format={"type": "json_object"}` in `generation_kwargs`).
- **prompt** (<code>str</code>) – Prompt for extraction. Must not contain Jinja variables.
- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata are resolved relative to this path. If None, file paths are treated as absolute.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within (width, height) while keeping the aspect ratio.
- **raise_on_failure** (<code>bool</code>) – If True, exceptions from the LLM are raised. If False, failed documents are returned.
- **max_workers** (<code>int</code>) – Maximum number of threads for parallel LLM calls.

#### warm_up

```python
warm_up() -> None
```

Warm up the ChatGenerator if it has a warm_up method.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMDocumentContentExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMDocumentContentExtractor</code> – An instance of the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Run extraction on image-based documents. One LLM call per document.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of image-based documents to process. Each must have a valid file path in its metadata.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with "documents" (successfully processed) and "failed_documents" (with failure metadata).

## llm_metadata_extractor

### LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).

The metadata is extracted by providing a prompt to an LLM that generates it.

This component expects a list of documents and a prompt as input. The prompt must contain a variable called
`document`, which points to a single document in the list; use `{{ document.content }}` in the prompt to access
the document's content.

The component runs the LLM on each document in the list and adds the extracted metadata to the document's
metadata field. Documents for which extraction fails are added to the `failed_documents` list, with the keys
`metadata_extraction_error` and `metadata_extraction_response` in their metadata. These documents can be re-run
with another extractor that uses `metadata_extraction_response` and `metadata_extraction_error` in its prompt.
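Because failed documents carry both keys, a follow-up extractor can show the LLM what went wrong. A minimal sketch (the retry prompt wording is illustrative; `extractor`, `chat_generator`, and `docs` come from the usage example below):

```python
# Illustrative retry: expose the previous error to the LLM via the document's
# metadata (Jinja resolves document.meta.metadata_extraction_error).
RETRY_PROMPT = """
Extract metadata from the text below and return it as a JSON object.
A previous attempt failed with this error: {{ document.meta.metadata_extraction_error }}
Return only valid JSON this time.

text: {{ document.content }}
"""

retry_extractor = LLMMetadataExtractor(prompt=RETRY_PROMPT, chat_generator=chat_generator)
failed = extractor.run(documents=docs)["failed_documents"]
retried = retry_extractor.run(documents=failed)
```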
### Usage example

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON object like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library"),
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_completion_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "entities": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "entity": {"type": "string"},
                                    "entity_type": {"type": "string"},
                                },
                                "required": ["entity", "entity_type"],
                                "additionalProperties": False,
                            },
                        }
                    },
                    "required": ["entities"],
                    "additionalProperties": False,
                },
            },
        },
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.run(documents=docs)
# >> {'documents': [
#      Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
#        meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
#               {'entity': 'Haystack', 'entity_type': 'product'}]}),
#      Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
#        meta: {'entities': [
#               {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
#               {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
#        ]})
#    ],
#    'failed_documents': []
# }
```

#### __init__

```python
__init__(
    prompt: str,
    chat_generator: ChatGenerator,
    expected_keys: list[str] | None = None,
    page_range: list[str | int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3,
) -> None
```

Initializes the LLMMetadataExtractor.

**Parameters:**

- **prompt** (<code>str</code>) – The prompt to be used for the LLM.
- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator instance that represents the LLM. For the component to work,
  the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator,
  pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- **expected_keys** (<code>list\[str\] | None</code>) – The keys expected in the JSON output from the LLM.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, `page_range=['1', '3']` extracts
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.
  `['1-3', '5', '8', '10-12']` extracts metadata from pages 1, 2, 3, 5, 8, 10, 11, and 12.
  If None, metadata is extracted from the entire document, for each document in the list.
  This parameter is optional and can be overridden in the `run` method.
- **raise_on_failure** (<code>bool</code>) – Whether to raise an error on failure during the execution of the Generator or
  the validation of the JSON output.
- **max_workers** (<code>int</code>) – The maximum number of workers to use in the thread pool executor. This limits the
  number of requests allowed to run concurrently when using the `run_async` method.

#### warm_up

```python
warm_up() -> None
```

Warm up the LLM provider component.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMMetadataExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMMetadataExtractor</code> – An instance of the component.

#### run

```python
run(
    documents: list[Document], page_range: list[str | int] | None = None
) -> dict[str, Any]
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, the component splits each document into pages and extracts metadata only from the
specified range of pages; otherwise, metadata is extracted from the entire document.

The original documents are returned updated with the extracted metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of documents to extract metadata from.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, `page_range=['1', '3']` extracts
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.
  `['1-3', '5', '8', '10-12']` extracts metadata from pages 1, 2, 3, 5, 8, 10, 11, and 12.
  If None, metadata is extracted from the entire document, for each document in the list.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the keys:
  - "documents": A list of documents that were successfully updated with the extracted metadata.
  - "failed_documents": A list of documents for which metadata extraction failed. These documents have
    "metadata_extraction_error" and "metadata_extraction_response" in their metadata and can be re-run with
    the extractor to extract metadata.
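A minimal `page_range` sketch, assuming page breaks are marked with form-feed characters (`\f`), as Haystack converters and splitters produce, and reusing `extractor` and `Document` from the usage example above:

```python
# Hypothetical two-page document; "\f" marks the page break.
report = Document(content="Q1 revenue grew 12% at Acme Corp.\fForward-looking statements follow.")

# Extract metadata from the first page only.
result = extractor.run(documents=[report], page_range=["1"])
```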
#### run_async

```python
run_async(
    documents: list[Document], page_range: list[str | int] | None = None
) -> dict[str, Any]
```

Asynchronously extract metadata from documents using a Large Language Model.

If `page_range` is provided, the component splits each document into pages and extracts metadata only from the
specified range of pages; otherwise, metadata is extracted from the entire document.

The original documents are returned updated with the extracted metadata.

This is the asynchronous version of the `run` method. It has the same parameters and return values
but can be used with `await` in async code.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of documents to extract metadata from.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, `page_range=['1', '3']` extracts
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.
  `['1-3', '5', '8', '10-12']` extracts metadata from pages 1, 2, 3, 5, 8, 10, 11, and 12.
  If None, metadata is extracted from the entire document, for each document in the list.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the keys:
  - "documents": A list of documents that were successfully updated with the extracted metadata.
  - "failed_documents": A list of documents for which metadata extraction failed. These documents have
    "metadata_extraction_error" and "metadata_extraction_response" in their metadata and can be re-run with
    the extractor to extract metadata.

## named_entity_extractor

### NamedEntityExtractorBackend

Bases: <code>Enum</code>

NLP backend to use for Named Entity Recognition.

#### from_str

```python
from_str(string: str) -> NamedEntityExtractorBackend
```

Convert a string to a NamedEntityExtractorBackend enum.

### NamedEntityAnnotation

Describes a single NER annotation.

**Parameters:**

- **entity** (<code>str</code>) – Entity label.
- **start** (<code>int</code>) – Start index of the entity in the document.
- **end** (<code>int</code>) – End index of the entity in the document.
- **score** (<code>float | None</code>) – Score calculated by the model.

### NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The former can be used with any sequence
classification model from the [Hugging Face model hub](https://huggingface.co/models), while the latter can be
used with any [spaCy model](https://spacy.io/models) that contains an NER component. Annotations are stored as
metadata in the documents.

Usage example:

<!-- test-ignore -->

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```
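The spaCy backend is selected the same way. A minimal sketch, assuming the `en_core_web_sm` model (which ships with an NER component) is installed:

```python
# Same API with the spaCy backend; install the model first with:
#   python -m spacy download en_core_web_sm
extractor = NamedEntityExtractor(backend="spacy", model="en_core_web_sm")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
```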
#### __init__

```python
__init__(
    *,
    backend: str | NamedEntityExtractorBackend,
    model: str,
    pipeline_kwargs: dict[str, Any] | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    )
) -> None
```

Create a Named Entity extractor component.

**Parameters:**

- **backend** (<code>str | NamedEntityExtractorBackend</code>) – Backend to use for NER.
- **model** (<code>str</code>) – Name of the model or a path to the model on the local disk. Dependent on the backend.
- **pipeline_kwargs** (<code>dict\[str, Any\] | None</code>) – Keyword arguments passed to the pipeline. The
  pipeline can override these arguments. Dependent on the backend.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`,
  the default device is automatically selected. If a device/device map is specified in `pipeline_kwargs`,
  it overrides this parameter (only applicable to the HuggingFace backend).
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.

#### warm_up

```python
warm_up() -> None
```

Initialize the component.

**Raises:**

- <code>ComponentError</code> – If the backend fails to initialize successfully.

#### run

```python
run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to process.
- **batch_size** (<code>int</code>) – Batch size used for processing the documents.

**Returns:**

- <code>dict\[str, Any\]</code> – Processed documents.

**Raises:**

- <code>ComponentError</code> – If the backend fails to process a document.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> NamedEntityExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>NamedEntityExtractor</code> – Deserialized component.

#### initialized

```python
initialized: bool
```

Returns whether the extractor is ready to annotate text.

#### get_stored_annotations

```python
get_stored_annotations(
    document: Document,
) -> list[NamedEntityAnnotation] | None
```

Returns the document's named entity annotations stored in its metadata, if any.

**Parameters:**

- **document** (<code>Document</code>) – Document whose annotations are to be fetched.

**Returns:**

- <code>list\[NamedEntityAnnotation\] | None</code> – The stored annotations.
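Since `start` and `end` index into the document's content, the entity surface text can be recovered by slicing. A minimal sketch, continuing the usage example above:

```python
# Continuing the usage example: print each entity label with its surface text.
for doc in results:
    for ann in NamedEntityExtractor.get_stored_annotations(doc) or []:
        print(ann.entity, "->", doc.content[ann.start : ann.end])
```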
## regex_text_extractor

### RegexTextExtractor

Extracts text from chat messages or string input using a regex pattern.

RegexTextExtractor parses input text or ChatMessages using a provided regular expression pattern.
It can be configured to search through all messages or only the last message in a list of ChatMessages.

### Usage example

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

# Using with a string
parser = RegexTextExtractor(regex_pattern='<issue url="(.+)">')
result = parser.run(text_or_messages='<issue url="github.com/hahahaha">hahahah</issue>')
# result: {"captured_text": "github.com/hahahaha"}

# Using with ChatMessages
messages = [ChatMessage.from_user('<issue url="github.com/hahahaha">hahahah</issue>')]
result = parser.run(text_or_messages=messages)
# result: {"captured_text": "github.com/hahahaha"}
```

#### __init__

```python
__init__(regex_pattern: str) -> None
```

Creates an instance of the RegexTextExtractor component.

**Parameters:**

- **regex_pattern** (<code>str</code>) – The regular expression pattern used to extract text.
  The pattern should include a capture group to extract the desired text.
  Example: `'<issue url="(.+)">'` captures `'github.com/hahahaha'` from `'<issue url="github.com/hahahaha">'`.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> RegexTextExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>RegexTextExtractor</code> – The deserialized component.

#### run

```python
run(text_or_messages: str | list[ChatMessage]) -> dict[str, str]
```

Extracts text from the input using the configured regex pattern.

**Parameters:**

- **text_or_messages** (<code>str | list\[ChatMessage\]</code>) – Either a string or a list of ChatMessage objects to search through.

**Returns:**

- <code>dict\[str, str\]</code> – - `{"captured_text": "matched text"}` if a match is found
  - `{"captured_text": ""}` if no match is found

**Raises:**

- <code>TypeError</code> – If the input is a list and its last element is not a ChatMessage instance.
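A common use is pulling a tagged answer out of an LLM reply inside a pipeline. A minimal sketch, assuming an OpenAI API key is configured; the `<answer>` tag convention is illustrative, not part of the component:

```python
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.extractors import RegexTextExtractor
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

# The prompt asks the LLM to wrap its answer in <answer> tags; the extractor's
# capture group then pulls the answer back out of the reply.
template = [ChatMessage.from_user("Answer inside <answer></answer> tags: {{ question }}")]

pipe = Pipeline()
pipe.add_component("builder", ChatPromptBuilder(template=template))
pipe.add_component("llm", OpenAIChatGenerator())
pipe.add_component("extractor", RegexTextExtractor(regex_pattern=r"<answer>(.+?)</answer>"))
pipe.connect("builder.prompt", "llm.messages")
pipe.connect("llm.replies", "extractor.text_or_messages")

out = pipe.run(data={"builder": {"question": "What is the capital of France?"}})
print(out["extractor"]["captured_text"])
```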