---
title: "Extractors"
id: extractors-api
description: "Components to extract specific elements from textual data."
slug: "/extractors-api"
---

<a id="image/llm_document_content_extractor"></a>

## Module image/llm\_document\_content\_extractor

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor"></a>

### LLMDocumentContentExtractor

Extracts textual content from image-based documents using a vision-enabled LLM (Large Language Model).

This component converts each input document into an image using the DocumentToImageContent component,
uses a prompt to instruct the LLM on how to extract content, and uses a ChatGenerator to extract structured
textual content based on the provided prompt.

The prompt must not contain variables; it should only include instructions for the LLM. Image data and the prompt
are passed together to the LLM as a chat message.

Documents for which the LLM fails to extract content are returned in a separate `failed_documents` list. These
failed documents will have a `content_extraction_error` entry in their metadata. This metadata can be used for
debugging or for reprocessing the documents later.

### Usage example

```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

chat_generator = OpenAIChatGenerator()
extractor = LLMDocumentContentExtractor(chat_generator=chat_generator)

documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]

updated_documents = extractor.run(documents=documents)["documents"]
print(updated_documents)
# [Document(content='Extracted text from image.jpg',
#  meta={'file_path': 'image.jpg'}),
#  ...]
```

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.__init__"></a>

#### LLMDocumentContentExtractor.\_\_init\_\_

```python
def __init__(*,
             chat_generator: ChatGenerator,
             prompt: str = DEFAULT_PROMPT_TEMPLATE,
             file_path_meta_field: str = "file_path",
             root_path: str | None = None,
             detail: Literal["auto", "high", "low"] | None = None,
             size: tuple[int, int] | None = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initialize the LLMDocumentContentExtractor component.

**Arguments**:

- `chat_generator`: A ChatGenerator instance representing the LLM used to extract text. This generator must
support vision-based input and return a plain text response.
- `prompt`: Instructional text provided to the LLM. It must not contain Jinja variables.
The prompt should only contain instructions on how to extract the content of the image-based document.
- `file_path_meta_field`: The metadata field in the Document that contains the file path to the image or PDF.
- `root_path`: The root directory path where document files are located. If provided, file paths in
document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- `detail`: Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
This will be passed to chat_generator when processing the images.
- `size`: If provided, resizes the image to fit within the specified dimensions (width, height) while
maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial
when working with models that have resolution constraints or when transmitting images to remote services.
- `raise_on_failure`: If True, exceptions from the LLM are raised. If False, failed documents are logged
and returned.
- `max_workers`: Maximum number of threads used to parallelize LLM calls across documents using a
ThreadPoolExecutor.
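
The parameters above can be combined to control image handling and error behavior, and the `failed_documents`
output can be inspected afterwards. The following is a minimal, illustrative sketch; the prompt text, directory
layout, file names, and parameter values are assumptions rather than shipped defaults:

```python
from haystack import Document
from haystack.components.extractors.image import LLMDocumentContentExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

extractor = LLMDocumentContentExtractor(
    chat_generator=OpenAIChatGenerator(),
    # Assumed custom instruction; it must not contain Jinja variables.
    prompt="Transcribe all readable text in this document image and return it as plain text.",
    file_path_meta_field="file_path",  # metadata key that holds the image/PDF path
    root_path="data/scans",            # assumed project directory; paths in meta are resolved against it
    detail="low",                      # OpenAI-only image detail hint
    size=(1024, 1024),                 # downscale to fit within 1024x1024, keeping the aspect ratio
    raise_on_failure=False,            # collect failures in `failed_documents` instead of raising
    max_workers=4,                     # parallel LLM calls across documents
)

documents = [Document(content="", meta={"file_path": "invoice_001.png"})]  # hypothetical file
result = extractor.run(documents=documents)

# Successfully processed documents have their `content` filled with the extracted text;
# failed documents keep their original content and record the reason for the failure.
for doc in result["failed_documents"]:
    print(doc.meta["file_path"], "->", doc.meta.get("content_extraction_error"))
```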

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.warm_up"></a>

#### LLMDocumentContentExtractor.warm\_up

```python
def warm_up()
```

Warm up the ChatGenerator if it has a warm_up method.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.to_dict"></a>

#### LLMDocumentContentExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.from_dict"></a>

#### LLMDocumentContentExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMDocumentContentExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="image/llm_document_content_extractor.LLMDocumentContentExtractor.run"></a>

#### LLMDocumentContentExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]
```

Run content extraction on a list of image-based documents using a vision-capable LLM.

Each document is passed to the LLM along with a predefined prompt. The response is used to update the document's
content. If the extraction fails, the document is returned in the `failed_documents` list with metadata
describing the failure.

**Arguments**:

- `documents`: A list of image-based documents to process. Each must have a valid file path in its metadata.

**Returns**:

A dictionary with:
- "documents": Successfully processed documents, updated with extracted content.
- "failed_documents": Documents that failed processing, annotated with failure metadata.

<a id="llm_metadata_extractor"></a>

## Module llm\_metadata\_extractor

<a id="llm_metadata_extractor.LLMMetadataExtractor"></a>

### LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).

The metadata is extracted by providing a prompt to an LLM that generates the metadata.

This component expects as input a list of documents and a prompt. The prompt should have a variable called
`document` that will point to a single document in the list of documents. So to access the content of the document,
you can use `{{ document.content }}` in the prompt.

The component will run the LLM on each document in the list and extract metadata from the document. The metadata
will be added to the document's metadata field. If the LLM fails to extract metadata from a document, the document
will be added to the `failed_documents` list. The failed documents will have the keys `metadata_extraction_error`
and `metadata_extraction_response` in their metadata. These documents can be re-run with another extractor to
extract metadata by using the `metadata_extraction_response` and `metadata_extraction_error` in the prompt.

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library")
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_completion_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {"type": "json_object"},
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.warm_up()
extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
    meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
          {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
    meta: {'entities': [
            {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
            {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
        ]})
    ],
    'failed_documents': []
   }
```

<a id="llm_metadata_extractor.LLMMetadataExtractor.__init__"></a>

#### LLMMetadataExtractor.\_\_init\_\_

```python
def __init__(prompt: str,
             chat_generator: ChatGenerator,
             expected_keys: list[str] | None = None,
             page_range: list[str | int] | None = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)
```

Initializes the LLMMetadataExtractor.

**Arguments**:

- `prompt`: The prompt to be used for the LLM.
- `chat_generator`: A ChatGenerator instance which represents the LLM. In order for the component to work,
the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, you
should pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- `expected_keys`: The keys expected in the JSON output from the LLM.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
metadata from the first and third pages of each document. It also accepts printable range strings, e.g.:
['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
If None, metadata will be extracted from the entire document for each document in the documents list.
This parameter is optional and can be overridden in the `run` method.
- `raise_on_failure`: Whether to raise an error on failure during the execution of the Generator or
validation of the JSON output.
- `max_workers`: The maximum number of workers to use in the thread pool executor.
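
The `page_range` option only applies to documents whose content contains page breaks. A small illustrative
sketch, assuming pages separated by form-feed characters (`\f`), which is how Haystack converters and splitters
commonly mark page boundaries; the prompt and expected key are assumptions:

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

# Hypothetical prompt asking for a JSON object with a single "title" key.
TITLE_PROMPT = 'Return a JSON object like {"title": "..."} describing this text: {{ document.content }}'

extractor = LLMMetadataExtractor(
    prompt=TITLE_PROMPT,
    chat_generator=OpenAIChatGenerator(
        generation_kwargs={"response_format": {"type": "json_object"}}
    ),
    expected_keys=["title"],
    page_range=["1-2"],  # only the first two pages of each document are sent to the LLM
)
extractor.warm_up()

# "\f" marks page boundaries, so only the text before the second page break is used.
doc = Document(content="Page one text\fPage two text\fPage three text")
result = extractor.run(documents=[doc])
```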

<a id="llm_metadata_extractor.LLMMetadataExtractor.warm_up"></a>

#### LLMMetadataExtractor.warm\_up

```python
def warm_up()
```

Warm up the LLM provider component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.to_dict"></a>

#### LLMMetadataExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="llm_metadata_extractor.LLMMetadataExtractor.from_dict"></a>

#### LLMMetadataExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMMetadataExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary with serialized data.

**Returns**:

An instance of the component.

<a id="llm_metadata_extractor.LLMMetadataExtractor.run"></a>

#### LLMMetadataExtractor.run

```python
@component.output_types(documents=list[Document],
                        failed_documents=list[Document])
def run(documents: list[Document], page_range: list[str | int] | None = None)
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, the component splits each document into pages and extracts metadata only from the
specified range of pages. If `page_range` is not provided, metadata is extracted from the entire document.

The original documents are returned updated with the extracted metadata.

**Arguments**:

- `documents`: List of documents to extract metadata from.
- `page_range`: A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
metadata from the first and third pages of each document. It also accepts printable range strings, e.g.:
['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
If None, metadata will be extracted from the entire document for each document in the documents list.

**Returns**:

A dictionary with the keys:
- "documents": A list of documents that were successfully updated with the extracted metadata.
- "failed_documents": A list of documents for which metadata extraction failed. These documents have
"metadata_extraction_error" and "metadata_extraction_response" in their metadata and can be re-run with
another extractor to extract the missing metadata.
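
Because failed documents keep both the raw LLM response and the error message in their metadata, they can be
routed through a second extractor whose prompt shows the model its previous attempt. A minimal sketch that reuses
the `chat_generator`, `extractor`, and `docs` from the usage example above; the corrective prompt wording is an
assumption:

```python
# Hypothetical follow-up prompt that confronts the LLM with its earlier, invalid output.
RETRY_PROMPT = """
The previous attempt to extract metadata from this text failed.
Previous response: {{ document.meta.metadata_extraction_response }}
Error: {{ document.meta.metadata_extraction_error }}
Text: {{ document.content }}
Return only a valid JSON object with an "entities" key.
"""

retry_extractor = LLMMetadataExtractor(
    prompt=RETRY_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
)
retry_extractor.warm_up()

first_pass = extractor.run(documents=docs)
# Only the documents that failed the first pass are sent through the retry extractor.
second_pass = retry_extractor.run(documents=first_pass["failed_documents"])
```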

<a id="named_entity_extractor"></a>

## Module named\_entity\_extractor

<a id="named_entity_extractor.NamedEntityExtractorBackend"></a>

### NamedEntityExtractorBackend

NLP backend to use for Named Entity Recognition.

<a id="named_entity_extractor.NamedEntityExtractorBackend.HUGGING_FACE"></a>

#### HUGGING\_FACE

Uses a Hugging Face model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.SPACY"></a>

#### SPACY

Uses a spaCy model and pipeline.

<a id="named_entity_extractor.NamedEntityExtractorBackend.from_str"></a>

#### NamedEntityExtractorBackend.from\_str

```python
@staticmethod
def from_str(string: str) -> "NamedEntityExtractorBackend"
```

Convert a string to a NamedEntityExtractorBackend enum.

<a id="named_entity_extractor.NamedEntityAnnotation"></a>

### NamedEntityAnnotation

Describes a single NER annotation.

**Arguments**:

- `entity`: Entity label.
- `start`: Start index of the entity in the document.
- `end`: End index of the entity in the document.
- `score`: Score calculated by the model.

<a id="named_entity_extractor.NamedEntityExtractor"></a>

### NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The
former can be used with any sequence classification model from the
[Hugging Face model hub](https://huggingface.co/models), while the
latter can be used with any [spaCy model](https://spacy.io/models)
that contains an NER component. Annotations are stored as metadata
in the documents.

Usage example:

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```

<a id="named_entity_extractor.NamedEntityExtractor.__init__"></a>

#### NamedEntityExtractor.\_\_init\_\_

```python
def __init__(
    *,
    backend: str | NamedEntityExtractorBackend,
    model: str,
    pipeline_kwargs: dict[str, Any] | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"],
                                               strict=False)
) -> None
```

Create a Named Entity extractor component.

**Arguments**:

- `backend`: Backend to use for NER.
- `model`: Name of the model or a path to the model on
the local disk. Dependent on the backend.
- `pipeline_kwargs`: Keyword arguments passed to the pipeline. The
pipeline can override these arguments. Dependent on the backend.
- `device`: The device on which the model is loaded. If `None`,
the default device is automatically selected. If a
device/device map is specified in `pipeline_kwargs`,
it overrides this parameter (only applicable to the
HuggingFace backend).
- `token`: The API token to download private models from Hugging Face.

<a id="named_entity_extractor.NamedEntityExtractor.warm_up"></a>

#### NamedEntityExtractor.warm\_up

```python
def warm_up()
```

Initialize the component.

**Raises**:

- `ComponentError`: If the backend fails to initialize successfully.
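
The extractor works the same way with the spaCy backend, and each stored annotation carries character offsets
that can be used to recover the annotated text. A minimal sketch, assuming the `en_core_web_sm` spaCy pipeline
is installed locally (for example via `python -m spacy download en_core_web_sm`):

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import (
    NamedEntityExtractor,
    NamedEntityExtractorBackend,
)

extractor = NamedEntityExtractor(
    backend=NamedEntityExtractorBackend.SPACY,  # the string form "spacy" is also accepted
    model="en_core_web_sm",                     # assumed: spaCy pipeline installed locally
)
extractor.warm_up()  # raises ComponentError if the spaCy pipeline cannot be loaded

documents = [Document(content="My name is Clara and I live in Berkeley, California.")]
results = extractor.run(documents=documents)["documents"]

for doc in results:
    for annotation in NamedEntityExtractor.get_stored_annotations(doc) or []:
        # `start` and `end` are character offsets into `doc.content`.
        print(doc.content[annotation.start:annotation.end], "->", annotation.entity)
```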

<a id="named_entity_extractor.NamedEntityExtractor.run"></a>

#### NamedEntityExtractor.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Arguments**:

- `documents`: Documents to process.
- `batch_size`: Batch size used for processing the documents.

**Raises**:

- `ComponentError`: If the backend fails to process a document.

**Returns**:

Processed documents.

<a id="named_entity_extractor.NamedEntityExtractor.to_dict"></a>

#### NamedEntityExtractor.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="named_entity_extractor.NamedEntityExtractor.from_dict"></a>

#### NamedEntityExtractor.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "NamedEntityExtractor"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="named_entity_extractor.NamedEntityExtractor.initialized"></a>

#### NamedEntityExtractor.initialized

```python
@property
def initialized() -> bool
```

Returns whether the extractor is ready to annotate text.

<a id="named_entity_extractor.NamedEntityExtractor.get_stored_annotations"></a>

#### NamedEntityExtractor.get\_stored\_annotations

```python
@classmethod
def get_stored_annotations(
        cls, document: Document) -> list[NamedEntityAnnotation] | None
```

Returns the document's named entity annotations stored in its metadata, if any.

**Arguments**:

- `document`: Document whose annotations are to be fetched.

**Returns**:

The stored annotations.

<a id="regex_text_extractor"></a>

## Module regex\_text\_extractor

<a id="regex_text_extractor.RegexTextExtractor"></a>

### RegexTextExtractor

Extracts text from a chat message or string input using a regex pattern.

RegexTextExtractor parses input text or ChatMessages using a provided regular expression pattern.
It can be configured to search through all messages or only the last message in a list of ChatMessages.

### Usage example

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

# Using with a string
parser = RegexTextExtractor(regex_pattern='<issue url="(.+)">')
result = parser.run(text_or_messages='<issue url="github.com/hahahaha">hahahah</issue>')
# result: {"captured_text": "github.com/hahahaha"}

# Using with ChatMessages
messages = [ChatMessage.from_user('<issue url="github.com/hahahaha">hahahah</issue>')]
result = parser.run(text_or_messages=messages)
# result: {"captured_text": "github.com/hahahaha"}
```

<a id="regex_text_extractor.RegexTextExtractor.__init__"></a>

#### RegexTextExtractor.\_\_init\_\_

```python
def __init__(regex_pattern: str)
```

Creates an instance of the RegexTextExtractor component.

**Arguments**:

- `regex_pattern`: The regular expression pattern used to extract text.
The pattern should include a capture group to extract the desired text.
Example: `'<issue url="(.+)">'` captures `'github.com/hahahaha'` from `'<issue url="github.com/hahahaha">'`.

<a id="regex_text_extractor.RegexTextExtractor.run"></a>

#### RegexTextExtractor.run

```python
@component.output_types(captured_text=str)
def run(text_or_messages: str | list[ChatMessage]) -> dict[str, str]
```

Extracts text from the input using the configured regex pattern.

**Arguments**:

- `text_or_messages`: Either a string or a list of ChatMessage objects to search through.

**Raises**:

- `ValueError`: If the input is a list and its last element is not a ChatMessage instance.

**Returns**:

- `{"captured_text": "matched text"}` if a match is found
- `{"captured_text": ""}` if no match is found
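
Because a non-matching input yields an empty `captured_text` rather than an error, downstream code can simply
branch on the result. A brief illustrative sketch; the pattern and message content are arbitrary:

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

extractor = RegexTextExtractor(regex_pattern=r'<issue url="(.+)">')

messages = [ChatMessage.from_assistant("This reply contains no issue link.")]
captured = extractor.run(text_or_messages=messages)["captured_text"]

if not captured:
    # An empty string means the pattern produced no match.
    print("No issue URL found")
```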