---
title: "Extractors"
id: extractors-api
description: "Components to extract specific elements from textual data."
slug: "/extractors-api"
---

## image/llm_document_content_extractor

### LLMDocumentContentExtractor

Extracts textual content and optionally metadata from image-based documents using a vision-enabled LLM.

The component makes one LLM call per document, using a single prompt. It converts each document to an image via
DocumentToImageContent and sends it to the ChatGenerator. The prompt must not contain Jinja variables.

Response handling:

- If the LLM returns a **plain string** (non-JSON or not a JSON object), it is written to the document's content.
- If the LLM returns a **JSON object with only the key** `document_content`, that value is written to content.
- If the LLM returns a **JSON object with multiple keys**, the value of `document_content` (if present) is
  written to content and all other keys are merged into the document's metadata.

The ChatGenerator can be configured to return JSON (e.g. `response_format={"type": "json_object"}`
in `generation_kwargs`).

Documents that fail extraction are returned in `failed_documents` with `content_extraction_error` in their metadata.

### Usage example

```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

prompt = """
Extract the content from the provided image.
Format everything as markdown. Return only the extracted content as a JSON object with the key 'document_content'.
No markdown, no code fence, only raw JSON.

Extract metadata about the image like source of the image, date of creation, etc. if you can.
Return this metadata as additional key-value pairs in the same JSON object.
"""

# The JSON response format is configured on the ChatGenerator, not on the extractor.
chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "document_content": {"type": "string"},
                        "author": {"type": "string"},
                        "date": {"type": "string"},
                        "document_type": {"type": "string"},
                        "title": {"type": "string"},
                    },
                    "additionalProperties": False,
                },
            },
        }
    }
)
extractor = LLMDocumentContentExtractor(chat_generator=chat_generator, prompt=prompt)
documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]
result = extractor.run(documents=documents)
updated_documents = result["documents"]
```

#### __init__

```python
__init__(
    *,
    chat_generator: ChatGenerator,
    prompt: str = DEFAULT_PROMPT_TEMPLATE,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3
)
```

Initialize the LLMDocumentContentExtractor component.

**Parameters:**

- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator that supports vision input. Optionally configured for JSON
  (e.g. `response_format={"type": "json_object"}` in `generation_kwargs`).
- **prompt** (<code>str</code>) – Prompt for extraction. Must not contain Jinja variables.
- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata are resolved relative to this path. If None, file paths are treated as absolute paths.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within (width, height) while keeping the aspect ratio.
- **raise_on_failure** (<code>bool</code>) – If True, exceptions from the LLM are raised. If False, failed documents are returned.
- **max_workers** (<code>int</code>) – Maximum number of threads for parallel LLM calls.

#### warm_up

```python
warm_up()
```

Warm up the ChatGenerator if it has a warm_up method.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMDocumentContentExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMDocumentContentExtractor</code> – An instance of the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Run extraction on image-based documents. One LLM call per document.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of image-based documents to process. Each must have a valid file path in its metadata.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with "documents" (successfully processed) and "failed_documents" (with failure metadata).

## llm_metadata_extractor

### LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).
The metadata is extracted by providing a prompt to an LLM that generates it.

This component expects a list of documents and a prompt as input. The prompt should have a variable called
`document` that points to a single document in the list, so you can use `{{ document.content }}` in the prompt
to access the document's content.

The component runs the LLM on each document in the list and adds the extracted metadata to the document's
metadata field. If the LLM fails to extract metadata from a document, that document is added to the
`failed_documents` list with the keys `metadata_extraction_error` and `metadata_extraction_response` in its
metadata. These documents can be re-run with another extractor that uses `metadata_extraction_response` and
`metadata_extraction_error` in its prompt.

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library")
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_completion_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "entities": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "entity": {"type": "string"},
                                    "entity_type": {"type": "string"}
                                },
                                "required": ["entity", "entity_type"],
                                "additionalProperties": False
                            }
                        }
                    },
                    "required": ["entities"],
                    "additionalProperties": False
                }
            }
        },
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
    meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
           {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
    meta: {'entities': [
           {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
           {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
        ]})
    ]
    'failed_documents': []
   }
>>
```

#### __init__

```python
__init__(
    prompt: str,
    chat_generator: ChatGenerator,
    expected_keys: list[str] | None = None,
    page_range: list[str | int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3,
)
```

Initializes the LLMMetadataExtractor.

**Parameters:**

- **prompt** (<code>str</code>) – The prompt to be used for the LLM.
- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator instance which represents the LLM. For the component to work,
  the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, you
  should pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- **expected_keys** (<code>list\[str\] | None</code>) – The keys expected in the JSON output from the LLM.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.
  ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
  If None, metadata will be extracted from the entire document for each document in the documents list.
  This parameter is optional and can be overridden in the `run` method.
- **raise_on_failure** (<code>bool</code>) – Whether to raise an error on failure during the execution of the Generator or
  validation of the JSON output.
- **max_workers** (<code>int</code>) – The maximum number of workers to use in the thread pool executor.

#### warm_up

```python
warm_up()
```

Warm up the LLM provider component.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMMetadataExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMMetadataExtractor</code> – An instance of the component.

#### run

```python
run(documents: list[Document], page_range: list[str | int] | None = None)
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, the component splits the documents into pages and extracts metadata only from the
specified range of pages. If `page_range` is not provided, metadata is extracted from the entire document.

The original documents are returned updated with the extracted metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of documents to extract metadata from.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.
  ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
  If None, metadata will be extracted from the entire document for each document in the documents list.

**Returns:**

- – A dictionary with the keys:
  - "documents": A list of documents that were successfully updated with the extracted metadata.
  - "failed_documents": A list of documents that failed to extract metadata. These documents will have
    "metadata_extraction_error" and "metadata_extraction_response" in their metadata. These documents can be
    re-run with the extractor to extract metadata.

## named_entity_extractor

### NamedEntityExtractorBackend

Bases: <code>Enum</code>

NLP backend to use for Named Entity Recognition.
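The string-to-enum conversion that `from_str` (below) performs can be pictured with a minimal, self-contained sketch. The `Backend` class and its member names here are hypothetical stand-ins for the real `NamedEntityExtractorBackend`; only the two documented backends (Hugging Face and spaCy) are assumed:

```python
from enum import Enum


class Backend(Enum):
    # Hypothetical stand-in for NamedEntityExtractorBackend, covering the
    # two backends the component documents.
    HUGGING_FACE = "hugging_face"
    SPACY = "spacy"

    @classmethod
    def from_str(cls, string: str) -> "Backend":
        # Look the member up by its string value; reject unknown backends.
        try:
            return cls(string)
        except ValueError as err:
            raise ValueError(f"Unknown NER backend '{string}'") from err


print(Backend.from_str("spacy"))  # Backend.SPACY
```

The same pattern lets a config file carry a plain string while the component works with a typed enum internally.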
#### from_str

```python
from_str(string: str) -> NamedEntityExtractorBackend
```

Convert a string to a NamedEntityExtractorBackend enum.

### NamedEntityAnnotation

Describes a single NER annotation.

**Parameters:**

- **entity** (<code>str</code>) – Entity label.
- **start** (<code>int</code>) – Start index of the entity in the document.
- **end** (<code>int</code>) – End index of the entity in the document.
- **score** (<code>float | None</code>) – Score calculated by the model.

### NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The
former can be used with any sequence classification model from the
[Hugging Face model hub](https://huggingface.co/models), while the
latter can be used with any [spaCy model](https://spacy.io/models)
that contains an NER component. Annotations are stored as metadata
in the documents.

Usage example:

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()  # initialize the backend before running outside a pipeline
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```

#### __init__

```python
__init__(
    *,
    backend: str | NamedEntityExtractorBackend,
    model: str,
    pipeline_kwargs: dict[str, Any] | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    )
) -> None
```

Create a Named Entity extractor component.

**Parameters:**

- **backend** (<code>str | NamedEntityExtractorBackend</code>) – Backend to use for NER.
- **model** (<code>str</code>) – Name of the model or a path to the model on
  the local disk. Dependent on the backend.
- **pipeline_kwargs** (<code>dict\[str, Any\] | None</code>) – Keyword arguments passed to the pipeline. The
  pipeline can override these arguments. Dependent on the backend.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`,
  the default device is automatically selected. If a
  device/device map is specified in `pipeline_kwargs`,
  it overrides this parameter (only applicable to the
  HuggingFace backend).
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.

#### warm_up

```python
warm_up()
```

Initialize the component.

**Raises:**

- <code>ComponentError</code> – If the backend fails to initialize successfully.

#### run

```python
run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to process.
- **batch_size** (<code>int</code>) – Batch size used for processing the documents.

**Returns:**

- <code>dict\[str, Any\]</code> – Processed documents.

**Raises:**

- <code>ComponentError</code> – If the backend fails to process a document.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> NamedEntityExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>NamedEntityExtractor</code> – Deserialized component.

#### initialized

```python
initialized: bool
```

Returns whether the extractor is ready to annotate text.

#### get_stored_annotations

```python
get_stored_annotations(
    document: Document,
) -> list[NamedEntityAnnotation] | None
```

Returns the document's named entity annotations stored in its metadata, if any.

**Parameters:**

- **document** (<code>Document</code>) – Document whose annotations are to be fetched.

**Returns:**

- <code>list\[NamedEntityAnnotation\] | None</code> – The stored annotations.
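Annotations carry offsets rather than the matched text itself, so the entity's surface form is recovered by slicing the document content. A minimal sketch, using a plain dataclass as a stand-in for NamedEntityAnnotation and assuming `start`/`end` are character indices into the content (which the parameter descriptions above suggest):

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    # Stand-in for NamedEntityAnnotation: a label plus character offsets.
    entity: str
    start: int
    end: int


content = "My name is Clara and I live in Berkeley, California."
ann = Annotation(entity="PER", start=11, end=16)

# Slice the document content with the annotation's offsets to get the surface form.
surface = content[ann.start:ann.end]
print(surface)  # Clara
```

Storing offsets instead of copied substrings keeps the annotations compact and lets downstream code highlight entities in the original text.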
## regex_text_extractor

### RegexTextExtractor

Extracts text from chat message or string input using a regex pattern.

RegexTextExtractor parses input text or ChatMessages using a provided regular expression pattern.
It can be configured to search through all messages or only the last message in a list of ChatMessages.

### Usage example

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

# Using with a string
extractor = RegexTextExtractor(regex_pattern='<issue url="(.+)">')
result = extractor.run(text_or_messages='<issue url="github.com/hahahaha">hahahah</issue>')
# result: {"captured_text": "github.com/hahahaha"}

# Using with ChatMessages
messages = [ChatMessage.from_user('<issue url="github.com/hahahaha">hahahah</issue>')]
result = extractor.run(text_or_messages=messages)
# result: {"captured_text": "github.com/hahahaha"}
```

#### __init__

```python
__init__(regex_pattern: str)
```

Creates an instance of the RegexTextExtractor component.

**Parameters:**

- **regex_pattern** (<code>str</code>) – The regular expression pattern used to extract text.
  The pattern should include a capture group to extract the desired text.
  Example: `'<issue url="(.+)">'` captures `'github.com/hahahaha'` from `'<issue url="github.com/hahahaha">'`.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> RegexTextExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>RegexTextExtractor</code> – The deserialized component.

#### run

```python
run(text_or_messages: str | list[ChatMessage]) -> dict[str, str]
```

Extracts text from input using the configured regex pattern.

**Parameters:**

- **text_or_messages** (<code>str | list\[ChatMessage\]</code>) – Either a string or a list of ChatMessage objects to search through.

**Returns:**

- <code>dict\[str, str\]</code> – - `{"captured_text": "matched text"}` if a match is found
  - `{"captured_text": ""}` if no match is found

**Raises:**

- <code>TypeError</code> – If the input is a list and its last element is not a ChatMessage instance.
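The capture behavior described above can be tried out with Python's `re` module before handing a pattern to RegexTextExtractor. A sketch of the assumed semantics (first capture group of the first match, empty string when nothing matches); the `capture` helper is illustrative, not part of the component's API:

```python
import re

pattern = r'<issue url="(.+)">'


def capture(text: str) -> str:
    # Return the first capture group of the first match, or "" on no match,
    # mirroring the documented {"captured_text": ...} output shape.
    match = re.search(pattern, text)
    return match.group(1) if match else ""


print(capture('<issue url="github.com/hahahaha">hahahah</issue>'))  # github.com/hahahaha
print(repr(capture("no issue tag here")))  # ''
```

This is a quick way to verify that a pattern's capture group grabs exactly the text you want before wiring it into a pipeline.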