---
title: "Extractors"
id: extractors-api
description: "Components to extract specific elements from textual data."
slug: "/extractors-api"
---

## image/llm_document_content_extractor

### LLMDocumentContentExtractor

Extracts textual content and optionally metadata from image-based documents using a vision-enabled LLM.

One prompt and one LLM call per document. The component converts each document to an image via
DocumentToImageContent and sends it to the ChatGenerator. The prompt must not contain Jinja variables.

Response handling:

- If the LLM returns a **plain string** (non-JSON or not a JSON object), it is written to the document's content.
- If the LLM returns a **JSON object with only the key** `document_content`, that value is written to content.
- If the LLM returns a **JSON object with multiple keys**, the value of `document_content` (if present) is
  written to content and all other keys are merged into the document's metadata.

The ChatGenerator can be configured to return JSON (e.g. `response_format={"type": "json_object"}`
in `generation_kwargs`).

Documents that fail extraction are returned in `failed_documents` with `content_extraction_error` in metadata.
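As an illustration of the multi-key case, here is a hypothetical reply and the resulting document fields (sketch only; every key other than `document_content` is invented for this example):

```python
# Hypothetical JSON reply from the vision LLM for one document:
reply = {
    "document_content": "# Quarterly report\n\nRevenue grew 12% ...",
    "author": "ACME Corp",
    "date": "2024-05-01",
}

# "document_content" becomes the document's content; the remaining keys are merged
# into the document's metadata, so afterwards roughly:
#   doc.content == reply["document_content"]
#   doc.meta["author"] == "ACME Corp"
#   doc.meta["date"] == "2024-05-01"
```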
### Usage example

```python
from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

prompt = """
Extract the content from the provided image.
Format everything as markdown. Return only the extracted content as a JSON object with the key 'document_content'.
No markdown, no code fence, only raw JSON.

Extract metadata about the image like source of the image, date of creation, etc. if you can.
Return this metadata as additional key-value pairs in the same JSON object.
"""

# Configure the ChatGenerator to return a JSON object whose schema mirrors the keys
# requested in the prompt above.
chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "document_content": {"type": "string"},
                        "author": {"type": "string"},
                        "date": {"type": "string"},
                        "document_type": {"type": "string"},
                        "title": {"type": "string"},
                    },
                    "additionalProperties": False,
                },
            },
        }
    }
)

extractor = LLMDocumentContentExtractor(
    chat_generator=chat_generator,
    prompt=prompt,
)

documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]
result = extractor.run(documents=documents)
updated_documents = result["documents"]
```

#### __init__

```python
__init__(
    *,
    chat_generator: ChatGenerator,
    prompt: str = DEFAULT_PROMPT_TEMPLATE,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3
) -> None
```

Initialize the LLMDocumentContentExtractor component.

**Parameters:**

- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator that supports vision input. Optionally configured for JSON
  (e.g. `response_format={"type": "json_object"}` in `generation_kwargs`).
- **prompt** (<code>str</code>) – Prompt for extraction. Must not contain Jinja variables.
- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- **detail** (<code>Literal['auto', 'high', 'low'] | None</code>) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low".
- **size** (<code>tuple\[int, int\] | None</code>) – If provided, resizes the image to fit within (width, height) while keeping aspect ratio.
- **raise_on_failure** (<code>bool</code>) – If True, exceptions from the LLM are raised. If False, failed documents are returned.
- **max_workers** (<code>int</code>) – Maximum number of threads for parallel LLM calls.

#### warm_up

```python
warm_up() -> None
```

Warm up the ChatGenerator if it has a warm_up method.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMDocumentContentExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMDocumentContentExtractor</code> – An instance of the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Run extraction on image-based documents. One LLM call per document.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of image-based documents to process. Each must have a valid file path in its metadata.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with "documents" (successfully processed) and "failed_documents" (with failure metadata).
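Continuing the usage example above, failed documents can be inspected through the documented `content_extraction_error` metadata key (a minimal sketch):

```python
result = extractor.run(documents=documents)
updated_documents = result["documents"]

# Documents that could not be processed are returned separately; the error is
# recorded under the "content_extraction_error" metadata key.
for failed in result["failed_documents"]:
    print(failed.meta.get("file_path"), failed.meta.get("content_extraction_error"))
```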
## llm_metadata_extractor

### LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).

The metadata is extracted by providing a prompt to an LLM that generates the metadata.

This component expects as input a list of documents and a prompt. The prompt should have a variable called
`document` that will point to a single document in the list of documents. So to access the content of the document,
you can use `{{ document.content }}` in the prompt.

The component will run the LLM on each document in the list and extract metadata from the document. The metadata
will be added to the document's metadata field. If the LLM fails to extract metadata from a document, the document
will be added to the `failed_documents` list. The failed documents will have the keys `metadata_extraction_error` and
`metadata_extraction_response` in their metadata. These documents can be re-run with another extractor to
extract metadata by using the `metadata_extraction_response` and `metadata_extraction_error` in the prompt.

```python
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library")
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_completion_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "entity_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "entities": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "entity": {"type": "string"},
                                    "entity_type": {"type": "string"}
                                },
                                "required": ["entity", "entity_type"],
                                "additionalProperties": False
                            }
                        }
                    },
                    "required": ["entities"],
                    "additionalProperties": False
                }
            }
        },
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.run(documents=docs)
# >> {'documents': [
#      Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
#        meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
#               {'entity': 'Haystack', 'entity_type': 'product'}]}),
#      Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
#        meta: {'entities': [
#               {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
#               {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
#        ]})
#      ],
#     'failed_documents': []
#    }
```

#### __init__

```python
__init__(
    prompt: str,
    chat_generator: ChatGenerator,
    expected_keys: list[str] | None = None,
    page_range: list[str | int] | None = None,
    raise_on_failure: bool = False,
    max_workers: int = 3,
) -> None
```

Initializes the LLMMetadataExtractor.

**Parameters:**

- **prompt** (<code>str</code>) – The prompt to be used for the LLM.
- **chat_generator** (<code>ChatGenerator</code>) – A ChatGenerator instance which represents the LLM. In order for the component to work,
  the LLM should be configured to return a JSON object. For example, when using the OpenAIChatGenerator, you
  should pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
- **expected_keys** (<code>list\[str\] | None</code>) – The keys expected in the JSON output from the LLM.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
  metadata from the first and third pages of each document. It also accepts printable range strings, e.g.:
  ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
  If None, metadata will be extracted from the entire document for each document in the documents list.
  This parameter is optional and can be overridden in the `run` method.
- **raise_on_failure** (<code>bool</code>) – Whether to raise an error on failure during the execution of the Generator or
  validation of the JSON output.
- **max_workers** (<code>int</code>) – The maximum number of workers to use in the thread pool executor.

#### warm_up

```python
warm_up() -> None
```

Warm up the LLM provider component.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMMetadataExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary with serialized data.

**Returns:**

- <code>LLMMetadataExtractor</code> – An instance of the component.

#### run

```python
run(
    documents: list[Document], page_range: list[str | int] | None = None
) -> dict[str, Any]
```

Extract metadata from documents using a Large Language Model.

If `page_range` is provided, the metadata will be extracted from the specified range of pages. This component
will split the documents into pages and extract metadata from the specified range of pages. The metadata will be
extracted from the entire document if `page_range` is not provided.

The original documents will be returned updated with the extracted metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of documents to extract metadata from.
- **page_range** (<code>list\[str | int\] | None</code>) – A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract
  metadata from the first and third pages of each document. It also accepts printable range
  strings, e.g.: ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12.
  If None, metadata will be extracted from the entire document for each document in the
  documents list.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the keys:
  - "documents": A list of documents that were successfully updated with the extracted metadata.
  - "failed_documents": A list of documents that failed to extract metadata. These documents will have
    "metadata_extraction_error" and "metadata_extraction_response" in their metadata. These documents can be
    re-run with the extractor to extract metadata.
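Continuing the usage example above, `page_range` can also be supplied at run time to restrict extraction to specific pages (a minimal sketch; it accepts page numbers and printable range strings as described above):

```python
# Extract metadata only from pages 1-3 and 5 of each document; other pages are ignored.
result = extractor.run(documents=docs, page_range=["1-3", "5"])
```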
## named_entity_extractor

### NamedEntityExtractorBackend

Bases: <code>Enum</code>

NLP backend to use for Named Entity Recognition.

#### from_str

```python
from_str(string: str) -> NamedEntityExtractorBackend
```

Convert a string to a NamedEntityExtractorBackend enum.
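The backend can be given either as a plain string or as the enum; a minimal sketch using `from_str` (the `"hugging_face"` value and model name match the NamedEntityExtractor usage example further below):

```python
from haystack.components.extractors.named_entity_extractor import (
    NamedEntityExtractor,
    NamedEntityExtractorBackend,
)

# Resolve the backend from a string, for example when it comes from a config file.
backend = NamedEntityExtractorBackend.from_str("hugging_face")
extractor = NamedEntityExtractor(backend=backend, model="dslim/bert-base-NER")
```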
### NamedEntityAnnotation

Describes a single NER annotation.

**Parameters:**

- **entity** (<code>str</code>) – Entity label.
- **start** (<code>int</code>) – Start index of the entity in the document.
- **end** (<code>int</code>) – End index of the entity in the document.
- **score** (<code>float | None</code>) – Score calculated by the model.
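Since `start` and `end` are character offsets into the document's content, the annotated surface text can be recovered by slicing. A minimal sketch using the NamedEntityExtractor described below (the labels and scores depend on the chosen model):

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()

doc = Document(content="My name is Clara and I live in Berkeley, California.")
annotated = extractor.run(documents=[doc])["documents"][0]

# Each annotation exposes the entity label, character offsets, and an optional score.
for ann in NamedEntityExtractor.get_stored_annotations(annotated) or []:
    print(ann.entity, annotated.content[ann.start : ann.end], ann.score)
```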
### NamedEntityExtractor

Annotates named entities in a collection of documents.

The component supports two backends: Hugging Face and spaCy. The
former can be used with any sequence classification model from the
[Hugging Face model hub](https://huggingface.co/models), while the
latter can be used with any [spaCy model](https://spacy.io/models)
that contains an NER component. Annotations are stored as metadata
in the documents.

Usage example:

```python
from haystack import Document
from haystack.components.extractors.named_entity_extractor import NamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]
extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
extractor.warm_up()
results = extractor.run(documents=documents)["documents"]
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```

#### __init__

```python
__init__(
    *,
    backend: str | NamedEntityExtractorBackend,
    model: str,
    pipeline_kwargs: dict[str, Any] | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    )
) -> None
```

Create a Named Entity extractor component.

**Parameters:**

- **backend** (<code>str | NamedEntityExtractorBackend</code>) – Backend to use for NER.
- **model** (<code>str</code>) – Name of the model or a path to the model on
  the local disk. Dependent on the backend.
- **pipeline_kwargs** (<code>dict\[str, Any\] | None</code>) – Keyword arguments passed to the pipeline. The
  pipeline can override these arguments. Dependent on the backend.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`,
  the default device is automatically selected. If a
  device/device map is specified in `pipeline_kwargs`,
  it overrides this parameter (only applicable to the
  HuggingFace backend).
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.

#### warm_up

```python
warm_up() -> None
```

Initialize the component.

**Raises:**

- <code>ComponentError</code> – If the backend fails to initialize successfully.

#### run

```python
run(documents: list[Document], batch_size: int = 1) -> dict[str, Any]
```

Annotate named entities in each document and store the annotations in the document's metadata.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to process.
- **batch_size** (<code>int</code>) – Batch size used for processing the documents.
**Returns:**

- <code>dict\[str, Any\]</code> – Processed documents.

**Raises:**

- <code>ComponentError</code> – If the backend fails to process a document.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> NamedEntityExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>NamedEntityExtractor</code> – Deserialized component.

#### initialized

```python
initialized: bool
```

Returns whether the extractor is ready to annotate text.

#### get_stored_annotations

```python
get_stored_annotations(
    document: Document,
) -> list[NamedEntityAnnotation] | None
```

Returns the document's named entity annotations stored in its metadata, if any.

**Parameters:**

- **document** (<code>Document</code>) – Document whose annotations are to be fetched.

**Returns:**

- <code>list\[NamedEntityAnnotation\] | None</code> – The stored annotations.

## regex_text_extractor

### RegexTextExtractor

Extracts text from chat messages or string input using a regex pattern.

RegexTextExtractor parses input text or ChatMessages using a provided regular expression pattern.
It can be configured to search through all messages or only the last message in a list of ChatMessages.

### Usage example

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

# Using with a string
parser = RegexTextExtractor(regex_pattern='<issue url="(.+)">')
result = parser.run(text_or_messages='<issue url="github.com/hahahaha">hahahah</issue>')
# result: {"captured_text": "github.com/hahahaha"}

# Using with ChatMessages
messages = [ChatMessage.from_user('<issue url="github.com/hahahaha">hahahah</issue>')]
result = parser.run(text_or_messages=messages)
# result: {"captured_text": "github.com/hahahaha"}
```

#### __init__

```python
__init__(regex_pattern: str) -> None
```

Creates an instance of the RegexTextExtractor component.

**Parameters:**

- **regex_pattern** (<code>str</code>) – The regular expression pattern used to extract text.
  The pattern should include a capture group to extract the desired text.
  Example: `'<issue url="(.+)">'` captures `'github.com/hahahaha'` from `'<issue url="github.com/hahahaha">'`.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> RegexTextExtractor
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>RegexTextExtractor</code> – The deserialized component.
#### run

```python
run(text_or_messages: str | list[ChatMessage]) -> dict[str, str]
```

Extracts text from the input using the configured regex pattern.

**Parameters:**

- **text_or_messages** (<code>str | list\[ChatMessage\]</code>) – Either a string or a list of ChatMessage objects to search through.

**Returns:**

- <code>dict\[str, str\]</code> – `{"captured_text": "matched text"}` if a match is found,
  or `{"captured_text": ""}` if no match is found.

**Raises:**

- <code>TypeError</code> – If the input is a list and its last element is not a ChatMessage instance.
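When the pattern does not match, the component returns an empty string rather than raising. A minimal sketch of the documented no-match behavior:

```python
from haystack.components.extractors import RegexTextExtractor

extractor = RegexTextExtractor(regex_pattern=r'<issue url="(.+)">')

# No match for the capture group: run returns an empty "captured_text".
result = extractor.run(text_or_messages="plain text without the expected tag")
assert result == {"captured_text": ""}
```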